1. 16 May, 2020 2 commits
  2. 14 May, 2020 7 commits
    • Satya Tangirala's avatar
      block: blk-crypto-fallback for Inline Encryption · 488f6682
      Satya Tangirala authored
      Blk-crypto delegates crypto operations to inline encryption hardware
      when available. The separately configurable blk-crypto-fallback contains
      a software fallback to the kernel crypto API - when enabled, blk-crypto
      will use this fallback for en/decryption when inline encryption hardware
      is not available.
      
      This lets upper layers not have to worry about whether or not the
      underlying device has support for inline encryption before deciding to
      specify an encryption context for a bio. It also allows for testing
      without actual inline encryption hardware - in particular, it makes it
      possible to test the inline encryption code in ext4 and f2fs simply by
      running xfstests with the inlinecrypt mount option, which in turn allows
      for things like the regular upstream regression testing of ext4 to cover
      the inline encryption code paths.
      
      For more details, refer to Documentation/block/inline-encryption.rst.
      Signed-off-by: default avatarSatya Tangirala <satyat@google.com>
      Reviewed-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      488f6682
    • Satya Tangirala's avatar
      block: Make blk-integrity preclude hardware inline encryption · d145dc23
      Satya Tangirala authored
      Whenever a device supports blk-integrity, make the kernel pretend that
      the device doesn't support inline encryption (essentially by setting the
      keyslot manager in the request queue to NULL).
      
      There's no hardware currently that supports both integrity and inline
      encryption. However, it seems possible that there will be such hardware
      in the near future (like the NVMe key per I/O support that might support
      both inline encryption and PI).
      
      But properly integrating both features is not trivial, and without
      real hardware that implements both, it is difficult to tell if it will
      be done correctly by the majority of hardware that support both.
      So it seems best not to support both features together right now, and
      to decide what to do at probe time.
      Signed-off-by: default avatarSatya Tangirala <satyat@google.com>
      Reviewed-by: default avatarEric Biggers <ebiggers@google.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d145dc23
    • Satya Tangirala's avatar
      block: Inline encryption support for blk-mq · a892c8d5
      Satya Tangirala authored
      We must have some way of letting a storage device driver know what
      encryption context it should use for en/decrypting a request. However,
      it's the upper layers (like the filesystem/fscrypt) that know about and
      manages encryption contexts. As such, when the upper layer submits a bio
      to the block layer, and this bio eventually reaches a device driver with
      support for inline encryption, the device driver will need to have been
      told the encryption context for that bio.
      
      We want to communicate the encryption context from the upper layer to the
      storage device along with the bio, when the bio is submitted to the block
      layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
      represent an encryption context (note that we can't use the bi_private
      field in struct bio to do this because that field does not function to pass
      information across layers in the storage stack). We also introduce various
      functions to manipulate the bio_crypt_ctx and make the bio/request merging
      logic aware of the bio_crypt_ctx.
      
      We also make changes to blk-mq to make it handle bios with encryption
      contexts. blk-mq can merge many bios into the same request. These bios need
      to have contiguous data unit numbers (the necessary changes to blk-merge
      are also made to ensure this) - as such, it suffices to keep the data unit
      number of just the first bio, since that's all a storage driver needs to
      infer the data unit number to use for each data block in each bio in a
      request. blk-mq keeps track of the encryption context to be used for all
      the bios in a request with the request's rq_crypt_ctx. When the first bio
      is added to an empty request, blk-mq will program the encryption context
      of that bio into the request_queue's keyslot manager, and store the
      returned keyslot in the request's rq_crypt_ctx. All the functions to
      operate on encryption contexts are in blk-crypto.c.
      
      Upper layers only need to call bio_crypt_set_ctx with the encryption key,
      algorithm and data_unit_num; they don't have to worry about getting a
      keyslot for each encryption context, as blk-mq/blk-crypto handles that.
      Blk-crypto also makes it possible for request-based layered devices like
      dm-rq to make use of inline encryption hardware by cloning the
      rq_crypt_ctx and programming a keyslot in the new request_queue when
      necessary.
      
      Note that any user of the block layer can submit bios with an
      encryption context, such as filesystems, device-mapper targets, etc.
      Signed-off-by: default avatarSatya Tangirala <satyat@google.com>
      Reviewed-by: default avatarEric Biggers <ebiggers@google.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a892c8d5
    • Satya Tangirala's avatar
      block: Keyslot Manager for Inline Encryption · 1b262839
      Satya Tangirala authored
      Inline Encryption hardware allows software to specify an encryption context
      (an encryption key, crypto algorithm, data unit num, data unit size) along
      with a data transfer request to a storage device, and the inline encryption
      hardware will use that context to en/decrypt the data. The inline
      encryption hardware is part of the storage device, and it conceptually sits
      on the data path between system memory and the storage device.
      
      Inline Encryption hardware implementations often function around the
      concept of "keyslots". These implementations often have a limited number
      of "keyslots", each of which can hold a key (we say that a key can be
      "programmed" into a keyslot). Requests made to the storage device may have
      a keyslot and a data unit number associated with them, and the inline
      encryption hardware will en/decrypt the data in the requests using the key
      programmed into that associated keyslot and the data unit number specified
      with the request.
      
      As keyslots are limited, and programming keys may be expensive in many
      implementations, and multiple requests may use exactly the same encryption
      contexts, we introduce a Keyslot Manager to efficiently manage keyslots.
      
      We also introduce a blk_crypto_key, which will represent the key that's
      programmed into keyslots managed by keyslot managers. The keyslot manager
      also functions as the interface that upper layers will use to program keys
      into inline encryption hardware. For more information on the Keyslot
      Manager, refer to documentation found in block/keyslot-manager.c and
      linux/keyslot-manager.h.
      Co-developed-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarSatya Tangirala <satyat@google.com>
      Reviewed-by: default avatarEric Biggers <ebiggers@google.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1b262839
    • Satya Tangirala's avatar
      Documentation: Document the blk-crypto framework · 54b259f6
      Satya Tangirala authored
      The blk-crypto framework adds support for inline encryption. There are
      numerous changes throughout the storage stack. This patch documents the
      main design choices in the block layer, the API presented to users of
      the block layer (like fscrypt or layered devices) and the API presented
      to drivers for adding support for inline encryption.
      Signed-off-by: default avatarSatya Tangirala <satyat@google.com>
      Reviewed-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      54b259f6
    • Tejun Heo's avatar
      iocost: don't let vrate run wild while there's no saturation signal · 81ca627a
      Tejun Heo authored
      When the QoS targets are met and nothing is being throttled, there's
      no way to tell how saturated the underlying device is - it could be
      almost entirely idle, at the cusp of saturation or anywhere inbetween.
      Given that there's no information, it's best to keep vrate as-is in
      this state.  Before 7cd806a9 ("iocost: improve nr_lagging
      handling"), this was the case - if the device isn't missing QoS
      targets and nothing is being throttled, busy_level was reset to zero.
      
      While fixing nr_lagging handling, 7cd806a9 ("iocost: improve
      nr_lagging handling") broke this.  Now, while the device is hitting
      QoS targets and nothing is being throttled, vrate keeps getting
      adjusted according to the existing busy_level.
      
      This led to vrate keeping climing till it hits max when there's an IO
      issuer with limited request concurrency if the vrate started low.
      vrate starts getting adjusted upwards until the issuer can issue IOs
      w/o being throttled.  From then on, QoS targets keeps getting met and
      nothing on the system needs throttling and vrate keeps getting
      increased due to the existing busy_level.
      
      This patch makes the following changes to the busy_level logic.
      
      * Reset busy_level if nr_shortages is zero to avoid the above
        scenario.
      
      * Make non-zero nr_lagging block lowering nr_level but still clear
        positive busy_level if there's clear non-saturation signal - QoS
        targets are met and nr_shortages is non-zero.  nr_lagging's role is
        preventing adjusting vrate upwards while there are long-running
        commands and it shouldn't keep busy_level positive while there's
        clear non-saturation signal.
      
      * Restructure code for clarity and add comments.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarAndy Newell <newella@fb.com>
      Fixes: 7cd806a9 ("iocost: improve nr_lagging handling")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      81ca627a
    • Ming Lei's avatar
      block: move blk_io_schedule() out of header file · 71ac860a
      Ming Lei authored
      blk_io_schedule() isn't called from performance sensitive code path, and
      it is easier to maintain by exporting it as symbol.
      
      Also blk_io_schedule() is only called by CONFIG_BLOCK code, so it is safe
      to do this way. Meantime fixes build failure when CONFIG_BLOCK is off.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Fixes: e6249cdd ("block: add blk_io_schedule() for avoiding task hung in sync dio")
      Reported-by: default avatarSatya Tangirala <satyat@google.com>
      Tested-by: default avatarSatya Tangirala <satyat@google.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      71ac860a
  3. 13 May, 2020 16 commits
  4. 11 May, 2020 1 commit
  5. 09 May, 2020 14 commits
    • Christoph Hellwig's avatar
      hfs: stop using ioctl_by_bdev · af00423a
      Christoph Hellwig authored
      Instead just call the CDROM layer functionality directly.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      af00423a
    • Christoph Hellwig's avatar
      bdi: remove the name field in struct backing_dev_info · 1cd925d5
      Christoph Hellwig authored
      The name is only printed for a not registered bdi in writeback.  Use the
      device name there as is more useful anyway for the unlike case that the
      warning triggers.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1cd925d5
    • Christoph Hellwig's avatar
      bdi: simplify bdi_alloc · aef33c2f
      Christoph Hellwig authored
      Merge the _node vs normal version and drop the superflous gfp_t argument.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      aef33c2f
    • Christoph Hellwig's avatar
      bdi: remove bdi_register_owner · 3c5d202b
      Christoph Hellwig authored
      Split out a new bdi_set_owner helper to set the owner, and move the policy
      for creating the bdi name back into genhd.c, where it belongs.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3c5d202b
    • Christoph Hellwig's avatar
      bdi: unexport bdi_register_va · a5a6c66d
      Christoph Hellwig authored
      bdi_register_va is only used by super.c, which can't be modular.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a5a6c66d
    • Christoph Hellwig's avatar
      driver core: remove device_create_vargs · 4c747466
      Christoph Hellwig authored
      All external users of device_create_vargs are gone, so remove it and
      open code it in the only caller.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4c747466
    • Weiping Zhang's avatar
      block: rename blk_mq_alloc_rq_maps · 79fab528
      Weiping Zhang authored
      rename blk_mq_alloc_rq_maps to blk_mq_alloc_map_and_requests,
      this function allocs both map and request, make function name align
      with funtion.
      Signed-off-by: default avatarWeiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      79fab528
    • Weiping Zhang's avatar
      block: rename __blk_mq_alloc_rq_map · 03b63b02
      Weiping Zhang authored
      rename __blk_mq_alloc_rq_map to __blk_mq_alloc_map_and_request,
      actually it alloc both map and request, make function name
      align with function.
      Signed-off-by: default avatarWeiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      03b63b02
    • Ming Lei's avatar
      block: alloc map and request for new hardware queue · fd689871
      Ming Lei authored
      Alloc new map and request for new hardware queue when increse
      hardware queue count. Before this patch, it will show a
      warning for each new hardware queue, but it's not enough, these
      hctx have no maps and reqeust, when a bio was mapped to these
      hardware queue, it will trigger kernel panic when get request
      from these hctx.
      
      Test environment:
       * A NVMe disk supports 128 io queues
       * 96 cpus in system
      
      A corner case can always trigger this panic, there are 96
      io queues allocated for HCTX_TYPE_DEFAULT type, the corresponding kernel
      log: nvme nvme0: 96/0/0 default/read/poll queues. Now we set nvme write
      queues to 96, then nvme will alloc others(32) queues for read, but
      blk_mq_update_nr_hw_queues does not alloc map and request for these new
      added io queues. So when process read nvme disk, it will trigger kernel
      panic when get request from these hardware context.
      
      Reproduce script:
      
      nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
      echo $nr > /sys/module/nvme/parameters/write_queues
      echo 1 > /sys/block/nvme0n1/device/reset_controller
      dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1
      
      [ 8040.805626] ------------[ cut here ]------------
      [ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
      [ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f
      ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne
      tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_
      cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d
      rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
      [ 8040.805637]  ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
      [ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G        W         5.6.0-rc5.78317c+ #2
      [ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
      [ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
      [ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
      [ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
      [ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
      [ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
      [ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
      [ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
      [ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
      [ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
      [ 8040.805645] FS:  0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
      [ 8040.805646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
      [ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 8040.805647] PKRU: 55555554
      [ 8040.805647] Call Trace:
      [ 8040.805649]  blk_mq_update_nr_hw_queues+0x31b/0x390
      [ 8040.805650]  nvme_reset_work+0xb4b/0xeab [nvme]
      [ 8040.805651]  process_one_work+0x1a7/0x370
      [ 8040.805652]  worker_thread+0x1c9/0x380
      [ 8040.805653]  ? max_active_store+0x80/0x80
      [ 8040.805655]  kthread+0x112/0x130
      [ 8040.805656]  ? __kthread_parkme+0x70/0x70
      [ 8040.805657]  ret_from_fork+0x35/0x40
      [ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
      [ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
      [ 8229.365165] #PF: supervisor read access in kernel mode
      [ 8229.365178] #PF: error_code(0x0000) - not-present page
      [ 8229.365191] PGD 0 P4D 0
      [ 8229.365201] Oops: 0000 [#1] SMP PTI
      [ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G        W         5.6.0-rc5.78317c+ #2
      [ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
      [ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
      [ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
      [ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
      [ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
      [ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
      [ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
      [ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
      [ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
      [ 8229.365397] FS:  00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
      [ 8229.365415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
      [ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 8229.365476] PKRU: 55555554
      [ 8229.365484] Call Trace:
      [ 8229.365498]  ? finish_wait+0x80/0x80
      [ 8229.365512]  blk_mq_get_request+0xcb/0x3f0
      [ 8229.365525]  blk_mq_make_request+0x143/0x5d0
      [ 8229.365538]  generic_make_request+0xcf/0x310
      [ 8229.365553]  ? scan_shadow_nodes+0x30/0x30
      [ 8229.365564]  submit_bio+0x3c/0x150
      [ 8229.365576]  mpage_readpages+0x163/0x1a0
      [ 8229.365588]  ? blkdev_direct_IO+0x490/0x490
      [ 8229.365601]  read_pages+0x6b/0x190
      [ 8229.365612]  __do_page_cache_readahead+0x1c1/0x1e0
      [ 8229.365626]  ondemand_readahead+0x182/0x2f0
      [ 8229.365639]  generic_file_buffered_read+0x590/0xab0
      [ 8229.365655]  new_sync_read+0x12a/0x1c0
      [ 8229.365666]  vfs_read+0x8a/0x140
      [ 8229.365676]  ksys_read+0x59/0xd0
      [ 8229.365688]  do_syscall_64+0x55/0x1d0
      [ 8229.365700]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarWeiping Zhang <zhangweiping@didiglobal.com>
      Tested-by: default avatarWeiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fd689871
    • Weiping Zhang's avatar
      block: save previous hardware queue count before udpate · a2584e43
      Weiping Zhang authored
      blk_mq_realloc_tag_set_tags will update set->nr_hw_queues, so
      save old set->nr_hw_queues before call this function.
      Signed-off-by: default avatarWeiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a2584e43
    • Weiping Zhang's avatar
      block: free both rq_map and request · 2e194422
      Weiping Zhang authored
      Allocation:
      
      __blk_mq_alloc_rq_map
      	blk_mq_alloc_rq_map
      		blk_mq_alloc_rq_map
      			tags = blk_mq_init_tags : kzalloc_node:
      			tags->rqs = kcalloc_node
      			tags->static_rqs = kcalloc_node
      	blk_mq_alloc_rqs
      		p = alloc_pages_node
      		tags->static_rqs[i] = p + offset;
      
      Free:
      
      blk_mq_free_rq_map
      	kfree(tags->rqs);
      	kfree(tags->static_rqs);
      	blk_mq_free_tags
      		kfree(tags);
      
      The page allocated in blk_mq_alloc_rqs cannot be released,
      so we should use blk_mq_free_map_and_requests here.
      
      blk_mq_free_map_and_requests
      	blk_mq_free_rqs
      		__free_pages : cleanup for blk_mq_alloc_rqs
      	blk_mq_free_rq_map : cleanup for blk_mq_alloc_rq_map
      Signed-off-by: default avatarWeiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2e194422
    • Jens Axboe's avatar
      Merge branch 'block-5.7' into for-5.8/block · 873f1c8d
      Jens Axboe authored
      Pull in block-5.7 fixes for 5.8. Mostly to resolve a conflict with
      the blk-iocost changes, but we also need the base of the bdi
      use-after-free as well as we build on top of it.
      
      * block-5.7:
        nvme: fix possible hang when ns scanning fails during error recovery
        nvme-pci: fix "slimmer CQ head update"
        bdi: add a ->dev_name field to struct backing_dev_info
        bdi: use bdi_dev_name() to get device name
        bdi: move bdi_dev_name out of line
        vboxsf: don't use the source name in the bdi name
        iocost: protect iocg->abs_vdebt with iocg->waitq.lock
        block: remove the bd_openers checks in blk_drop_partitions
        nvme: prevent double free in nvme_alloc_ns() error handling
        null_blk: Cleanup zoned device initialization
        null_blk: Fix zoned command handling
        block: remove unused header
        blk-iocost: Fix error on iocost_ioc_vrate_adj
        bdev: Reduce time holding bd_mutex in sync in blkdev_close()
        buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      873f1c8d
    • Sagi Grimberg's avatar
      nvme: fix possible hang when ns scanning fails during error recovery · 59c7c3ca
      Sagi Grimberg authored
      When the controller is reconnecting, the host fails I/O and admin
      commands as the host cannot reach the controller. ns scanning may
      revalidate namespaces during that period and it is wrong to remove
      namespaces due to these failures as we may hang (see 205da243).
      
      One command that may fail is nvme_identify_ns_descs. Since we return
      success due to having ns identify descriptor list optional, we continue
      to compare ns identifiers in nvme_revalidate_disk, obviously fail and
      return -ENODEV to nvme_validate_ns, which will remove the namespace.
      
      Exactly what we don't want to happen.
      
      Fixes: 22802bf7 ("nvme: Namepace identification descriptor list is optional")
      Tested-by: default avatarAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      59c7c3ca
    • Alexey Dobriyan's avatar
      nvme-pci: fix "slimmer CQ head update" · a8de6639
      Alexey Dobriyan authored
      Pre-incrementing ->cq_head can't be done in memory because OOB value
      can be observed by another context.
      
      This devalues space savings compared to original code :-\
      
      	$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
      	add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-32 (-32)
      	Function                                     old     new   delta
      	nvme_poll_irqdisable                         464     456      -8
      	nvme_poll                                    455     447      -8
      	nvme_irq                                     388     380      -8
      	nvme_dev_disable                             955     947      -8
      
      But the code is minimal now: one read for head, one read for q_depth,
      one increment, one comparison, single instruction phase bit update and
      one write for new head.
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Reported-by: default avatarJohn Garry <john.garry@huawei.com>
      Tested-by: default avatarJohn Garry <john.garry@huawei.com>
      Fixes: e2a366a4 ("nvme-pci: slimmer CQ head update")
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a8de6639