1. 28 Jun, 2019 21 commits
    • Coly Li's avatar
      bcache: destroy dc->writeback_write_wq if failed to create dc->writeback_thread · f54d801d
      Coly Li authored
      Commit 9baf3097 ("bcache: fix for gc and write-back race") added a
      new work queue dc->writeback_write_wq, but forgot to destroy it in the
      error condition when creating dc->writeback_thread failed.
      
      This patch destroys dc->writeback_write_wq if kthread_create() returns
      error pointer to dc->writeback_thread, then a memory leak is avoided.
      
      Fixes: 9baf3097 ("bcache: fix for gc and write-back race")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f54d801d
    • Coly Li's avatar
      bcache: fix mistaken sysfs entry for io_error counter · 54619998
      Coly Li authored
      In bch_cached_dev_files[] from driver/md/bcache/sysfs.c, sysfs_errors is
      incorrectly inserted in. The correct entry should be sysfs_io_errors.
      
      This patch fixes the problem and now I/O errors of cached device can be
      read from /sys/block/bcache<N>/bcache/io_errors.
      
      Fixes: c7b7bd07 ("bcache: add io_disable to struct cached_dev")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      54619998
    • Coly Li's avatar
      bcache: add pendings_cleanup to stop pending bcache device · 0c277e21
      Coly Li authored
      If a bcache device is in dirty state and its cache set is not
      registered, this bcache device will not appear in /dev/bcache<N>,
      and there is no way to stop it or remove the bcache kernel module.
      
      This is an as-designed behavior, but sometimes people has to reboot
      whole system to release or stop the pending backing device.
      
      This sysfs interface may remove such pending bcache devices when
      write anything into the sysfs file manually.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0c277e21
    • Coly Li's avatar
      bcache: make bset_search_tree() be more understandable · 944a4f34
      Coly Li authored
      The purpose of following code in bset_search_tree() is to avoid a branch
      instruction,
       994         if (likely(f->exponent != 127))
       995                 n = j * 2 + (((unsigned int)
       996                               (f->mantissa -
       997                                bfloat_mantissa(search, f))) >> 31);
       998         else
       999                 n = (bkey_cmp(tree_to_bkey(t, j), search) > 0)
      1000                         ? j * 2
      1001                         : j * 2 + 1;
      
      This piece of code is not very clear to understand, even when I tried to
      add code comment for it, I made mistake. This patch removes the implict
      bit operation and uses explicit branch to calculate next location in
      binary tree search.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      944a4f34
    • Coly Li's avatar
      bcache: remove "XXX:" comment line from run_cache_set() · 68a53c95
      Coly Li authored
      In previous bcache patches for Linux v5.2, the failure code path of
      run_cache_set() is tested and fixed. So now the following comment
      line can be removed from run_cache_set(),
      	/* XXX: test this, it's broken */
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      68a53c95
    • Coly Li's avatar
      bcache: improve error message in bch_cached_dev_run() · e0faa3d7
      Coly Li authored
      This patch adds more error message in bch_cached_dev_run() to indicate
      the exact reason why an error value is returned. Please notice when
      printing out the "is running already" message, pr_info() is used here,
      because in this case also -EBUSY is returned, the bcache device can
      continue to attach to the cache devince and run, so it won't be an
      error level message in kernel message.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e0faa3d7
    • Coly Li's avatar
      bcache: add more error message in bch_cached_dev_attach() · 633bb2ce
      Coly Li authored
      This patch adds more error message for attaching cached device, this is
      helpful to debug code failure during bache device start up.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      633bb2ce
    • Coly Li's avatar
      bcache: more detailed error message to bcache_device_link() · 4b6efb4b
      Coly Li authored
      This patch adds more accurate error message for specific
      ssyfs_create_link() call, to help debugging failure during
      bcache device start tup.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4b6efb4b
    • Coly Li's avatar
      bcache: check CACHE_SET_IO_DISABLE bit in bch_journal() · 383ff218
      Coly Li authored
      When too many I/O errors happen on cache set and CACHE_SET_IO_DISABLE
      bit is set, bch_journal() may continue to work because the journaling
      bkey might be still in write set yet. The caller of bch_journal() may
      believe the journal still work but the truth is in-memory journal write
      set won't be written into cache device any more. This behavior may
      introduce potential inconsistent metadata status.
      
      This patch checks CACHE_SET_IO_DISABLE bit at the head of bch_journal(),
      if the bit is set, bch_journal() returns NULL immediately to notice
      caller to know journal does not work.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      383ff218
    • Coly Li's avatar
      bcache: check CACHE_SET_IO_DISABLE in allocator code · e775339e
      Coly Li authored
      If CACHE_SET_IO_DISABLE of a cache set flag is set by too many I/O
      errors, currently allocator routines can still continue allocate
      space which may introduce inconsistent metadata state.
      
      This patch checkes CACHE_SET_IO_DISABLE bit in following allocator
      routines,
      - bch_bucket_alloc()
      - __bch_bucket_alloc_set()
      Once CACHE_SET_IO_DISABLE is set on cache set, the allocator routines
      may reject allocation request earlier to avoid potential inconsistent
      metadata.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e775339e
    • Coly Li's avatar
      bcache: remove unncessary code in bch_btree_keys_init() · bd9026c8
      Coly Li authored
      Function bch_btree_keys_init() initializes b->set[].size and
      b->set[].data to zero. As the code comments indicates, these code indeed
      is unncessary, because both struct btree_keys and struct bset_tree are
      nested embedded into struct btree, when struct btree is filled with 0
      bits by kzalloc() in mca_bucket_alloc(), b->set[].size and
      b->set[].data are initialized to 0 (a.k.a NULL) already.
      
      This patch removes the redundant code, and add comments in
      bch_btree_keys_init() and mca_bucket_alloc() to explain why it's safe.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bd9026c8
    • Coly Li's avatar
      bcache: add return value check to bch_cached_dev_run() · 0b13efec
      Coly Li authored
      This patch adds return value check to bch_cached_dev_run(), now if there
      is error happens inside bch_cached_dev_run(), it can be catched.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0b13efec
    • Alexandru Ardelean's avatar
      bcache: use sysfs_match_string() instead of __sysfs_match_string() · 89e0341a
      Alexandru Ardelean authored
      The arrays (of strings) that are passed to __sysfs_match_string() are
      static, so use sysfs_match_string() which does an implicit ARRAY_SIZE()
      over these arrays.
      
      Functionally, this doesn't change anything.
      The change is more cosmetic.
      
      It only shrinks the static arrays by 1 byte each.
      Signed-off-by: default avatarAlexandru Ardelean <alexandru.ardelean@analog.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      89e0341a
    • Coly Li's avatar
      bcache: remove unnecessary prefetch() in bset_search_tree() · f960facb
      Coly Li authored
      In function bset_search_tree(), when p >= t->size, t->tree[0] will be
      prefetched by the following code piece,
       974                 unsigned int p = n << 4;
       975
       976                 p &= ((int) (p - t->size)) >> 31;
       977
       978                 prefetch(&t->tree[p]);
      
      The purpose of the above code is to avoid a branch instruction, but
      when p >= t->size, prefetch(&t->tree[0]) has no positive performance
      contribution at all. This patch avoids the unncessary prefetch by only
      calling prefetch() when p < t->size.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f960facb
    • Coly Li's avatar
      bcache: add io error counting in write_bdev_super_endio() · 08ec1e62
      Coly Li authored
      When backing device super block is written by bch_write_bdev_super(),
      the bio complete callback write_bdev_super_endio() simply ignores I/O
      status. Indeed such write request also contribute to backing device
      health status if the request failed.
      
      This patch checkes bio->bi_status in write_bdev_super_endio(), if there
      is error, bch_count_backing_io_errors() will be called to count an I/O
      error to dc->io_errors.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      08ec1e62
    • Coly Li's avatar
      bcache: ignore read-ahead request failure on backing device · 578df99b
      Coly Li authored
      When md raid device (e.g. raid456) is used as backing device, read-ahead
      requests on a degrading and recovering md raid device might be failured
      immediately by md raid code, but indeed this md raid array can still be
      read or write for normal I/O requests. Therefore such failed read-ahead
      request are not real hardware failure. Further more, after degrading and
      recovering accomplished, read-ahead requests will be handled by md raid
      array again.
      
      For such condition, I/O failures of read-ahead requests don't indicate
      real health status (because normal I/O still be served), they should not
      be counted into I/O error counter dc->io_errors.
      
      Since there is no simple way to detect whether the backing divice is a
      md raid device, this patch simply ignores I/O failures for read-ahead
      bios on backing device, to avoid bogus backing device failure on a
      degrading md raid array.
      Suggested-and-tested-by: default avatarThorsten Knabe <linux@thorsten-knabe.de>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      578df99b
    • Coly Li's avatar
      bcache: avoid flushing btree node in cache_set_flush() if io disabled · e6dcbd3e
      Coly Li authored
      When cache_set_flush() is called for too many I/O errors detected on
      cache device and the cache set is retiring, inside the function it
      doesn't make sense to flushing cached btree nodes from c->btree_cache
      because CACHE_SET_IO_DISABLE is set on c->flags already and all I/Os
      onto cache device will be rejected.
      
      This patch checks in cache_set_flush() that whether CACHE_SET_IO_DISABLE
      is set. If yes, then avoids to flush the cached btree nodes to reduce
      more time and make cache set retiring more faster.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e6dcbd3e
    • Coly Li's avatar
      Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()" · 695277f1
      Coly Li authored
      This reverts commit 6147305c.
      
      Although this patch helps the failed bcache device to stop faster when
      too many I/O errors detected on corresponding cached device, setting
      CACHE_SET_IO_DISABLE bit to cache set c->flags was not a good idea. This
      operation will disable all I/Os on cache set, which means other attached
      bcache devices won't work neither.
      
      Without this patch, the failed bcache device can also be stopped
      eventually if internal I/O accomplished (e.g. writeback). Therefore here
      I revert it.
      
      Fixes: 6147305c ("bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()")
      Reported-by: default avatarYong Li <mr.liyong@qq.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      695277f1
    • Coly Li's avatar
      bcache: fix return value error in bch_journal_read() · 0ae49cb7
      Coly Li authored
      When everything is OK in bch_journal_read(), finally the return value
      is returned by,
      	return ret;
      which assumes ret will be 0 here. This assumption is wrong when all
      journal buckets as are full and filled with valid journal entries. In
      such cache the last location referencess read_bucket() sets 'ret' to
      1, which means new jset added into jset list. The jset list is list
      'journal' in caller run_cache_set().
      
      Return 1 to run_cache_set() means something wrong and the cache set
      won't start, but indeed everything is OK.
      
      This patch changes the line at end of bch_journal_read() to directly
      return 0 since everything if verything is good. Then a bogus error
      is fixed.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0ae49cb7
    • Coly Li's avatar
      bcache: check c->gc_thread by IS_ERR_OR_NULL in cache_set_flush() · b387e9b5
      Coly Li authored
      When system memory is in heavy pressure, bch_gc_thread_start() from
      run_cache_set() may fail due to out of memory. In such condition,
      c->gc_thread is assigned to -ENOMEM, not NULL pointer. Then in following
      failure code path bch_cache_set_error(), when cache_set_flush() gets
      called, the code piece to stop c->gc_thread is broken,
               if (!IS_ERR_OR_NULL(c->gc_thread))
                       kthread_stop(c->gc_thread);
      
      And KASAN catches such NULL pointer deference problem, with the warning
      information:
      
      [  561.207881] ==================================================================
      [  561.207900] BUG: KASAN: null-ptr-deref in kthread_stop+0x3b/0x440
      [  561.207904] Write of size 4 at addr 000000000000001c by task kworker/15:1/313
      
      [  561.207913] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G        W         5.0.0-vanilla+ #3
      [  561.207916] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
      [  561.207935] Workqueue: events cache_set_flush [bcache]
      [  561.207940] Call Trace:
      [  561.207948]  dump_stack+0x9a/0xeb
      [  561.207955]  ? kthread_stop+0x3b/0x440
      [  561.207960]  ? kthread_stop+0x3b/0x440
      [  561.207965]  kasan_report+0x176/0x192
      [  561.207973]  ? kthread_stop+0x3b/0x440
      [  561.207981]  kthread_stop+0x3b/0x440
      [  561.207995]  cache_set_flush+0xd4/0x6d0 [bcache]
      [  561.208008]  process_one_work+0x856/0x1620
      [  561.208015]  ? find_held_lock+0x39/0x1d0
      [  561.208028]  ? drain_workqueue+0x380/0x380
      [  561.208048]  worker_thread+0x87/0xb80
      [  561.208058]  ? __kthread_parkme+0xb6/0x180
      [  561.208067]  ? process_one_work+0x1620/0x1620
      [  561.208072]  kthread+0x326/0x3e0
      [  561.208079]  ? kthread_create_worker_on_cpu+0xc0/0xc0
      [  561.208090]  ret_from_fork+0x3a/0x50
      [  561.208110] ==================================================================
      [  561.208113] Disabling lock debugging due to kernel taint
      [  561.208115] irq event stamp: 11800231
      [  561.208126] hardirqs last  enabled at (11800231): [<ffffffff83008538>] do_syscall_64+0x18/0x410
      [  561.208127] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      [  561.208129] #PF error: [WRITE]
      [  561.312253] hardirqs last disabled at (11800230): [<ffffffff830052ff>] trace_hardirqs_off_thunk+0x1a/0x1c
      [  561.312259] softirqs last  enabled at (11799832): [<ffffffff850005c7>] __do_softirq+0x5c7/0x8c3
      [  561.405975] PGD 0 P4D 0
      [  561.442494] softirqs last disabled at (11799821): [<ffffffff831add2c>] irq_exit+0x1ac/0x1e0
      [  561.791359] Oops: 0002 [#1] SMP KASAN NOPTI
      [  561.791362] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G    B   W         5.0.0-vanilla+ #3
      [  561.791363] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
      [  561.791371] Workqueue: events cache_set_flush [bcache]
      [  561.791374] RIP: 0010:kthread_stop+0x3b/0x440
      [  561.791376] Code: 00 00 65 8b 05 26 d5 e0 7c 89 c0 48 0f a3 05 ec aa df 02 0f 82 dc 02 00 00 4c 8d 63 20 be 04 00 00 00 4c 89 e7 e8 65 c5 53 00 <f0> ff 43 20 48 8d 7b 24 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
      [  561.791377] RSP: 0018:ffff88872fc8fd10 EFLAGS: 00010286
      [  561.838895] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838916] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838934] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838948] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838966] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838979] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838996] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  563.067028] RAX: 0000000000000000 RBX: fffffffffffffffc RCX: ffffffff832dd314
      [  563.067030] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000297
      [  563.067032] RBP: ffff88872fc8fe88 R08: fffffbfff0b8213d R09: fffffbfff0b8213d
      [  563.067034] R10: 0000000000000001 R11: fffffbfff0b8213c R12: 000000000000001c
      [  563.408618] R13: ffff88dc61cc0f68 R14: ffff888102b94900 R15: ffff88dc61cc0f68
      [  563.408620] FS:  0000000000000000(0000) GS:ffff888f7dc00000(0000) knlGS:0000000000000000
      [  563.408622] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  563.408623] CR2: 000000000000001c CR3: 0000000f48a1a004 CR4: 00000000007606e0
      [  563.408625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  563.408627] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  563.904795] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  563.915796] PKRU: 55555554
      [  563.915797] Call Trace:
      [  563.915807]  cache_set_flush+0xd4/0x6d0 [bcache]
      [  563.915812]  process_one_work+0x856/0x1620
      [  564.001226] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.033563]  ? find_held_lock+0x39/0x1d0
      [  564.033567]  ? drain_workqueue+0x380/0x380
      [  564.033574]  worker_thread+0x87/0xb80
      [  564.062823] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.118042]  ? __kthread_parkme+0xb6/0x180
      [  564.118046]  ? process_one_work+0x1620/0x1620
      [  564.118048]  kthread+0x326/0x3e0
      [  564.118050]  ? kthread_create_worker_on_cpu+0xc0/0xc0
      [  564.167066] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.252441]  ret_from_fork+0x3a/0x50
      [  564.252447] Modules linked in: msr rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib i40iw configfs iw_cm ib_cm libiscsi scsi_transport_iscsi mlx4_ib ib_uverbs mlx4_en ib_core nls_iso8859_1 nls_cp437 vfat fat intel_rapl skx_edac x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ses raid0 aesni_intel cdc_ether enclosure usbnet ipmi_ssif joydev aes_x86_64 i40e scsi_transport_sas mii bcache md_mod crypto_simd mei_me ioatdma crc64 ptp cryptd pcspkr i2c_i801 mlx4_core glue_helper pps_core mei lpc_ich dca wmi ipmi_si ipmi_devintf nd_pmem dax_pmem nd_btt ipmi_msghandler device_dax pcc_cpufreq button hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect xhci_pci sysimgblt fb_sys_fops xhci_hcd ttm megaraid_sas drm usbcore nfit libnvdimm sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
      [  564.299390] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.348360] CR2: 000000000000001c
      [  564.348362] ---[ end trace b7f0e5cc7b2103b0 ]---
      
      Therefore, it is not enough to only check whether c->gc_thread is NULL,
      we should use IS_ERR_OR_NULL() to check both NULL pointer and error
      value.
      
      This patch changes the above buggy code piece in this way,
               if (!IS_ERR_OR_NULL(c->gc_thread))
                       kthread_stop(c->gc_thread);
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b387e9b5
    • Coly Li's avatar
      bcache: don't set max writeback rate if gc is running · 141df8bb
      Coly Li authored
      When gc is running, user space I/O processes may wait inside
      bcache code, so no new I/O coming. Indeed this is not a real idle
      time, maximum writeback rate should not be set in such situation.
      Otherwise a faster writeback thread may compete locks with gc thread
      and makes garbage collection slower, which results a longer I/O
      freeze period.
      
      This patch checks c->gc_mark_valid in set_at_max_writeback_rate(). If
      c->gc_mark_valid is 0 (gc running), set_at_max_writeback_rate() returns
      false, then update_writeback_rate() will not set writeback rate to
      maximum value even c->idle_counter reaches an idle threshold.
      
      Now writeback thread won't interfere gc thread performance.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      141df8bb
  2. 27 Jun, 2019 1 commit
  3. 26 Jun, 2019 3 commits
  4. 25 Jun, 2019 7 commits
    • Paolo Valente's avatar
      block, bfq: re-schedule empty queues if they deserve I/O plugging · 3726112e
      Paolo Valente authored
      Consider, on one side, a bfq_queue Q that remains empty while in
      service, and, on the other side, the pending I/O of bfq_queues that,
      according to their timestamps, have to be served after Q.  If an
      uncontrolled amount of I/O from the latter bfq_queues were dispatched
      while Q is waiting for its new I/O to arrive, then Q's bandwidth
      guarantees would be violated. To prevent this, I/O dispatch is plugged
      until Q receives new I/O (except for a properly controlled amount of
      injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging,
      for the following reason.
      
      Preemption is performed in two steps. First, Q is expired and
      re-scheduled. Second, the new bfq_queue to serve is chosen. The first
      step is needed by the second, as the second can be performed only
      after Q's timestamps have been properly updated (done in the
      expiration step), and Q has been re-queued for service. This
      dependency is a consequence of the way how BFQ's scheduling algorithm
      is currently implemented.
      
      But Q is not re-scheduled at all in the first step, because Q is
      empty. As a consequence, an uncontrolled amount of I/O may be
      dispatched until Q becomes non empty again. This breaks Q's service
      guarantees.
      
      This commit addresses this issue by re-scheduling Q even if it is
      empty. This in turn breaks the assumption that all scheduled queues
      are non empty. Then a few extra checks are now needed.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3726112e
    • Paolo Valente's avatar
      block, bfq: preempt lower-weight or lower-priority queues · 96a291c3
      Paolo Valente authored
      BFQ enqueues the I/O coming from each process into a separate
      bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
      served for at most timeout_sync milliseconds (default: 125 ms). This
      service scheme is prone to the following inaccuracy.
      
      While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
      receive I/O, and, according to BFQ's scheduling policy, may become the
      right bfq_queue to serve, in place of the currently in-service
      bfq_queue. In this respect, postponing the service of Q2 to after the
      service of Q1 finishes may delay the completion of Q2's I/O, compared
      with an ideal service in which all non-empty bfq_queues are served in
      parallel, and every non-empty bfq_queue is served at a rate
      proportional to the bfq_queue's weight. This additional delay is equal
      at most to the time Q1 may unjustly remain in service before switching
      to Q2.
      
      If Q1 and Q2 have the same weight, then this time is most likely
      negligible compared with the completion time to be guaranteed to Q2's
      I/O. In addition, first, one of the reasons why BFQ may want to serve
      Q1 for a while is that this boosts throughput and, second, serving Q1
      longer reduces BFQ's overhead. As a conclusion, it is usually better
      not to preempt Q1 if both Q1 and Q2 have the same weight.
      
      In contrast, as Q2's weight or priority becomes higher and higher
      compared with that of Q1, the above delay becomes larger and larger,
      compared with the I/O completion times that have to be guaranteed to
      Q2 according to Q2's weight. So reducing this delay may be more
      important than avoiding the costs of preempting Q1.
      
      Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
      higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
      triggers a new choice of the next bfq_queue to serve. If Q2 really is
      the next bfq_queue to serve, then Q2 will be set in service
      immediately.
      
      This change reduces the component of the I/O latency caused by the
      above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
      SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
      a process doing sporadic random reads while another process is doing
      continuous sequential reads.
      Signed-off-by: default avatarNicola Bottura <bottura.nicola95@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      96a291c3
    • Paolo Valente's avatar
      block, bfq: detect wakers and unconditionally inject their I/O · 13a857a4
      Paolo Valente authored
      A bfq_queue Q may happen to be synchronized with another
      bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
      receive new I/O. We call Q2 "waker queue".
      
      If I/O plugging is being performed for Q, and Q is not receiving any
      more I/O because of the above synchronization, then, thanks to BFQ's
      injection mechanism, the waker queue is likely to get served before
      the I/O-plugging timeout fires.
      
      Unfortunately, this fact may not be sufficient to guarantee a high
      throughput during the I/O plugging, because the inject limit for Q may
      be too low to guarantee a lot of injected I/O. In addition, the
      duration of the plugging, i.e., the time before Q finally receives new
      I/O, may not be minimized, because the waker queue may happen to be
      served only after other queues.
      
      To address these issues, this commit introduces the explicit detection
      of the waker queue, and the unconditional injection of a pending I/O
      request of the waker queue on each invocation of
      bfq_dispatch_request().
      
      One may be concerned that this systematic injection of I/O from the
      waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
      the contrary, next Q's I/O is brought forward dramatically, for it is
      not blocked for milliseconds.
      Reported-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      13a857a4
    • Paolo Valente's avatar
      block, bfq: bring forward seek&think time update · a3f9bce3
      Paolo Valente authored
      Until the base value for request service times gets finally computed
      for a bfq_queue, the inject limit for that queue does depend on the
      think-time state (short|long) of the queue. A timely update of the
      think time then guarantees a quicker activation or deactivation of the
      injection. Fortunately, the think time of a bfq_queue is updated in
      the same code path as the inject limit; but after the inject limit.
      
      This commits moves the update of the think time before the update of
      the inject limit. For coherence, it moves the update of the seek time
      too.
      Reported-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a3f9bce3
    • Paolo Valente's avatar
      block, bfq: update base request service times when possible · 24792ad0
      Paolo Valente authored
      I/O injection gets reduced if it increases the request service times
      of the victim queue beyond a certain threshold.  The threshold, in its
      turn, is computed as a function of the base service time enjoyed by
      the queue when it undergoes no injection.
      
      As a consequence, for injection to work properly, the above base value
      has to be accurate. In this respect, such a value may vary over
      time. For example, it varies if the size or the spatial locality of
      the I/O requests in the queue change. It is then important to update
      this value whenever possible. This commit performs this update.
      Reported-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      24792ad0
    • Paolo Valente's avatar
      block, bfq: fix rq_in_driver check in bfq_update_inject_limit · db599f9e
      Paolo Valente authored
      One of the cases where the parameters for injection may be updated is
      when there are no more in-flight I/O requests. The number of in-flight
      requests is stored in the field bfqd->rq_in_driver of the descriptor
      bfqd of the device. So, the controlled condition is
      bfqd->rq_in_driver == 0.
      
      Unfortunately, this is wrong because, the instruction that checks this
      condition is in the code path that handles the completion of a
      request, and, in particular, the instruction is executed before
      bfqd->rq_in_driver is decremented in such a code path.
      
      This commit fixes this issue by just replacing 0 with 1 in the
      comparison.
      Reported-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      db599f9e
    • Paolo Valente's avatar
      block, bfq: reset inject limit when think-time state changes · 766d6141
      Paolo Valente authored
      Until the base value of the request service times gets finally
      computed for a bfq_queue, the inject limit does depend on the
      think-time state (short|long). The limit must be 0 or 1 if the think
      time is deemed, respectively, as short or long. However, such a check
      and possible limit update is performed only periodically, once per
      second. So, to make the injection mechanism much more reactive, this
      commit performs the update also every time the think-time state
      changes.
      
      In addition, in the following special case, this commit lets the
      inject limit of a bfq_queue bfqq remain equal to 1 even if bfqq's
      think time is short: bfqq's I/O is synchronized with that of some
      other queue, i.e., bfqq may receive new I/O only after the I/O of the
      other queue is completed. Keeping the inject limit to 1 allows the
      blocking I/O to be served while bfqq is in service. And this is very
      convenient both for bfqq and for the total throughput, as explained
      in detail in the comments in bfq_update_has_short_ttime().
      Reported-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      766d6141
  5. 24 Jun, 2019 1 commit
    • Jens Axboe's avatar
      Merge branch 'nvme-5.3' of git://git.infradead.org/nvme into for-5.3/block · 6b2c8e52
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "A large chunk of NVMe updates for 5.3.  Highlights:
      
       - improved PCIe suspent support (Keith Busch)
       - error injection support for the admin queue (Akinobu Mita)
       - Fibre Channel discovery improvements (James Smart)
       - tracing improvements including nvmetc tracing support (Minwoo Im)
       - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
         Kulkarni)"
      
      * 'nvme-5.3' of git://git.infradead.org/nvme: (26 commits)
        Documentation: nvme: add an example for nvme fault injection
        nvme: enable to inject errors into admin commands
        nvme: prepare for fault injection into admin commands
        nvmet: introduce target-side trace
        nvme-trace: print result and status in hex format
        nvme-trace: support for fabrics commands in host-side
        nvme-trace: move opcode symbol print to nvme.h
        nvme-trace: do not export nvme_trace_disk_name
        nvme-pci: clean up nvme_remove_dead_ctrl a bit
        nvme-pci: properly report state change failure in nvme_reset_work
        nvme-pci: set the errno on ctrl state change error
        nvme-pci: adjust irq max_vector using num_possible_cpus()
        nvme-pci: remove queue_count_ops for write_queues and poll_queues
        nvme-pci: remove unnecessary zero for static var
        nvme-pci: use host managed power state for suspend
        nvme: introduce nvme_is_fabrics to check fabrics cmd
        nvme: export get and set features
        nvme: fix possible io failures when removing multipathed ns
        nvme-fc: add message when creating new association
        lpfc: add sysfs interface to post NVME RSCN
        ...
      6b2c8e52
  6. 21 Jun, 2019 7 commits