1. 28 Jun, 2019 33 commits
    • Coly Li's avatar
      bcache: add comments for mutex_lock(&b->write_lock) · 41508bb7
      Coly Li authored
      When accessing or modifying BTREE_NODE_dirty bit, it is not always
      necessary to acquire b->write_lock. In bch_btree_cache_free() and
      mca_reap() acquiring b->write_lock is necessary, and this patch adds
      comments to explain why mutex_lock(&b->write_lock) is necessary for
      checking or clearing BTREE_NODE_dirty bit there.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      41508bb7
    • Coly Li's avatar
      bcache: only clear BTREE_NODE_dirty bit when it is set · e5ec5f47
      Coly Li authored
      In bch_btree_cache_free() and btree_node_free(), BTREE_NODE_dirty is
      always set no matter btree node is dirty or not. The code looks like
      this,
      	if (btree_node_dirty(b))
      		btree_complete_write(b, btree_current_write(b));
      	clear_bit(BTREE_NODE_dirty, &b->flags);
      
      Indeed if btree_node_dirty(b) returns false, it means BTREE_NODE_dirty
      bit is cleared, then it is unnecessary to clear the bit again.
      
      This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
      true (the bit is set), to save a few CPU cycles.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e5ec5f47
    • Coly Li's avatar
      bcache: Revert "bcache: fix high CPU occupancy during journal" · 249a5f6d
      Coly Li authored
      This reverts commit c4dc2497.
      
      This patch enlarges a race between normal btree flush code path and
      flush_btree_write(), which causes deadlock when journal space is
      exhausted. Reverts this patch makes the race window from 128 btree
      nodes to only 1 btree nodes.
      
      Fixes: c4dc2497 ("bcache: fix high CPU occupancy during journal")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Tang Junhui <tang.junhui.linux@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      249a5f6d
    • Coly Li's avatar
      bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free" · ba82c1ac
      Coly Li authored
      This reverts commit 6268dc2c.
      
      This patch depends on commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal") which is reverted in previous patch. So
      revert this one too.
      
      Fixes: 6268dc2c ("bcache: free heap cache_set->flush_btree in bch_journal_free")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ba82c1ac
    • Coly Li's avatar
      bcache: shrink btree node cache after bch_btree_check() · 1df3877f
      Coly Li authored
      When cache set starts, bch_btree_check() will check all bkeys on cache
      device by calculating the checksum. This operation will consume a huge
      number of system memory if there are a lot of data cached. Since bcache
      uses its own mca cache to maintain all its read-in btree nodes, and only
      releases the cache space when system memory manage code starts to shrink
      caches. Then before memory manager code to call the mca cache shrinker
      callback, bcache mca cache will compete memory resource with user space
      application, which may have nagive effect to performance of user space
      workloads (e.g. data base, or I/O service of distributed storage node).
      
      This patch tries to call bcache mca shrinker routine to proactively
      release mca cache memory, to decrease the memory pressure of system and
      avoid negative effort of the overall system I/O performance.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1df3877f
    • Coly Li's avatar
      bcache: set largest seq to ja->seq[bucket_index] in journal_read_bucket() · a231f07a
      Coly Li authored
      In journal_read_bucket() when setting ja->seq[bucket_index], there might
      be potential case that a later non-maximum overwrites a better sequence
      number to ja->seq[bucket_index]. This patch adds a check to make sure
      that ja->seq[bucket_index] will be only set a new value if it is bigger
      then current value.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a231f07a
    • Coly Li's avatar
      bcache: add code comments for journal_read_bucket() · 2464b693
      Coly Li authored
      This patch adds more code comments in journal_read_bucket(), this is an
      effort to make the code to be more understandable.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2464b693
    • Coly Li's avatar
      bcache: fix potential deadlock in cached_def_free() · 7e865eba
      Coly Li authored
      When enable lockdep and reboot system with a writeback mode bcache
      device, the following potential deadlock warning is reported by lockdep
      engine.
      
      [  101.536569][  T401] kworker/2:2/401 is trying to acquire lock:
      [  101.538575][  T401] 00000000bbf6e6c7 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
      [  101.542054][  T401]
      [  101.542054][  T401] but task is already holding lock:
      [  101.544587][  T401] 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [  101.548386][  T401]
      [  101.548386][  T401] which lock already depends on the new lock.
      [  101.548386][  T401]
      [  101.551874][  T401]
      [  101.551874][  T401] the existing dependency chain (in reverse order) is:
      [  101.555000][  T401]
      [  101.555000][  T401] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
      [  101.557860][  T401]        process_one_work+0x277/0x640
      [  101.559661][  T401]        worker_thread+0x39/0x3f0
      [  101.561340][  T401]        kthread+0x125/0x140
      [  101.562963][  T401]        ret_from_fork+0x3a/0x50
      [  101.564718][  T401]
      [  101.564718][  T401] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
      [  101.567701][  T401]        lock_acquire+0xb4/0x1c0
      [  101.569651][  T401]        flush_workqueue+0xae/0x4c0
      [  101.571494][  T401]        drain_workqueue+0xa9/0x180
      [  101.573234][  T401]        destroy_workqueue+0x17/0x250
      [  101.575109][  T401]        cached_dev_free+0x44/0x120 [bcache]
      [  101.577304][  T401]        process_one_work+0x2a4/0x640
      [  101.579357][  T401]        worker_thread+0x39/0x3f0
      [  101.581055][  T401]        kthread+0x125/0x140
      [  101.582709][  T401]        ret_from_fork+0x3a/0x50
      [  101.584592][  T401]
      [  101.584592][  T401] other info that might help us debug this:
      [  101.584592][  T401]
      [  101.588355][  T401]  Possible unsafe locking scenario:
      [  101.588355][  T401]
      [  101.590974][  T401]        CPU0                    CPU1
      [  101.592889][  T401]        ----                    ----
      [  101.594743][  T401]   lock((work_completion)(&cl->work)#2);
      [  101.596785][  T401]                                lock((wq_completion)bcache_writeback_wq);
      [  101.600072][  T401]                                lock((work_completion)(&cl->work)#2);
      [  101.602971][  T401]   lock((wq_completion)bcache_writeback_wq);
      [  101.605255][  T401]
      [  101.605255][  T401]  *** DEADLOCK ***
      [  101.605255][  T401]
      [  101.608310][  T401] 2 locks held by kworker/2:2/401:
      [  101.610208][  T401]  #0: 00000000cf2c7d17 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
      [  101.613709][  T401]  #1: 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [  101.617480][  T401]
      [  101.617480][  T401] stack backtrace:
      [  101.619539][  T401] CPU: 2 PID: 401 Comm: kworker/2:2 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
      [  101.623225][  T401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [  101.627210][  T401] Workqueue: events cached_dev_free [bcache]
      [  101.629239][  T401] Call Trace:
      [  101.630360][  T401]  dump_stack+0x85/0xcb
      [  101.631777][  T401]  print_circular_bug+0x19a/0x1f0
      [  101.633485][  T401]  __lock_acquire+0x16cd/0x1850
      [  101.635184][  T401]  ? __lock_acquire+0x6a8/0x1850
      [  101.636863][  T401]  ? lock_acquire+0xb4/0x1c0
      [  101.638421][  T401]  ? find_held_lock+0x34/0xa0
      [  101.640015][  T401]  lock_acquire+0xb4/0x1c0
      [  101.641513][  T401]  ? flush_workqueue+0x87/0x4c0
      [  101.643248][  T401]  flush_workqueue+0xae/0x4c0
      [  101.644832][  T401]  ? flush_workqueue+0x87/0x4c0
      [  101.646476][  T401]  ? drain_workqueue+0xa9/0x180
      [  101.648303][  T401]  drain_workqueue+0xa9/0x180
      [  101.649867][  T401]  destroy_workqueue+0x17/0x250
      [  101.651503][  T401]  cached_dev_free+0x44/0x120 [bcache]
      [  101.653328][  T401]  process_one_work+0x2a4/0x640
      [  101.655029][  T401]  worker_thread+0x39/0x3f0
      [  101.656693][  T401]  ? process_one_work+0x640/0x640
      [  101.658501][  T401]  kthread+0x125/0x140
      [  101.660012][  T401]  ? kthread_create_worker_on_cpu+0x70/0x70
      [  101.661985][  T401]  ret_from_fork+0x3a/0x50
      [  101.691318][  T401] bcache: bcache_device_free() bcache0 stopped
      
      Here is how the above potential deadlock may happen in reboot/shutdown
      code path,
      1) bcache_reboot() is called firstly in the reboot/shutdown code path,
         then in bcache_reboot(), bcache_device_stop() is called.
      2) bcache_device_stop() sets BCACHE_DEV_CLOSING on d->falgs, then call
         closure_queue(&d->cl) to invoke cached_dev_flush(). And in turn
         cached_dev_flush() calls cached_dev_free() via closure_at()
      3) In cached_dev_free(), after stopped writebach kthread
         dc->writeback_thread, the kwork dc->writeback_write_wq is stopping by
         destroy_workqueue().
      4) Inside destroy_workqueue(), drain_workqueue() is called. Inside
         drain_workqueue(), flush_workqueue() is called. Then wq->lockdep_map
         is acquired by lock_map_acquire() in flush_workqueue(). After the
         lock acquired the rest part of flush_workqueue() just wait for the
         workqueue to complete.
      5) Now we look back at writeback thread routine bch_writeback_thread(),
         in the main while-loop, write_dirty() is called via continue_at() in
         read_dirty_submit(), which is called via continue_at() in while-loop
         level called function read_dirty(). Inside write_dirty() it may be
         re-called on workqueeu dc->writeback_write_wq via continue_at().
         It means when the writeback kthread is stopped in cached_dev_free()
         there might be still one kworker queued on dc->writeback_write_wq
         to execute write_dirty() again.
      6) Now this kworker is scheduled on dc->writeback_write_wq to run by
         process_one_work() (which is called by worker_thread()). Before
         calling the kwork routine, wq->lockdep_map is acquired.
      7) But wq->lockdep_map is acquired already in step 4), so a A-A lock
         (lockdep terminology) scenario happens.
      
      Indeed on multiple cores syatem, the above deadlock is very rare to
      happen, just as the code comments in process_one_work() says,
      2263     * AFAICT there is no possible deadlock scenario between the
      2264     * flush_work() and complete() primitives (except for
      	   single-threaded
      2265     * workqueues), so hiding them isn't a problem.
      
      But it is still good to fix such lockdep warning, even no one running
      bcache on single core system.
      
      The fix is simple. This patch solves the above potential deadlock by,
      - Do not destroy workqueue dc->writeback_write_wq in cached_dev_free().
      - Flush and destroy dc->writeback_write_wq in writebach kthread routine
        bch_writeback_thread(), where after quit the thread main while-loop
        and before cached_dev_put() is called.
      
      By this fix, dc->writeback_write_wq will be stopped and destroy before
      the writeback kthread stopped, so the chance for a A-A locking on
      wq->lockdep_map is disappeared, such A-A deadlock won't happen
      any more.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7e865eba
    • Coly Li's avatar
      bcache: acquire bch_register_lock later in cached_dev_free() · 80265d8d
      Coly Li authored
      When enable lockdep engine, a lockdep warning can be observed when
      reboot or shutdown system,
      
      [ 3142.764557][    T1] bcache: bcache_reboot() Stopping all devices:
      [ 3142.776265][ T2649]
      [ 3142.777159][ T2649] ======================================================
      [ 3142.780039][ T2649] WARNING: possible circular locking dependency detected
      [ 3142.782869][ T2649] 5.2.0-rc4-lp151.20-default+ #1 Tainted: G        W
      [ 3142.785684][ T2649] ------------------------------------------------------
      [ 3142.788479][ T2649] kworker/3:67/2649 is trying to acquire lock:
      [ 3142.790738][ T2649] 00000000aaf02291 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
      [ 3142.794678][ T2649]
      [ 3142.794678][ T2649] but task is already holding lock:
      [ 3142.797402][ T2649] 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
      [ 3142.801462][ T2649]
      [ 3142.801462][ T2649] which lock already depends on the new lock.
      [ 3142.801462][ T2649]
      [ 3142.805277][ T2649]
      [ 3142.805277][ T2649] the existing dependency chain (in reverse order) is:
      [ 3142.808902][ T2649]
      [ 3142.808902][ T2649] -> #2 (&bch_register_lock){+.+.}:
      [ 3142.812396][ T2649]        __mutex_lock+0x7a/0x9d0
      [ 3142.814184][ T2649]        cached_dev_free+0x17/0x120 [bcache]
      [ 3142.816415][ T2649]        process_one_work+0x2a4/0x640
      [ 3142.818413][ T2649]        worker_thread+0x39/0x3f0
      [ 3142.820276][ T2649]        kthread+0x125/0x140
      [ 3142.822061][ T2649]        ret_from_fork+0x3a/0x50
      [ 3142.823965][ T2649]
      [ 3142.823965][ T2649] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
      [ 3142.827244][ T2649]        process_one_work+0x277/0x640
      [ 3142.829160][ T2649]        worker_thread+0x39/0x3f0
      [ 3142.830958][ T2649]        kthread+0x125/0x140
      [ 3142.832674][ T2649]        ret_from_fork+0x3a/0x50
      [ 3142.834915][ T2649]
      [ 3142.834915][ T2649] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
      [ 3142.838121][ T2649]        lock_acquire+0xb4/0x1c0
      [ 3142.840025][ T2649]        flush_workqueue+0xae/0x4c0
      [ 3142.842035][ T2649]        drain_workqueue+0xa9/0x180
      [ 3142.844042][ T2649]        destroy_workqueue+0x17/0x250
      [ 3142.846142][ T2649]        cached_dev_free+0x52/0x120 [bcache]
      [ 3142.848530][ T2649]        process_one_work+0x2a4/0x640
      [ 3142.850663][ T2649]        worker_thread+0x39/0x3f0
      [ 3142.852464][ T2649]        kthread+0x125/0x140
      [ 3142.854106][ T2649]        ret_from_fork+0x3a/0x50
      [ 3142.855880][ T2649]
      [ 3142.855880][ T2649] other info that might help us debug this:
      [ 3142.855880][ T2649]
      [ 3142.859663][ T2649] Chain exists of:
      [ 3142.859663][ T2649]   (wq_completion)bcache_writeback_wq --> (work_completion)(&cl->work)#2 --> &bch_register_lock
      [ 3142.859663][ T2649]
      [ 3142.865424][ T2649]  Possible unsafe locking scenario:
      [ 3142.865424][ T2649]
      [ 3142.868022][ T2649]        CPU0                    CPU1
      [ 3142.869885][ T2649]        ----                    ----
      [ 3142.871751][ T2649]   lock(&bch_register_lock);
      [ 3142.873379][ T2649]                                lock((work_completion)(&cl->work)#2);
      [ 3142.876399][ T2649]                                lock(&bch_register_lock);
      [ 3142.879727][ T2649]   lock((wq_completion)bcache_writeback_wq);
      [ 3142.882064][ T2649]
      [ 3142.882064][ T2649]  *** DEADLOCK ***
      [ 3142.882064][ T2649]
      [ 3142.885060][ T2649] 3 locks held by kworker/3:67/2649:
      [ 3142.887245][ T2649]  #0: 00000000e774cdd0 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
      [ 3142.890815][ T2649]  #1: 00000000f7df89da ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [ 3142.894884][ T2649]  #2: 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
      [ 3142.898797][ T2649]
      [ 3142.898797][ T2649] stack backtrace:
      [ 3142.900961][ T2649] CPU: 3 PID: 2649 Comm: kworker/3:67 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
      [ 3142.904789][ T2649] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [ 3142.909168][ T2649] Workqueue: events cached_dev_free [bcache]
      [ 3142.911422][ T2649] Call Trace:
      [ 3142.912656][ T2649]  dump_stack+0x85/0xcb
      [ 3142.914181][ T2649]  print_circular_bug+0x19a/0x1f0
      [ 3142.916193][ T2649]  __lock_acquire+0x16cd/0x1850
      [ 3142.917936][ T2649]  ? __lock_acquire+0x6a8/0x1850
      [ 3142.919704][ T2649]  ? lock_acquire+0xb4/0x1c0
      [ 3142.921335][ T2649]  ? find_held_lock+0x34/0xa0
      [ 3142.923052][ T2649]  lock_acquire+0xb4/0x1c0
      [ 3142.924635][ T2649]  ? flush_workqueue+0x87/0x4c0
      [ 3142.926375][ T2649]  flush_workqueue+0xae/0x4c0
      [ 3142.928047][ T2649]  ? flush_workqueue+0x87/0x4c0
      [ 3142.929824][ T2649]  ? drain_workqueue+0xa9/0x180
      [ 3142.931686][ T2649]  drain_workqueue+0xa9/0x180
      [ 3142.933534][ T2649]  destroy_workqueue+0x17/0x250
      [ 3142.935787][ T2649]  cached_dev_free+0x52/0x120 [bcache]
      [ 3142.937795][ T2649]  process_one_work+0x2a4/0x640
      [ 3142.939803][ T2649]  worker_thread+0x39/0x3f0
      [ 3142.941487][ T2649]  ? process_one_work+0x640/0x640
      [ 3142.943389][ T2649]  kthread+0x125/0x140
      [ 3142.944894][ T2649]  ? kthread_create_worker_on_cpu+0x70/0x70
      [ 3142.947744][ T2649]  ret_from_fork+0x3a/0x50
      [ 3142.970358][ T2649] bcache: bcache_device_free() bcache0 stopped
      
      Here is how the deadlock happens.
      1) bcache_reboot() calls bcache_device_stop(), then inside
         bcache_device_stop() BCACHE_DEV_CLOSING bit is set on d->flags.
         Then closure_queue(&d->cl) is called to invoke cached_dev_flush().
      2) In cached_dev_flush(), cached_dev_free() is called by continu_at().
      3) In cached_dev_free(), when stopping the writeback kthread of the
         cached device by kthread_stop(), dc->writeback_thread will be waken
         up to quite the kthread while-loop, then cached_dev_put() is called
         in bch_writeback_thread().
      4) Calling cached_dev_put() in writeback kthread may drop dc->count to
         0, then dc->detach kworker is scheduled, which is initialized as
         cached_dev_detach_finish().
      5) Inside cached_dev_detach_finish(), the last line of code is to call
         closure_put(&dc->disk.cl), which drops the last reference counter of
         closrure dc->disk.cl, then the callback cached_dev_flush() gets
         called.
      Now cached_dev_flush() is called for second time in the code path, the
      first time is in step 2). And again bch_register_lock will be acquired
      again, and a A-A lock (lockdep terminology) is happening.
      
      The root cause of the above A-A lock is in cached_dev_free(), mutex
      bch_register_lock is held before stopping writeback kthread and other
      kworkers. Fortunately now we have variable 'bcache_is_reboot', which may
      prevent device registration or unregistration during reboot/shutdown
      time, so it is unncessary to hold bch_register_lock such early now.
      
      This is how this patch fixes the reboot/shutdown time A-A lock issue:
      After moving mutex_lock(&bch_register_lock) to a later location where
      before atomic_read(&dc->running) in cached_dev_free(), such A-A lock
      problem can be solved without any reboot time registration race.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      80265d8d
    • Coly Li's avatar
      bcache: acquire bch_register_lock later in cached_dev_detach_finish() · 97ba3b81
      Coly Li authored
      Now there is variable bcache_is_reboot to prevent device register or
      unregister during reboot, it is unncessary to still hold mutex lock
      bch_register_lock before stopping writeback_rate_update kworker and
      writeback kthread. And if the stopping kworker or kthread holding
      bch_register_lock inside their routine (we used to have such problem
      in writeback thread, thanks to Junhui Wang fixed it), it is very easy
      to introduce deadlock during reboot/shutdown procedure.
      
      Therefore in this patch, the location to acquire bch_register_lock is
      moved to the location before calling calc_cached_dev_sectors(). Which
      is later then original location in cached_dev_detach_finish().
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      97ba3b81
    • Coly Li's avatar
      bcache: avoid a deadlock in bcache_reboot() · a59ff6cc
      Coly Li authored
      It is quite frequently to observe deadlock in bcache_reboot() happens
      and hang the system reboot process. The reason is, in bcache_reboot()
      when calling bch_cache_set_stop() and bcache_device_stop() the mutex
      bch_register_lock is held. But in the process to stop cache set and
      bcache device, bch_register_lock will be acquired again. If this mutex
      is held here, deadlock will happen inside the stopping process. The
      aftermath of the deadlock is, whole system reboot gets hung.
      
      The fix is to avoid holding bch_register_lock for the following loops
      in bcache_reboot(),
             list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
                      bch_cache_set_stop(c);
      
              list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
                      bcache_device_stop(&dc->disk);
      
      A module range variable 'bcache_is_reboot' is added, it sets to true
      in bcache_reboot(). In register_bcache(), if bcache_is_reboot is checked
      to be true, reject the registration by returning -EBUSY immediately.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a59ff6cc
    • Coly Li's avatar
      bcache: stop writeback kthread and kworker when bch_cached_dev_run() failed · 5c2a634c
      Coly Li authored
      In bch_cached_dev_attach() after bch_cached_dev_writeback_start()
      called, the wrireback kthread and writeback rate update kworker of the
      cached device are created, if the following bch_cached_dev_run()
      failed, bch_cached_dev_attach() will return with -ENOMEM without
      stopping the writeback related kthread and kworker.
      
      This patch stops writeback kthread and writeback rate update kworker
      before returning -ENOMEM if bch_cached_dev_run() returns error.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5c2a634c
    • Coly Li's avatar
      bcache: destroy dc->writeback_write_wq if failed to create dc->writeback_thread · f54d801d
      Coly Li authored
      Commit 9baf3097 ("bcache: fix for gc and write-back race") added a
      new work queue dc->writeback_write_wq, but forgot to destroy it in the
      error condition when creating dc->writeback_thread failed.
      
      This patch destroys dc->writeback_write_wq if kthread_create() returns
      error pointer to dc->writeback_thread, then a memory leak is avoided.
      
      Fixes: 9baf3097 ("bcache: fix for gc and write-back race")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f54d801d
    • Coly Li's avatar
      bcache: fix mistaken sysfs entry for io_error counter · 54619998
      Coly Li authored
      In bch_cached_dev_files[] from driver/md/bcache/sysfs.c, sysfs_errors is
      incorrectly inserted in. The correct entry should be sysfs_io_errors.
      
      This patch fixes the problem and now I/O errors of cached device can be
      read from /sys/block/bcache<N>/bcache/io_errors.
      
      Fixes: c7b7bd07 ("bcache: add io_disable to struct cached_dev")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      54619998
    • Coly Li's avatar
      bcache: add pendings_cleanup to stop pending bcache device · 0c277e21
      Coly Li authored
      If a bcache device is in dirty state and its cache set is not
      registered, this bcache device will not appear in /dev/bcache<N>,
      and there is no way to stop it or remove the bcache kernel module.
      
      This is an as-designed behavior, but sometimes people has to reboot
      whole system to release or stop the pending backing device.
      
      This sysfs interface may remove such pending bcache devices when
      write anything into the sysfs file manually.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0c277e21
    • Coly Li's avatar
      bcache: make bset_search_tree() be more understandable · 944a4f34
      Coly Li authored
      The purpose of following code in bset_search_tree() is to avoid a branch
      instruction,
       994         if (likely(f->exponent != 127))
       995                 n = j * 2 + (((unsigned int)
       996                               (f->mantissa -
       997                                bfloat_mantissa(search, f))) >> 31);
       998         else
       999                 n = (bkey_cmp(tree_to_bkey(t, j), search) > 0)
      1000                         ? j * 2
      1001                         : j * 2 + 1;
      
      This piece of code is not very clear to understand, even when I tried to
      add code comment for it, I made mistake. This patch removes the implict
      bit operation and uses explicit branch to calculate next location in
      binary tree search.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      944a4f34
    • Coly Li's avatar
      bcache: remove "XXX:" comment line from run_cache_set() · 68a53c95
      Coly Li authored
      In previous bcache patches for Linux v5.2, the failure code path of
      run_cache_set() is tested and fixed. So now the following comment
      line can be removed from run_cache_set(),
      	/* XXX: test this, it's broken */
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      68a53c95
    • Coly Li's avatar
      bcache: improve error message in bch_cached_dev_run() · e0faa3d7
      Coly Li authored
      This patch adds more error message in bch_cached_dev_run() to indicate
      the exact reason why an error value is returned. Please notice when
      printing out the "is running already" message, pr_info() is used here,
      because in this case also -EBUSY is returned, the bcache device can
      continue to attach to the cache devince and run, so it won't be an
      error level message in kernel message.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e0faa3d7
    • Coly Li's avatar
      bcache: add more error message in bch_cached_dev_attach() · 633bb2ce
      Coly Li authored
      This patch adds more error message for attaching cached device, this is
      helpful to debug code failure during bache device start up.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      633bb2ce
    • Coly Li's avatar
      bcache: more detailed error message to bcache_device_link() · 4b6efb4b
      Coly Li authored
      This patch adds more accurate error message for specific
      ssyfs_create_link() call, to help debugging failure during
      bcache device start tup.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4b6efb4b
    • Coly Li's avatar
      bcache: check CACHE_SET_IO_DISABLE bit in bch_journal() · 383ff218
      Coly Li authored
      When too many I/O errors happen on cache set and CACHE_SET_IO_DISABLE
      bit is set, bch_journal() may continue to work because the journaling
      bkey might be still in write set yet. The caller of bch_journal() may
      believe the journal still work but the truth is in-memory journal write
      set won't be written into cache device any more. This behavior may
      introduce potential inconsistent metadata status.
      
      This patch checks CACHE_SET_IO_DISABLE bit at the head of bch_journal(),
      if the bit is set, bch_journal() returns NULL immediately to notice
      caller to know journal does not work.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      383ff218
    • Coly Li's avatar
      bcache: check CACHE_SET_IO_DISABLE in allocator code · e775339e
      Coly Li authored
      If CACHE_SET_IO_DISABLE of a cache set flag is set by too many I/O
      errors, currently allocator routines can still continue allocate
      space which may introduce inconsistent metadata state.
      
      This patch checkes CACHE_SET_IO_DISABLE bit in following allocator
      routines,
      - bch_bucket_alloc()
      - __bch_bucket_alloc_set()
      Once CACHE_SET_IO_DISABLE is set on cache set, the allocator routines
      may reject allocation request earlier to avoid potential inconsistent
      metadata.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e775339e
    • Coly Li's avatar
      bcache: remove unncessary code in bch_btree_keys_init() · bd9026c8
      Coly Li authored
      Function bch_btree_keys_init() initializes b->set[].size and
      b->set[].data to zero. As the code comments indicates, these code indeed
      is unncessary, because both struct btree_keys and struct bset_tree are
      nested embedded into struct btree, when struct btree is filled with 0
      bits by kzalloc() in mca_bucket_alloc(), b->set[].size and
      b->set[].data are initialized to 0 (a.k.a NULL) already.
      
      This patch removes the redundant code, and add comments in
      bch_btree_keys_init() and mca_bucket_alloc() to explain why it's safe.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bd9026c8
    • Coly Li's avatar
      bcache: add return value check to bch_cached_dev_run() · 0b13efec
      Coly Li authored
      This patch adds return value check to bch_cached_dev_run(), now if there
      is error happens inside bch_cached_dev_run(), it can be catched.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0b13efec
    • Alexandru Ardelean's avatar
      bcache: use sysfs_match_string() instead of __sysfs_match_string() · 89e0341a
      Alexandru Ardelean authored
      The arrays (of strings) that are passed to __sysfs_match_string() are
      static, so use sysfs_match_string() which does an implicit ARRAY_SIZE()
      over these arrays.
      
      Functionally, this doesn't change anything.
      The change is more cosmetic.
      
      It only shrinks the static arrays by 1 byte each.
      Signed-off-by: default avatarAlexandru Ardelean <alexandru.ardelean@analog.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      89e0341a
    • Coly Li's avatar
      bcache: remove unnecessary prefetch() in bset_search_tree() · f960facb
      Coly Li authored
      In function bset_search_tree(), when p >= t->size, t->tree[0] will be
      prefetched by the following code piece,
       974                 unsigned int p = n << 4;
       975
       976                 p &= ((int) (p - t->size)) >> 31;
       977
       978                 prefetch(&t->tree[p]);
      
      The purpose of the above code is to avoid a branch instruction, but
      when p >= t->size, prefetch(&t->tree[0]) has no positive performance
      contribution at all. This patch avoids the unncessary prefetch by only
      calling prefetch() when p < t->size.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f960facb
    • Coly Li's avatar
      bcache: add io error counting in write_bdev_super_endio() · 08ec1e62
      Coly Li authored
      When backing device super block is written by bch_write_bdev_super(),
      the bio complete callback write_bdev_super_endio() simply ignores I/O
      status. Indeed such write request also contribute to backing device
      health status if the request failed.
      
      This patch checkes bio->bi_status in write_bdev_super_endio(), if there
      is error, bch_count_backing_io_errors() will be called to count an I/O
      error to dc->io_errors.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      08ec1e62
    • Coly Li's avatar
      bcache: ignore read-ahead request failure on backing device · 578df99b
      Coly Li authored
      When md raid device (e.g. raid456) is used as backing device, read-ahead
      requests on a degrading and recovering md raid device might be failured
      immediately by md raid code, but indeed this md raid array can still be
      read or write for normal I/O requests. Therefore such failed read-ahead
      request are not real hardware failure. Further more, after degrading and
      recovering accomplished, read-ahead requests will be handled by md raid
      array again.
      
      For such condition, I/O failures of read-ahead requests don't indicate
      real health status (because normal I/O still be served), they should not
      be counted into I/O error counter dc->io_errors.
      
      Since there is no simple way to detect whether the backing divice is a
      md raid device, this patch simply ignores I/O failures for read-ahead
      bios on backing device, to avoid bogus backing device failure on a
      degrading md raid array.
      Suggested-and-tested-by: default avatarThorsten Knabe <linux@thorsten-knabe.de>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      578df99b
    • Coly Li's avatar
      bcache: avoid flushing btree node in cache_set_flush() if io disabled · e6dcbd3e
      Coly Li authored
      When cache_set_flush() is called for too many I/O errors detected on
      cache device and the cache set is retiring, inside the function it
      doesn't make sense to flushing cached btree nodes from c->btree_cache
      because CACHE_SET_IO_DISABLE is set on c->flags already and all I/Os
      onto cache device will be rejected.
      
      This patch checks in cache_set_flush() that whether CACHE_SET_IO_DISABLE
      is set. If yes, then avoids to flush the cached btree nodes to reduce
      more time and make cache set retiring more faster.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e6dcbd3e
    • Coly Li's avatar
      Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()" · 695277f1
      Coly Li authored
      This reverts commit 6147305c.
      
      Although this patch helps the failed bcache device to stop faster when
      too many I/O errors detected on corresponding cached device, setting
      CACHE_SET_IO_DISABLE bit to cache set c->flags was not a good idea. This
      operation will disable all I/Os on cache set, which means other attached
      bcache devices won't work neither.
      
      Without this patch, the failed bcache device can also be stopped
      eventually if internal I/O accomplished (e.g. writeback). Therefore here
      I revert it.
      
      Fixes: 6147305c ("bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()")
      Reported-by: default avatarYong Li <mr.liyong@qq.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      695277f1
    • Coly Li's avatar
      bcache: fix return value error in bch_journal_read() · 0ae49cb7
      Coly Li authored
      When everything is OK in bch_journal_read(), finally the return value
      is returned by,
      	return ret;
      which assumes ret will be 0 here. This assumption is wrong when all
      journal buckets as are full and filled with valid journal entries. In
      such cache the last location referencess read_bucket() sets 'ret' to
      1, which means new jset added into jset list. The jset list is list
      'journal' in caller run_cache_set().
      
      Return 1 to run_cache_set() means something wrong and the cache set
      won't start, but indeed everything is OK.
      
      This patch changes the line at end of bch_journal_read() to directly
      return 0 since everything if verything is good. Then a bogus error
      is fixed.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0ae49cb7
    • Coly Li's avatar
      bcache: check c->gc_thread by IS_ERR_OR_NULL in cache_set_flush() · b387e9b5
      Coly Li authored
      When system memory is in heavy pressure, bch_gc_thread_start() from
      run_cache_set() may fail due to out of memory. In such condition,
      c->gc_thread is assigned to -ENOMEM, not NULL pointer. Then in following
      failure code path bch_cache_set_error(), when cache_set_flush() gets
      called, the code piece to stop c->gc_thread is broken,
               if (!IS_ERR_OR_NULL(c->gc_thread))
                       kthread_stop(c->gc_thread);
      
      And KASAN catches such NULL pointer deference problem, with the warning
      information:
      
      [  561.207881] ==================================================================
      [  561.207900] BUG: KASAN: null-ptr-deref in kthread_stop+0x3b/0x440
      [  561.207904] Write of size 4 at addr 000000000000001c by task kworker/15:1/313
      
      [  561.207913] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G        W         5.0.0-vanilla+ #3
      [  561.207916] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
      [  561.207935] Workqueue: events cache_set_flush [bcache]
      [  561.207940] Call Trace:
      [  561.207948]  dump_stack+0x9a/0xeb
      [  561.207955]  ? kthread_stop+0x3b/0x440
      [  561.207960]  ? kthread_stop+0x3b/0x440
      [  561.207965]  kasan_report+0x176/0x192
      [  561.207973]  ? kthread_stop+0x3b/0x440
      [  561.207981]  kthread_stop+0x3b/0x440
      [  561.207995]  cache_set_flush+0xd4/0x6d0 [bcache]
      [  561.208008]  process_one_work+0x856/0x1620
      [  561.208015]  ? find_held_lock+0x39/0x1d0
      [  561.208028]  ? drain_workqueue+0x380/0x380
      [  561.208048]  worker_thread+0x87/0xb80
      [  561.208058]  ? __kthread_parkme+0xb6/0x180
      [  561.208067]  ? process_one_work+0x1620/0x1620
      [  561.208072]  kthread+0x326/0x3e0
      [  561.208079]  ? kthread_create_worker_on_cpu+0xc0/0xc0
      [  561.208090]  ret_from_fork+0x3a/0x50
      [  561.208110] ==================================================================
      [  561.208113] Disabling lock debugging due to kernel taint
      [  561.208115] irq event stamp: 11800231
      [  561.208126] hardirqs last  enabled at (11800231): [<ffffffff83008538>] do_syscall_64+0x18/0x410
      [  561.208127] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      [  561.208129] #PF error: [WRITE]
      [  561.312253] hardirqs last disabled at (11800230): [<ffffffff830052ff>] trace_hardirqs_off_thunk+0x1a/0x1c
      [  561.312259] softirqs last  enabled at (11799832): [<ffffffff850005c7>] __do_softirq+0x5c7/0x8c3
      [  561.405975] PGD 0 P4D 0
      [  561.442494] softirqs last disabled at (11799821): [<ffffffff831add2c>] irq_exit+0x1ac/0x1e0
      [  561.791359] Oops: 0002 [#1] SMP KASAN NOPTI
      [  561.791362] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G    B   W         5.0.0-vanilla+ #3
      [  561.791363] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
      [  561.791371] Workqueue: events cache_set_flush [bcache]
      [  561.791374] RIP: 0010:kthread_stop+0x3b/0x440
      [  561.791376] Code: 00 00 65 8b 05 26 d5 e0 7c 89 c0 48 0f a3 05 ec aa df 02 0f 82 dc 02 00 00 4c 8d 63 20 be 04 00 00 00 4c 89 e7 e8 65 c5 53 00 <f0> ff 43 20 48 8d 7b 24 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
      [  561.791377] RSP: 0018:ffff88872fc8fd10 EFLAGS: 00010286
      [  561.838895] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838916] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838934] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838948] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838966] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838979] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838996] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  563.067028] RAX: 0000000000000000 RBX: fffffffffffffffc RCX: ffffffff832dd314
      [  563.067030] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000297
      [  563.067032] RBP: ffff88872fc8fe88 R08: fffffbfff0b8213d R09: fffffbfff0b8213d
      [  563.067034] R10: 0000000000000001 R11: fffffbfff0b8213c R12: 000000000000001c
      [  563.408618] R13: ffff88dc61cc0f68 R14: ffff888102b94900 R15: ffff88dc61cc0f68
      [  563.408620] FS:  0000000000000000(0000) GS:ffff888f7dc00000(0000) knlGS:0000000000000000
      [  563.408622] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  563.408623] CR2: 000000000000001c CR3: 0000000f48a1a004 CR4: 00000000007606e0
      [  563.408625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  563.408627] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  563.904795] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  563.915796] PKRU: 55555554
      [  563.915797] Call Trace:
      [  563.915807]  cache_set_flush+0xd4/0x6d0 [bcache]
      [  563.915812]  process_one_work+0x856/0x1620
      [  564.001226] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.033563]  ? find_held_lock+0x39/0x1d0
      [  564.033567]  ? drain_workqueue+0x380/0x380
      [  564.033574]  worker_thread+0x87/0xb80
      [  564.062823] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.118042]  ? __kthread_parkme+0xb6/0x180
      [  564.118046]  ? process_one_work+0x1620/0x1620
      [  564.118048]  kthread+0x326/0x3e0
      [  564.118050]  ? kthread_create_worker_on_cpu+0xc0/0xc0
      [  564.167066] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.252441]  ret_from_fork+0x3a/0x50
      [  564.252447] Modules linked in: msr rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib i40iw configfs iw_cm ib_cm libiscsi scsi_transport_iscsi mlx4_ib ib_uverbs mlx4_en ib_core nls_iso8859_1 nls_cp437 vfat fat intel_rapl skx_edac x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ses raid0 aesni_intel cdc_ether enclosure usbnet ipmi_ssif joydev aes_x86_64 i40e scsi_transport_sas mii bcache md_mod crypto_simd mei_me ioatdma crc64 ptp cryptd pcspkr i2c_i801 mlx4_core glue_helper pps_core mei lpc_ich dca wmi ipmi_si ipmi_devintf nd_pmem dax_pmem nd_btt ipmi_msghandler device_dax pcc_cpufreq button hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect xhci_pci sysimgblt fb_sys_fops xhci_hcd ttm megaraid_sas drm usbcore nfit libnvdimm sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
      [  564.299390] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.348360] CR2: 000000000000001c
      [  564.348362] ---[ end trace b7f0e5cc7b2103b0 ]---
      
      Therefore, it is not enough to only check whether c->gc_thread is NULL,
      we should use IS_ERR_OR_NULL() to check both NULL pointer and error
      value.
      
      This patch changes the above buggy code piece in this way,
               if (!IS_ERR_OR_NULL(c->gc_thread))
                       kthread_stop(c->gc_thread);
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b387e9b5
    • Coly Li's avatar
      bcache: don't set max writeback rate if gc is running · 141df8bb
      Coly Li authored
      When gc is running, user space I/O processes may wait inside
      bcache code, so no new I/O coming. Indeed this is not a real idle
      time, maximum writeback rate should not be set in such situation.
      Otherwise a faster writeback thread may compete locks with gc thread
      and makes garbage collection slower, which results a longer I/O
      freeze period.
      
      This patch checks c->gc_mark_valid in set_at_max_writeback_rate(). If
      c->gc_mark_valid is 0 (gc running), set_at_max_writeback_rate() returns
      false, then update_writeback_rate() will not set writeback rate to
      maximum value even c->idle_counter reaches an idle threshold.
      
      Now writeback thread won't interfere gc thread performance.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      141df8bb
  2. 27 Jun, 2019 1 commit
  3. 26 Jun, 2019 3 commits
  4. 25 Jun, 2019 3 commits
    • Paolo Valente's avatar
      block, bfq: re-schedule empty queues if they deserve I/O plugging · 3726112e
      Paolo Valente authored
      Consider, on one side, a bfq_queue Q that remains empty while in
      service, and, on the other side, the pending I/O of bfq_queues that,
      according to their timestamps, have to be served after Q.  If an
      uncontrolled amount of I/O from the latter bfq_queues were dispatched
      while Q is waiting for its new I/O to arrive, then Q's bandwidth
      guarantees would be violated. To prevent this, I/O dispatch is plugged
      until Q receives new I/O (except for a properly controlled amount of
      injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging,
      for the following reason.
      
      Preemption is performed in two steps. First, Q is expired and
      re-scheduled. Second, the new bfq_queue to serve is chosen. The first
      step is needed by the second, as the second can be performed only
      after Q's timestamps have been properly updated (done in the
      expiration step), and Q has been re-queued for service. This
      dependency is a consequence of the way how BFQ's scheduling algorithm
      is currently implemented.
      
      But Q is not re-scheduled at all in the first step, because Q is
      empty. As a consequence, an uncontrolled amount of I/O may be
      dispatched until Q becomes non empty again. This breaks Q's service
      guarantees.
      
      This commit addresses this issue by re-scheduling Q even if it is
      empty. This in turn breaks the assumption that all scheduled queues
      are non empty. Then a few extra checks are now needed.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3726112e
    • Paolo Valente's avatar
      block, bfq: preempt lower-weight or lower-priority queues · 96a291c3
      Paolo Valente authored
      BFQ enqueues the I/O coming from each process into a separate
      bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
      served for at most timeout_sync milliseconds (default: 125 ms). This
      service scheme is prone to the following inaccuracy.
      
      While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
      receive I/O, and, according to BFQ's scheduling policy, may become the
      right bfq_queue to serve, in place of the currently in-service
      bfq_queue. In this respect, postponing the service of Q2 to after the
      service of Q1 finishes may delay the completion of Q2's I/O, compared
      with an ideal service in which all non-empty bfq_queues are served in
      parallel, and every non-empty bfq_queue is served at a rate
      proportional to the bfq_queue's weight. This additional delay is equal
      at most to the time Q1 may unjustly remain in service before switching
      to Q2.
      
      If Q1 and Q2 have the same weight, then this time is most likely
      negligible compared with the completion time to be guaranteed to Q2's
      I/O. In addition, first, one of the reasons why BFQ may want to serve
      Q1 for a while is that this boosts throughput and, second, serving Q1
      longer reduces BFQ's overhead. As a conclusion, it is usually better
      not to preempt Q1 if both Q1 and Q2 have the same weight.
      
      In contrast, as Q2's weight or priority becomes higher and higher
      compared with that of Q1, the above delay becomes larger and larger,
      compared with the I/O completion times that have to be guaranteed to
      Q2 according to Q2's weight. So reducing this delay may be more
      important than avoiding the costs of preempting Q1.
      
      Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
      higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
      triggers a new choice of the next bfq_queue to serve. If Q2 really is
      the next bfq_queue to serve, then Q2 will be set in service
      immediately.
      
      This change reduces the component of the I/O latency caused by the
      above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
      SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
      a process doing sporadic random reads while another process is doing
      continuous sequential reads.
      Signed-off-by: default avatarNicola Bottura <bottura.nicola95@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      96a291c3
    • Paolo Valente's avatar
      block, bfq: detect wakers and unconditionally inject their I/O · 13a857a4
      Paolo Valente authored
      A bfq_queue Q may happen to be synchronized with another
      bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
      receive new I/O. We call Q2 "waker queue".
      
      If I/O plugging is being performed for Q, and Q is not receiving any
      more I/O because of the above synchronization, then, thanks to BFQ's
      injection mechanism, the waker queue is likely to get served before
      the I/O-plugging timeout fires.
      
      Unfortunately, this fact may not be sufficient to guarantee a high
      throughput during the I/O plugging, because the inject limit for Q may
      be too low to guarantee a lot of injected I/O. In addition, the
      duration of the plugging, i.e., the time before Q finally receives new
      I/O, may not be minimized, because the waker queue may happen to be
      served only after other queues.
      
      To address these issues, this commit introduces the explicit detection
      of the waker queue, and the unconditional injection of a pending I/O
      request of the waker queue on each invocation of
      bfq_dispatch_request().
      
      One may be concerned that this systematic injection of I/O from the
      waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
      the contrary, next Q's I/O is brought forward dramatically, for it is
      not blocked for milliseconds.
      Reported-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      13a857a4