1. 27 May, 2022 2 commits
  2. 24 May, 2022 4 commits
    • Coly Li's avatar
      bcache: avoid journal no-space deadlock by reserving 1 journal bucket · 32feee36
      Coly Li authored
      The journal no-space deadlock was reported time to time. Such deadlock
      can happen in the following situation.
      
      When all journal buckets are fully filled by active jset with heavy
      write I/O load, the cache set registration (after a reboot) will load
      all active jsets and inserting them into the btree again (which is
      called journal replay). If a journaled bkey is inserted into a btree
      node and results btree node split, new journal request might be
      triggered. For example, the btree grows one more level after the node
      split, then the root node record in cache device super block will be
      upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no
      space in journal buckets, the journal replay has to wait for new journal
      bucket to be reclaimed after at least one journal bucket replayed. This
      is one example that how the journal no-space deadlock happens.
      
      The solution to avoid the deadlock is to reserve 1 journal bucket in
      run time, and only permit the reserved journal bucket to be used during
      cache set registration procedure for things like journal replay. Then
      the journal space will never be fully filled, there is no chance for
      journal no-space deadlock to happen anymore.
      
      This patch adds a new member "bool do_reserve" in struct journal, it is
      inititalized to 0 (false) when struct journal is allocated, and set to
      1 (true) by bch_journal_space_reserve() when all initialization done in
      run_cache_set(). In the run time when journal_reclaim() tries to
      allocate a new journal bucket, free_journal_buckets() is called to check
      whether there are enough free journal buckets to use. If there is only
      1 free journal bucket and journal->do_reserve is 1 (true), the last
      bucket is reserved and free_journal_buckets() will return 0 to indicate
      no free journal bucket. Then journal_reclaim() will give up, and try
      next time to see whetheer there is free journal bucket to allocate. By
      this method, there is always 1 jouranl bucket reserved in run time.
      
      During the cache set registration, journal->do_reserve is 0 (false), so
      the reserved journal bucket can be used to avoid the no-space deadlock.
      Reported-by: default avatarNikhil Kshirsagar <nkshirsagar@gmail.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.deSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      32feee36
    • Coly Li's avatar
      bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() · 80db4e47
      Coly Li authored
      After making bch_sectors_dirty_init() being multithreaded, the existing
      incremental dirty sector counting in bch_root_node_dirty_init() doesn't
      release btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME)
      bkeys. Because a read lock is added on btree root node to prevent the
      btree to be split during the dirty sectors counting, other I/O requester
      has no chance to gain the write lock even restart bcache_btree().
      
      That is to say, the incremental dirty sectors counting is incompatible
      to the multhreaded bch_sectors_dirty_init(). We have to choose one and
      drop another one.
      
      In my testing, with 512 bytes random writes, I generate 1.2T dirty data
      and a btree with 400K nodes. With single thread and incremental dirty
      sectors counting, it takes 30+ minites to register the backing device.
      And with multithreaded dirty sectors counting, the backing device
      registration can be accomplished within 2 minutes.
      
      The 30+ minutes V.S. 2- minutes difference makes me decide to keep
      multithreaded bch_sectors_dirty_init() and drop the incremental dirty
      sectors counting. This is what this patch does.
      
      But INIT_KEYS_EACH_TIME is kept, in sectors_dirty_init_fn() the CPU
      will be released by cond_resched() after every INIT_KEYS_EACH_TIME keys
      iterated. This is to avoid the watchdog reports a bogus soft lockup
      warning.
      
      Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20220524102336.10684-4-colyli@suse.deSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      80db4e47
    • Coly Li's avatar
      bcache: improve multithreaded bch_sectors_dirty_init() · 4dc34ae1
      Coly Li authored
      Commit b144e45f ("bcache: make bch_sectors_dirty_init() to be
      multithreaded") makes bch_sectors_dirty_init() to be much faster
      when counting dirty sectors by iterating all dirty keys in the btree.
      But it isn't in ideal shape yet, still can be improved.
      
      This patch does the following changes to improve current parallel dirty
      keys iteration on the btree,
      - Add read lock to root node when multiple threads iterating the btree,
        to prevent the root node gets split by I/Os from other registered
        bcache devices.
      - Remove local variable "char name[32]" and generate kernel thread name
        string directly when calling kthread_run().
      - Allocate "struct bch_dirty_init_state state" directly on stack and
        avoid the unnecessary dynamic memory allocation for it.
      - Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12 which is enough indeed.
      - Increase &state->started to count created kernel thread after it
        succeeds to create.
      - When wait for all dirty key counting threads to finish, use
        wait_event() to replace wait_event_interruptible().
      
      With the above changes, the code is more clear, and some potential error
      conditions are avoided.
      
      Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20220524102336.10684-3-colyli@suse.deSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4dc34ae1
    • Coly Li's avatar
      bcache: improve multithreaded bch_btree_check() · 62253644
      Coly Li authored
      Commit 8e710227 ("bcache: make bch_btree_check() to be
      multithreaded") makes bch_btree_check() to be much faster when checking
      all btree nodes during cache device registration. But it isn't in ideal
      shap yet, still can be improved.
      
      This patch does the following thing to improve current parallel btree
      nodes check by multiple threads in bch_btree_check(),
      - Add read lock to root node while checking all the btree nodes with
        multiple threads. Although currently it is not mandatory but it is
        good to have a read lock in code logic.
      - Remove local variable 'char name[32]', and generate kernel thread name
        string directly when calling kthread_run().
      - Allocate local variable "struct btree_check_state check_state" on the
        stack and avoid unnecessary dynamic memory allocation for it.
      - Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12 which is enough indeed.
      - Increase check_state->started to count created kernel thread after it
        succeeds to create.
      - When wait for all checking kernel threads to finish, use wait_event()
        to replace wait_event_interruptible().
      
      With this change, the code is more clear, and some potential error
      conditions are avoided.
      
      Fixes: 8e710227 ("bcache: make bch_btree_check() to be multithreaded")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.deSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      62253644
  3. 23 May, 2022 6 commits
    • Jens Axboe's avatar
      Merge branch 'md-next' of... · df7e7f2b
      Jens Axboe authored
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.19/drivers
      
      Pull MD updates from Song:
      
      "- Remove uses of bdevname, by Christoph Hellwig;
       - Bug fixes by Guoqing Jiang, and Xiao Ni."
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md: fix double free of io_acct_set bioset
        md: Don't set mddev private to NULL in raid0 pers->free
        md: remove most calls to bdevname
        md: protect md_unregister_thread from reentrancy
        md: don't unregister sync_thread with reconfig_mutex held
      df7e7f2b
    • Xiao Ni's avatar
      md: fix double free of io_acct_set bioset · 42b805af
      Xiao Ni authored
      Now io_acct_set is alloc and free in personality. Remove the codes that
      free io_acct_set in md_free and md_stop.
      
      Fixes: 0c031fd3 (md: Move alloc/free acct bioset in to personality)
      Signed-off-by: default avatarXiao Ni <xni@redhat.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      42b805af
    • Xiao Ni's avatar
      md: Don't set mddev private to NULL in raid0 pers->free · 0f2571ad
      Xiao Ni authored
      In normal stop process, it does like this:
         do_md_stop
            |
         __md_stop (pers->free(); mddev->private=NULL)
            |
         md_free (free mddev)
      __md_stop sets mddev->private to NULL after pers->free. The raid device
      will be stopped and mddev memory is free. But in reshape, it doesn't
      free the mddev and mddev will still be used in new raid.
      
      In reshape, it first sets mddev->private to new_pers and then runs
      old_pers->free(). Now raid0 sets mddev->private to NULL in raid0_free.
      The new raid can't work anymore. It will panic when dereference
      mddev->private because of NULL pointer dereference.
      
      It can panic like this:
      [63010.814972] kernel BUG at drivers/md/raid10.c:928!
      [63010.819778] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      [63010.825011] CPU: 3 PID: 44437 Comm: md0_resync Kdump: loaded Not tainted 5.14.0-86.el9.x86_64 #1
      [63010.833789] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.15.0 09/11/2020
      [63010.841440] RIP: 0010:raise_barrier+0x161/0x170 [raid10]
      [63010.865508] RSP: 0018:ffffc312408bbc10 EFLAGS: 00010246
      [63010.870734] RAX: 0000000000000000 RBX: ffffa00bf7d39800 RCX: 0000000000000000
      [63010.877866] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffa00bf7d39800
      [63010.884999] RBP: 0000000000000000 R08: fffffa4945e74400 R09: 0000000000000000
      [63010.892132] R10: ffffa00eed02f798 R11: 0000000000000000 R12: ffffa00bbc435200
      [63010.899266] R13: ffffa00bf7d39800 R14: 0000000000000400 R15: 0000000000000003
      [63010.906399] FS:  0000000000000000(0000) GS:ffffa00eed000000(0000) knlGS:0000000000000000
      [63010.914485] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [63010.920229] CR2: 00007f5cfbe99828 CR3: 0000000105efe000 CR4: 00000000003506e0
      [63010.927363] Call Trace:
      [63010.929822]  ? bio_reset+0xe/0x40
      [63010.933144]  ? raid10_alloc_init_r10buf+0x60/0xa0 [raid10]
      [63010.938629]  raid10_sync_request+0x756/0x1610 [raid10]
      [63010.943770]  md_do_sync.cold+0x3e4/0x94c
      [63010.947698]  md_thread+0xab/0x160
      [63010.951024]  ? md_write_inc+0x50/0x50
      [63010.954688]  kthread+0x149/0x170
      [63010.957923]  ? set_kthread_struct+0x40/0x40
      [63010.962107]  ret_from_fork+0x22/0x30
      
      Removing the code that sets mddev->private to NULL in raid0 can fix
      problem.
      
      Fixes: 0c031fd3 (md: Move alloc/free acct bioset in to personality)
      Reported-by: default avatarFine Fan <ffan@redhat.com>
      Signed-off-by: default avatarXiao Ni <xni@redhat.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      0f2571ad
    • Christoph Hellwig's avatar
      md: remove most calls to bdevname · 913cce5a
      Christoph Hellwig authored
      Use the %pg format specifier to save on stack consumption and code size.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      913cce5a
    • Guoqing Jiang's avatar
      md: protect md_unregister_thread from reentrancy · 1e267742
      Guoqing Jiang authored
      Generally, the md_unregister_thread is called with reconfig_mutex, but
      raid_message in dm-raid doesn't hold reconfig_mutex to unregister thread,
      so md_unregister_thread can be called simulitaneously from two call sites
      in theory.
      
      Then after previous commit which remove the protection of reconfig_mutex
      for md_unregister_thread completely, the potential issue could be worse
      than before.
      
      Let's take pers_lock at the beginning of function to ensure reentrancy.
      Reported-by: default avatarDonald Buczek <buczek@molgen.mpg.de>
      Signed-off-by: default avatarGuoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      1e267742
    • Guoqing Jiang's avatar
      md: don't unregister sync_thread with reconfig_mutex held · 8b48ec23
      Guoqing Jiang authored
      Unregister sync_thread doesn't need to hold reconfig_mutex since it
      doesn't reconfigure array.
      
      And it could cause deadlock problem for raid5 as follows:
      
      1. process A tried to reap sync thread with reconfig_mutex held after echo
         idle to sync_action.
      2. raid5 sync thread was blocked if there were too many active stripes.
      3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
         which causes the number of active stripes can't be decreased.
      4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
         to hold reconfig_mutex.
      
      More details in the link:
      https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
      
      And add one parameter to md_reap_sync_thread since it could be called by
      dm-raid which doesn't hold reconfig_mutex.
      Reported-and-tested-by: default avatarDonald Buczek <buczek@molgen.mpg.de>
      Signed-off-by: default avatarGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      8b48ec23
  4. 21 May, 2022 1 commit
  5. 20 May, 2022 1 commit
  6. 19 May, 2022 1 commit
    • Chaitanya Kulkarni's avatar
      nvme: set non-mdts limits in nvme_scan_work · 78288665
      Chaitanya Kulkarni authored
      In current implementation we set the non-mdts limits by calling
      nvme_init_non_mdts_limits() from nvme_init_ctrl_finish().
      This also tries to set the limits for the discovery controller which
      has no I/O queues resulting in the warning message reported by the
      nvme_log_error() when running blktest nvme/002: -
      
      [ 2005.155946] run blktests nvme/002 at 2022-04-09 16:57:47
      [ 2005.192223] loop: module loaded
      [ 2005.196429] nvmet: adding nsid 1 to subsystem blktests-subsystem-0
      [ 2005.200334] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
      
      <------------------------------SNIP---------------------------------->
      
      [ 2008.958108] nvmet: adding nsid 1 to subsystem blktests-subsystem-997
      [ 2008.962082] nvmet: adding nsid 1 to subsystem blktests-subsystem-998
      [ 2008.966102] nvmet: adding nsid 1 to subsystem blktests-subsystem-999
      [ 2008.973132] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN testhostnqn.
      *[ 2008.973196] nvme1: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE DNR*
      [ 2008.974595] nvme nvme1: new ctrl: "nqn.2014-08.org.nvmexpress.discovery"
      [ 2009.103248] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
      
      Move the call of nvme_init_non_mdts_limits() to nvme_scan_work() after
      we verify that I/O queues are created since that is a converging point
      for each transport where these limits are actually used.
      
      1. FC :
      nvme_fc_create_association()
       ...
       nvme_fc_create_io_queues(ctrl);
       ...
       nvme_start_ctrl()
        nvme_scan_queue()
         nvme_scan_work()
      
      2. PCIe:-
      nvme_reset_work()
       ...
       nvme_setup_io_queues()
        nvme_create_io_queues()
         nvme_alloc_queue()
       ...
       nvme_start_ctrl()
        nvme_scan_queue()
         nvme_scan_work()
      
      3. RDMA :-
      nvme_rdma_setup_ctrl
       ...
        nvme_rdma_configure_io_queues
        ...
        nvme_start_ctrl()
         nvme_scan_queue()
          nvme_scan_work()
      
      4. TCP :-
      nvme_tcp_setup_ctrl
       ...
        nvme_tcp_configure_io_queues
        ...
        nvme_start_ctrl()
         nvme_scan_queue()
          nvme_scan_work()
      
      * nvme_scan_work()
      ...
      nvme_validate_or_alloc_ns()
        nvme_alloc_ns()
         nvme_update_ns_info()
          nvme_update_disk_info()
           nvme_config_discard() <---
           blk_queue_max_write_zeroes_sectors() <---
      Signed-off-by: default avatarChaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      78288665
  7. 18 May, 2022 2 commits
    • Christoph Hellwig's avatar
      nvme: add support for TP4084 - Time-to-Ready Enhancements · 354201c5
      Christoph Hellwig authored
      Add support for using longer timeouts during controller initialization
      and letting the controller come up with namespaces that are not ready
      for I/O yet.  We skip these not ready namespaces during scanning and
      only bring them online once anoter scan is kicked off by the AEN that
      is set when the NRDY bit gets set in the  I/O Command Set Independent
      Identify Namespace Data Structure.   This asynchronous probing avoids
      blocking the kernel boot when controllers take a very long time to
      recover after unclean shutdowns (up to minutes).
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Reviewed-by: default avatarChaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      354201c5
    • Jens Axboe's avatar
      Merge tag 'nvme-5.19-2022-05-18' of git://git.infradead.org/nvme into for-5.19/drivers · da14f237
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "nvme updates for Linux 5.19
      
       - tighten the PCI presence check (Stefan Roese):
       - fix a potential NULL pointer dereference in an error path
         (Kyle Miller Smith)
       - fix interpretation of the DMRSL field (Tom Yan)
       - relax the data transfer alignment (Keith Busch)
       - verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni)
       - misc cleanups (Chaitanya Kulkarni, me)"
      
      * tag 'nvme-5.19-2022-05-18' of git://git.infradead.org/nvme:
        nvme: split the enum used for various register constants
        nvme-fabrics: add a request timeout helper
        nvme-pci: harden drive presence detect in nvme_dev_disable()
        nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags
        nvme: mark internal passthru request RQF_QUIET
        nvme: remove unneeded include from constants file
        nvme: add missing status values to verbose logging
        nvme: set dma alignment to dword
        nvme: fix interpretation of DMRSL
      da14f237
  8. 17 May, 2022 1 commit
  9. 16 May, 2022 9 commits
  10. 10 May, 2022 4 commits
  11. 04 May, 2022 5 commits
  12. 03 May, 2022 4 commits