    nvme: fix deadlock in disconnect during scan_work and/or ana_work · ecca390e
    Sagi Grimberg authored
    A deadlock happens in the following scenario with multipath:
    1) scan_work(nvme0) detects a new nsid while nvme0
        is an optimized path to it; path nvme1 happens to be
        inaccessible.
    
    2) Before scan_work is complete, an nvme0 disconnect is initiated:
        nvme_delete_ctrl_sync() sets nvme0's state to NVME_CTRL_DELETING.
    
    3) scan_work (from step 1) attempts to submit IO,
        but nvme_path_is_optimized() observes that nvme0 is not LIVE.
        Since nvme1 is still a possible path, the IO is requeued and
        scan_work hangs (sketched below, after step 4).
    
    --
    Workqueue: nvme-wq nvme_scan_work [nvme_core]
    kernel: Call Trace:
    kernel:  __schedule+0x2b9/0x6c0
    kernel:  schedule+0x42/0xb0
    kernel:  io_schedule+0x16/0x40
    kernel:  do_read_cache_page+0x438/0x830
    kernel:  read_cache_page+0x12/0x20
    kernel:  read_dev_sector+0x27/0xc0
    kernel:  read_lba+0xc1/0x220
    kernel:  efi_partition+0x1e6/0x708
    kernel:  check_partition+0x154/0x244
    kernel:  rescan_partitions+0xae/0x280
    kernel:  __blkdev_get+0x40f/0x560
    kernel:  blkdev_get+0x3d/0x140
    kernel:  __device_add_disk+0x388/0x480
    kernel:  device_add_disk+0x13/0x20
    kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
    kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
    kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
    kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
    kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
    kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
    kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
    kernel:  process_one_work+0x1db/0x380
    kernel:  worker_thread+0x249/0x400
    kernel:  kthread+0x104/0x140
    --
    
    4) Delete also hangs in flush_work(ctrl->scan_work)
        from nvme_remove_namespaces().
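
    For reference, the requeue decision in step 3 looks roughly like the
    following. This is a condensed, illustrative fragment of the multipath
    bio submission path (helper and field names follow the in-tree
    drivers/nvme/host/multipath.c, but this is not the verbatim hunk):

      ns = nvme_find_path(head);              /* only LIVE paths qualify */
      if (ns) {
              /* normal fast path: issue the bio to the chosen path */
      } else if (nvme_available_path(head)) {
              /*
               * Some controller on head->list looks like it may become
               * usable again, so the bio is parked instead of failed.
               * With nvme0 DELETING and nvme1 inaccessible this never
               * makes progress, and scan_work blocks in io_schedule().
               */
              spin_lock_irq(&head->requeue_lock);
              bio_list_add(&head->requeue_list, bio);
              spin_unlock_irq(&head->requeue_lock);
      } else {
              bio_io_error(bio);              /* no path will ever come back */
      }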
    
    Similarly, a deadlock with ana_work may happen: if ana_work has started
    and calls nvme_mpath_set_live and device_add_disk, it will
    trigger I/O. When we then trigger disconnect, that I/O blocks because
    our accessible (optimized) path is disconnecting while the alternate
    path is inaccessible. Disconnect then tries to flush
    ana_work and hangs.
    
    [  605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
    [  605.552087] Call Trace:
    [  605.552683]  __schedule+0x2b9/0x6c0
    [  605.553507]  schedule+0x42/0xb0
    [  605.554201]  io_schedule+0x16/0x40
    [  605.555012]  do_read_cache_page+0x438/0x830
    [  605.556925]  read_cache_page+0x12/0x20
    [  605.557757]  read_dev_sector+0x27/0xc0
    [  605.558587]  amiga_partition+0x4d/0x4c5
    [  605.561278]  check_partition+0x154/0x244
    [  605.562138]  rescan_partitions+0xae/0x280
    [  605.563076]  __blkdev_get+0x40f/0x560
    [  605.563830]  blkdev_get+0x3d/0x140
    [  605.564500]  __device_add_disk+0x388/0x480
    [  605.565316]  device_add_disk+0x13/0x20
    [  605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
    [  605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
    [  605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
    [  605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
    [  605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
    [  605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
    [  605.573330]  process_one_work+0x1db/0x380
    [  605.574144]  worker_thread+0x4d/0x400
    [  605.574896]  kthread+0x104/0x140
    [  605.577205]  ret_from_fork+0x35/0x40
    [  605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
    [  605.579239]       Tainted: G           OE     5.3.5-050305-generic #201910071830
    [  605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [  605.582320] nvme            D    0 14044  14043 0x00000000
    [  605.583424] Call Trace:
    [  605.583935]  __schedule+0x2b9/0x6c0
    [  605.584625]  schedule+0x42/0xb0
    [  605.585290]  schedule_timeout+0x203/0x2f0
    [  605.588493]  wait_for_completion+0xb1/0x120
    [  605.590066]  __flush_work+0x123/0x1d0
    [  605.591758]  __cancel_work_timer+0x10e/0x190
    [  605.593542]  cancel_work_sync+0x10/0x20
    [  605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
    [  605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
    [  605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
    [  605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
    [  605.598320]  dev_attr_store+0x17/0x30
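
    Condensed, the circular wait between the two contexts looks like this
    (function names are taken from the traces above; the bodies are
    illustrative only, not the exact upstream code):

      /* disconnect side: nvme_do_delete_ctrl -> nvme_stop_ctrl */
      void nvme_mpath_stop(struct nvme_ctrl *ctrl)
      {
              /* ... */
              cancel_work_sync(&ctrl->ana_work);  /* waits for ana_work */
      }

      /* ana_work side */
      static void nvme_ana_work(struct work_struct *work)
      {
              struct nvme_ctrl *ctrl = container_of(work, struct nvme_ctrl,
                                                    ana_work);

              nvme_read_ana_log(ctrl);
              /*
               * -> nvme_update_ana_state -> nvme_mpath_set_live
               * -> device_add_disk -> partition scan I/O, which is requeued
               * because the only accessible path is the one being deleted.
               * ana_work never completes, so cancel_work_sync() above never
               * returns.
               */
      }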
    
    Fix this by introducing a new state: NVME_CTRL_DELETING_NOIO, which
    indicates the phase of controller deletion where I/O can no longer be
    allowed to access the namespace. NVME_CTRL_DELETING still allows mpath
    I/O to be issued to the bottom device, and only after we flush the
    ana_work and scan_work (after nvme_stop_ctrl and
    nvme_prep_remove_namespaces) do we change the state to
    NVME_CTRL_DELETING_NOIO. We also prevent ana_work from re-firing by
    aborting early if we are not LIVE, so we should be safe here.
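
    In sketch form (simplified; the path-check fragment below is one way
    to express "DELETING still serves mpath I/O", not necessarily the
    exact hunk):

      enum nvme_ctrl_state {
              /* ... */
              NVME_CTRL_DELETING,
              NVME_CTRL_DELETING_NOIO,   /* new: deleting, no I/O allowed */
              /* ... */
      };

      /* multipath path checks: a DELETING controller can still serve
       * requeued I/O, a DELETING_NOIO one cannot */
      if (ns->ctrl->state != NVME_CTRL_LIVE &&
          ns->ctrl->state != NVME_CTRL_DELETING)
              return true;                /* treat path as disabled */

      /* ana_work: abort early so it cannot re-fire during deletion */
      if (ctrl->state != NVME_CTRL_LIVE)
              return;

      /* deletion: switch to NOIO only after both works are flushed */
      nvme_stop_ctrl(ctrl);               /* flushes ana_work */
      /* ... flush scan_work ... */
      nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING_NOIO);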
    
    In addition, change the transport drivers to follow the updated state
    machine.
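
    For example, where a transport treats a failed transition to
    NVME_CTRL_CONNECTING as "controller delete already started", it now
    has to accept either deleting state (illustrative fragment modeled on
    the error-recovery handlers, not the exact diff):

      if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
              /* state change failure is ok if we started ctrl delete */
              WARN_ON_ONCE(ctrl->ctrl.state != NVME_CTRL_DELETING &&
                           ctrl->ctrl.state != NVME_CTRL_DELETING_NOIO);
              return;
      }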
    
    Fixes: 0d0b660f ("nvme: add ANA support")
    Reported-by: Anton Eidelman <anton@lightbitslabs.com>
    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>