    md-cluster: fix no recovery job when adding/re-adding a disk · 35a0a409
    Heming Zhao authored
    Commit db5e653d ("md: delay choosing sync action to
    md_start_sync()") delays the start of the sync action. In a
    clustered environment, this lets another node activate the spare
    disk first and skip recovery. As a result, no node performs
    recovery when a disk is added or re-added.
    
    Before db5e653d:
    
    ```
       node1                                node2
    ----------------------------------------------------------------
    md_check_recovery
     + md_update_sb
     |  sendmsg: METADATA_UPDATED
     + md_choose_sync_action           process_metadata_update
     |  remove_and_add_spares           //node1 has not finished adding
     + call mddev->sync_work            //the spare disk:do nothing
    
    md_start_sync
     starts md_do_sync
    
    md_do_sync
     + grabbed resync_lockres:DLM_LOCK_EX
     + do syncing job
    
    md_check_recovery
     sendmsg: METADATA_UPDATED
                                     process_metadata_update
                                       //activate spare disk
    
                                     ... ...
    
                                     md_do_sync
                                      waiting to grab resync_lockres:EX
    ```
    
    After db5e653d:
    
    (Note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery
    starts md_start_sync, the interruption further delays node1's call
    to md_do_sync.)
    
    ```
       node1                                node2
    ----------------------------------------------------------------
    md_check_recovery
     + md_update_sb
     |  sendmsg: METADATA_UPDATED
     + calls mddev->sync_work         process_metadata_update
                                       //node1 has not finished adding
                                       //the spare disk:do nothing
    
    md_start_sync
     + md_choose_sync_action
     |  remove_and_add_spares
     + calls md_do_sync
    
    md_check_recovery
     md_update_sb
      sendmsg: METADATA_UPDATED
                                      process_metadata_update
                                        //activate spare disk
    
      ... ...                         ... ...
    
                                      md_do_sync
                                       + grabbed resync_lockres:EX
                                       + raid1_sync_request skip sync under
                                        conf->fullsync:0
    md_do_sync
     1. waiting to grab resync_lockres:EX
     2. when node1 finally grabs the EX lock,
        it skips resync under recovery_offset:MaxSector
    ```
    
    How to trigger:
    
    ```(commands @node1)
     # to easily watch the recovery status
    echo 2000 > /proc/sys/dev/raid/speed_limit_max
    ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max"
    
    mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean
    ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb
    mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
    mdadm --manage /dev/md0 --add /dev/sdc
    
    === "cat /proc/mdstat" on both node, there are no recovery action. ===
    ```
    
    How to fix:
    
    Because it is hard to restore the md-layer logic so that the local
    node starts its sync job first, we add a new cluster message that
    makes the other node hold off activating the disk.
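
    The sketch below is only a stand-alone user-space model of that
    idea, not the kernel change itself: the message names
    (MSG_SUSPEND_ACTIVATION, MSG_RESUME_ACTIVATION) and the helpers are
    hypothetical. It shows the intended ordering: the peer buffers the
    metadata update and activates the spare only after the adding node
    signals that it is ready.

    ```
    /* Hypothetical model of the message ordering; not md-cluster code. */
    #include <stdbool.h>
    #include <stdio.h>

    enum cluster_msg_type {
            MSG_METADATA_UPDATED,   /* existing message: superblock changed */
            MSG_SUSPEND_ACTIVATION, /* hypothetical: defer spare activation */
            MSG_RESUME_ACTIVATION,  /* hypothetical: adding node is ready   */
    };

    struct node_state {
            const char *name;
            bool activation_suspended;
            bool metadata_pending;
    };

    /*
     * Activate the spare only when a metadata update is pending and the
     * adding node no longer asks us to wait.
     */
    static void try_activate(struct node_state *node)
    {
            if (node->metadata_pending && !node->activation_suspended) {
                    node->metadata_pending = false;
                    printf("%s: activates spare disk\n", node->name);
            }
    }

    static void node_recv(struct node_state *node, enum cluster_msg_type msg)
    {
            switch (msg) {
            case MSG_SUSPEND_ACTIVATION:
                    node->activation_suspended = true;
                    break;
            case MSG_METADATA_UPDATED:
                    node->metadata_pending = true;
                    break;
            case MSG_RESUME_ACTIVATION:
                    node->activation_suspended = false;
                    break;
            }
            try_activate(node);
    }

    int main(void)
    {
            struct node_state node2 = { .name = "node2" };

            /* node1 adds a disk: fence the peer before the metadata update. */
            node_recv(&node2, MSG_SUSPEND_ACTIVATION);
            node_recv(&node2, MSG_METADATA_UPDATED);   /* deferred */

            /* node1 has chosen its sync action and queued recovery. */
            node_recv(&node2, MSG_RESUME_ACTIVATION);  /* activates now */
            return 0;
    }
    ```
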
    Signed-off-by: Heming Zhao <heming.zhao@suse.com>
    Reviewed-by: Su Yue <glass.su@suse.com>
    Acked-by: Yu Kuai <yukuai3@huawei.com>
    Signed-off-by: Song Liu <song@kernel.org>
    Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com