    md-cluster: fix hanging issue while a new disk adding · fff42f21
    Heming Zhao authored
    The commit 1bbe254e ("md-cluster: check for timeout while a
    new disk adding") is syntactically correct but does not suit
    the real clustered code logic.
    
    When a timeout occurs while adding a new disk, if recv_daemon()
    bypasses the unlock for ack_lockres:CR, another node will keep waiting
    to grab the EX lock. This causes the cluster to hang indefinitely.
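
The hang follows from DLM mode compatibility: an EX request cannot be granted while another node still holds the same resource in CR. The sketch below is a minimal user-space model of that rule, not kernel code. `ack_res_sketch`, `ex_grantable_sketch()` and `cr_unlock_sketch()` are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of one lock resource: it only counts CR holders. */
struct ack_res_sketch {
    int cr_holders;
};

/* EX is exclusive: it is grantable only when no CR holder remains. */
bool ex_grantable_sketch(const struct ack_res_sketch *res)
{
    return res->cr_holders == 0;
}

/* A receiver dropping its CR lock after it has handled a message. */
void cr_unlock_sketch(struct ack_res_sketch *res)
{
    if (res->cr_holders > 0)
        res->cr_holders--;
}
```

If the receiver skips `cr_unlock_sketch()`, `ex_grantable_sketch()` never becomes true, which corresponds to the sender waiting forever for EX.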
    
    How to fix:
    
    1. In dlm_lock_sync(), change the wait behaviour from waiting forever
       to waiting with a timeout. This avoids the hang when another node
       fails to handle a cluster msg. A further result of this change is
       that if another node receives an unknown msg (e.g. a new msg_type),
       the old code will hang, whereas the new code will time out and fail.
       This could help cluster_md handle a new msg_type from different
       nodes with different kernel/module versions (e.g. the user only
       updates one leg's kernel and monitors the stability of the new
       kernel).
    2. The old code for __sendmsg() always returns 0 (success) by design,
       because it must successfully unlock ->message_lockres. This commit
       makes the function return an error number when an error occurs.
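
The two changes above can be sketched in user-space C as a bounded wait plus error propagation. This is only an illustration under stated assumptions, not the kernel implementation: `lock_res_sketch`, `lock_sync_sketch()`, `sendmsg_sketch()`, the polling loop and the choice of `-ETIMEDOUT` are all hypothetical stand-ins, and the actual patch may use a different wait primitive and error code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical stand-in for a lock resource; not the kernel's struct. */
struct lock_res_sketch {
    atomic_int granted;   /* set to 1 by whoever grants the lock */
};

/* Monotonic clock in milliseconds, used to compute the deadline. */
static long long now_ms_sketch(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Wait for the grant, but give up after timeout_ms instead of forever.
 * Returns 0 on success or -ETIMEDOUT when the deadline passes. */
int lock_sync_sketch(struct lock_res_sketch *res, int timeout_ms)
{
    long long deadline;
    struct timespec nap = { 0, 1000000 };   /* poll every 1 ms */

    assert(timeout_ms >= 0);
    deadline = now_ms_sketch() + timeout_ms;

    while (!atomic_load(&res->granted)) {
        if (now_ms_sketch() >= deadline)
            return -ETIMEDOUT;
        nanosleep(&nap, NULL);
    }
    return 0;
}

/* Sender side: report the failure to the caller instead of always
 * returning 0, and release the lock again on the success path. */
int sendmsg_sketch(struct lock_res_sketch *ack, int timeout_ms)
{
    int ret = lock_sync_sketch(ack, timeout_ms);

    if (ret)
        return ret;
    atomic_store(&ack->granted, 0);   /* release */
    return 0;
}
```

With this shape, a peer that never grants the lock makes `sendmsg_sketch()` return a negative error after the timeout, so the caller can fail the operation rather than block the whole cluster.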
    
    Fixes: 1bbe254e ("md-cluster: check for timeout while a new disk adding")
    Signed-off-by: Heming Zhao <heming.zhao@suse.com>
    Reviewed-by: Su Yue <glass.su@suse.com>
    Acked-by: Yu Kuai <yukuai3@huawei.com>
    Signed-off-by: Song Liu <song@kernel.org>
    Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@suse.com
md-cluster.c 44.4 KB