1. 12 Jul, 2024 1 commit
    • md-cluster: fix hanging issue while a new disk adding · fff42f21
      Heming Zhao authored
      Commit 1bbe254e ("md-cluster: check for timeout while a new disk
      adding") is syntactically correct but does not suit the real
      clustered code logic.
      
      When a timeout occurs while adding a new disk, if recv_daemon()
      bypasses the unlock for ack_lockres:CR, another node is left waiting
      to grab the EX lock. This causes the cluster to hang indefinitely.
      
      How to fix:
      
      1. In dlm_lock_sync(), change the wait behaviour from waiting forever
         to waiting with a timeout. This avoids the hang when another node
         fails to handle a cluster msg. A side effect of this change is
         that if another node receives an unknown msg (e.g. a new msg_type),
         the old code hangs, whereas the new code times out and fails.
         This could help cluster_md handle a new msg_type from nodes
         running different kernel/module versions (e.g. the user updates
         the kernel on only one leg and monitors the stability of the new
         kernel).
      2. The old code for __sendmsg() always returns 0 (success) by
         design (it must successfully unlock ->message_lockres). This commit
         makes the function return an error number when an error occurs.
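
      The two changes above can be modelled in user space as a rough
      sketch. This is not the kernel patch: lock_res, lock_sync_timeout()
      and send_msg() are hypothetical names, a polled flag stands in for
      the DLM completion callback, and the choice of -ETIMEDOUT as the
      error value is an assumption.

      ```c
      #define _POSIX_C_SOURCE 200809L
      #include <assert.h>
      #include <errno.h>
      #include <stdatomic.h>
      #include <stdbool.h>
      #include <time.h>

      /* Hypothetical stand-in for a lock resource; 'done' is set by
       * whoever completes the lock request (another thread or node). */
      struct lock_res {
          atomic_bool done;
      };

      /* Milliseconds elapsed since 'start' on the monotonic clock. */
      static long elapsed_ms(const struct timespec *start)
      {
          struct timespec now;

          clock_gettime(CLOCK_MONOTONIC, &now);
          return (now.tv_sec - start->tv_sec) * 1000L +
                 (now.tv_nsec - start->tv_nsec) / 1000000L;
      }

      /* Bounded wait: return 0 once the request completes, or
       * -ETIMEDOUT after timeout_ms instead of blocking forever. */
      static int lock_sync_timeout(struct lock_res *res, long timeout_ms)
      {
          const struct timespec tick = { 0, 1000000L }; /* poll every 1 ms */
          struct timespec start;

          assert(timeout_ms >= 0);
          clock_gettime(CLOCK_MONOTONIC, &start);
          while (!atomic_load(&res->done)) {
              if (elapsed_ms(&start) >= timeout_ms)
                  return -ETIMEDOUT;
              nanosleep(&tick, NULL);
          }
          return 0;
      }

      /* Hypothetical send path: propagate the lock failure to the
       * caller rather than unconditionally returning 0. */
      static int send_msg(struct lock_res *ack, long timeout_ms)
      {
          int err = lock_sync_timeout(ack, timeout_ms);

          if (err)
              return err; /* caller can fail the operation, not hang */
          /* ... the message would be written and acknowledged here ... */
          return 0;
      }
      ```

      With a peer that never sets 'done', send_msg() returns -ETIMEDOUT
      after the deadline, so the caller can abort the disk-add operation
      and release its own locks instead of waiting indefinitely.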
      
      Fixes: 1bbe254e ("md-cluster: check for timeout while a new disk adding")
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Reviewed-by: Su Yue <glass.su@suse.com>
      Acked-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@suse.com
  2. 10 Jul, 2024 4 commits
  3. 09 Jul, 2024 9 commits
  4. 08 Jul, 2024 4 commits
  5. 05 Jul, 2024 12 commits
  6. 04 Jul, 2024 9 commits
  7. 02 Jul, 2024 1 commit