1. 12 Jul, 2024 3 commits
    • md/raid1: set max_sectors during early return from choose_slow_rdev() · 36a5c03f
      Mateusz Jończyk authored
      Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
      when that drive has a write-mostly flag set. During such an attempt,
      the following assertion in bio_split() is hit:
      
      	BUG_ON(sectors <= 0);
      
      Call Trace:
      	? bio_split+0x96/0xb0
      	? exc_invalid_op+0x53/0x70
      	? bio_split+0x96/0xb0
      	? asm_exc_invalid_op+0x1b/0x20
      	? bio_split+0x96/0xb0
      	? raid1_read_request+0x890/0xd20
      	? __call_rcu_common.constprop.0+0x97/0x260
      	raid1_make_request+0x81/0xce0
      	? __get_random_u32_below+0x17/0x70
      	? new_slab+0x2b3/0x580
      	md_handle_request+0x77/0x210
      	md_submit_bio+0x62/0xa0
      	__submit_bio+0x17b/0x230
      	submit_bio_noacct_nocheck+0x18e/0x3c0
      	submit_bio_noacct+0x244/0x670
      
      After investigation, it turned out that choose_slow_rdev() does not set
      the value of max_sectors in some cases. Because of this,
      raid1_read_request() calls bio_split() with sectors == 0.
      
      Fix it by filling in this variable.
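
      The failure mode described above can be illustrated with a small
      user-space model. This is only a sketch under stated assumptions, not
      the kernel code: struct model_rdev, model_choose_slow() and
      model_split_len() are hypothetical names, and the caller is assumed to
      start with max_sectors == 0 and to split whenever max_sectors is
      smaller than the request.

      ```c
      #include <stdbool.h>
      #include <stddef.h>

      #define MODEL_NO_SPLIT (-1)	/* request fits, no split needed */
      #define MODEL_NO_DISK  (-2)	/* no usable device found */

      /* Hypothetical, simplified stand-in for a RAID1 member device. */
      struct model_rdev {
      	bool present;		/* device slot is populated */
      	bool write_mostly;	/* device carries the write-mostly flag */
      	int readable;		/* sectors readable before the first bad block */
      };

      /*
       * Model of the "slow device" fallback. Returns the chosen disk index
       * or -1. With fixed == false the early-return path forgets to fill
       * in *max_sectors, which is the behaviour this commit describes.
       */
      static int model_choose_slow(const struct model_rdev *devs, size_t n,
      			     int wanted, int *max_sectors, bool fixed)
      {
      	int best = -1, best_len = 0;

      	for (size_t i = 0; i < n; i++) {
      		if (!devs[i].present || !devs[i].write_mostly)
      			continue;
      		if (devs[i].readable >= wanted) {
      			/* whole request is readable: early return */
      			if (fixed)
      				*max_sectors = wanted;
      			return (int)i;
      		}
      		if (devs[i].readable > best_len) {
      			best = (int)i;
      			best_len = devs[i].readable;
      		}
      	}
      	if (best != -1)
      		*max_sectors = best_len;
      	return best;
      }

      /*
       * Model of the caller: returns the length it would hand to a
       * bio_split()-like helper, MODEL_NO_SPLIT if the request fits,
       * or MODEL_NO_DISK if nothing can serve the read.
       */
      static int model_split_len(const struct model_rdev *devs, size_t n,
      			   int wanted, bool fixed)
      {
      	int max_sectors = 0;	/* assumption: starts out as 0 */
      	int disk = model_choose_slow(devs, n, wanted, &max_sectors, fixed);

      	if (disk < 0)
      		return MODEL_NO_DISK;
      	if (max_sectors < wanted)
      		return max_sectors;	/* split requested with this length */
      	return MODEL_NO_SPLIT;
      }
      ```

      In this model, a single write-mostly device that can serve the whole
      request yields a split length of 0 with fixed == false (the value that
      would trip a BUG_ON(sectors <= 0) check) and MODEL_NO_SPLIT with
      fixed == true.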
      
      This bug was introduced in
      commit dfa8ecd1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
      but apparently hidden until
      commit 0091c5a2 ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
      shortly thereafter.
      
      Cc: stable@vger.kernel.org # 6.9.x+
      Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
      Fixes: dfa8ecd1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
      Cc: Song Liu <song@kernel.org>
      Cc: Yu Kuai <yukuai3@huawei.com>
      Cc: Paul Luse <paul.e.luse@linux.intel.com>
      Cc: Xiao Ni <xni@redhat.com>
      Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
      Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
      
      --
      
      Tested on both Linux 6.10 and 6.9.8.
      
      Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
      	./test --dev=loop --no-error --raidtype=raid1
      (on 6.9.8 there was one failure, caused by external bitmap support not
      compiled in).
      
      Notes:
      - I was reliably getting deadlocks when adding / removing devices
        on such an array - while the array was loaded with fsstress with 20
        concurrent processes. When the array was idle or loaded with fsstress
        with 8 processes, no such deadlocks happened in my tests.
        This occurred also on unpatched Linux 6.8.0 though, but not on
        6.1.97-rc1, so this is likely an independent regression (to be
        investigated).
      - I was also getting deadlocks when adding / removing the bitmap on the
        array under similar conditions, though this also happened on Linux
        6.1.97-rc1. fsstress with 8 concurrent processes caused it only
        once across many tests.
      - in my testing, there was once a problem with hot adding an
        internal bitmap to the array:
      	mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
      	mdadm: failed to set internal bitmap.
        even though no such reshaping was happening according to /proc/mdstat.
        This seems unrelated, though.
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20240711202316.10775-1-mat.jonczyk@o2.pl
    • md-cluster: fix no recovery job when adding/re-adding a disk · 35a0a409
      Heming Zhao authored
      The commit db5e653d ("md: delay choosing sync action to
      md_start_sync()") delays the start of the sync action. In a
      clustered environment, this will cause another node to first
      activate the spare disk and skip recovery. As a result, no
      nodes will perform recovery when a disk is added or re-added.
      
      Before db5e653d:
      
      ```
         node1                                node2
      ----------------------------------------------------------------
      md_check_recovery
       + md_update_sb
       |  sendmsg: METADATA_UPDATED
       + md_choose_sync_action           process_metadata_update
       |  remove_and_add_spares           //node1 has not finished adding
       + call mddev->sync_work            //the spare disk:do nothing
      
      md_start_sync
       starts md_do_sync
      
      md_do_sync
       + grabbed resync_lockres:DLM_LOCK_EX
       + do syncing job
      
      md_check_recovery
       sendmsg: METADATA_UPDATED
                                       process_metadata_update
                                         //activate spare disk
      
                                       ... ...
      
                                       md_do_sync
                                        waiting to grab resync_lockres:EX
      ```
      
      After db5e653d:
      
      (note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery
      starts md_start_sync, setting the INTR action will exacerbate the
      delay in node1 calling the md_do_sync function.)
      
      ```
         node1                                node2
      ----------------------------------------------------------------
      md_check_recovery
       + md_update_sb
       |  sendmsg: METADATA_UPDATED
       + calls mddev->sync_work         process_metadata_update
                                         //node1 has not finished adding
                                         //the spare disk:do nothing
      
      md_start_sync
       + md_choose_sync_action
       |  remove_and_add_spares
       + calls md_do_sync
      
      md_check_recovery
       md_update_sb
        sendmsg: METADATA_UPDATED
                                        process_metadata_update
                                          //activate spare disk
      
        ... ...                         ... ...
      
                                        md_do_sync
                                         + grabbed resync_lockres:EX
                                         + raid1_sync_request skip sync under
      				     conf->fullsync:0
      md_do_sync
       1. waiting to grab resync_lockres:EX
       2. when node1 could grab EX lock,
          node1 will skip resync under recovery_offset:MaxSector
      ```
      
      How to trigger:
      
      ```
       # commands run on node1
       # lower the sync speed so the recovery status is easy to watch
      echo 2000 > /proc/sys/dev/raid/speed_limit_max
      ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max"
      
      mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean
      ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb
      mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
      mdadm --manage /dev/md0 --add /dev/sdc
      
      === "cat /proc/mdstat" on both nodes shows no recovery action. ===
      ```
      
      How to fix:
      
      Because the md layer logic is hard to restore so that the sync job
      starts sooner on the local node, we add a new cluster msg that makes
      the other node hold off on activating the disk.
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Reviewed-by: Su Yue <glass.su@suse.com>
      Acked-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com
    • md-cluster: fix hanging issue while a new disk adding · fff42f21
      Heming Zhao authored
      The commit 1bbe254e ("md-cluster: check for timeout while a
      new disk adding") is syntactically correct but does not
      suit the real clustered code logic.
      
      When a timeout occurs while adding a new disk, if recv_daemon()
      bypasses the unlock for ack_lockres:CR, another node will be waiting
      to grab EX lock. This will cause the cluster to hang indefinitely.
      
      How to fix:
      
      1. In dlm_lock_sync(), change the wait behaviour from forever to a
         timeout. This avoids the hanging issue when another node
         fails to handle a cluster msg. Another result of this change is
         that if another node receives an unknown msg (e.g. a new msg_type),
         the old code hangs, whereas the new code times out and fails.
         This could help cluster_md handle a new msg_type from different
         nodes with different kernel/module versions (e.g. when the user
         updates only one leg's kernel and monitors the stability of the
         new kernel).
      2. The old code for __sendmsg() always returns 0 (success) under the
         design (must successfully unlock ->message_lockres). This commit
         makes this function return an error number when an error occurs.
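
      The two changes above can be sketched as a deterministic user-space
      model. It is an illustration under stated assumptions, not the
      md-cluster code: struct model_peer, model_lock_sync() and
      model_sendmsg() are hypothetical names, and the timeout is modelled
      as a poll budget rather than a real timer.

      ```c
      #include <errno.h>
      #include <stdbool.h>

      /* Hypothetical peer: polls that pass before it acks (< 0 means never). */
      struct model_peer {
      	int polls_until_ack;
      };

      /* One poll of the peer; true once the peer has acknowledged. */
      static bool model_peer_acked(struct model_peer *p)
      {
      	if (p->polls_until_ack < 0)
      		return false;	/* peer never handles the msg */
      	if (p->polls_until_ack == 0)
      		return true;
      	p->polls_until_ack--;
      	return false;
      }

      /* Bounded wait: 0 on ack, -ETIMEDOUT once the poll budget is used up. */
      static int model_lock_sync(struct model_peer *p, int max_polls)
      {
      	for (int i = 0; i < max_polls; i++)
      		if (model_peer_acked(p))
      			return 0;
      	return -ETIMEDOUT;
      }

      /*
       * Send path that reports the failure to its caller while still
       * releasing its own message lock on every path.
       */
      static int model_sendmsg(struct model_peer *p, int max_polls,
      			 bool *msg_lock_held)
      {
      	int rc;

      	*msg_lock_held = true;			/* take the message lock */
      	rc = model_lock_sync(p, max_polls);	/* bounded wait for the peer */
      	*msg_lock_held = false;			/* always unlock, even on error */
      	return rc;				/* the old model would return 0 here */
      }
      ```

      With a peer that never acknowledges, model_sendmsg() returns
      -ETIMEDOUT instead of blocking, and the caller can decide how to fail.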
      
      Fixes: 1bbe254e ("md-cluster: check for timeout while a new disk adding")
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Reviewed-by: Su Yue <glass.su@suse.com>
      Acked-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@suse.com
  2. 10 Jul, 2024 4 commits
  3. 09 Jul, 2024 9 commits
  4. 08 Jul, 2024 4 commits
  5. 05 Jul, 2024 12 commits
  6. 04 Jul, 2024 8 commits