1. 25 Apr, 2022 5 commits
    • md/bitmap: don't set sb values if can't pass sanity check · e68cb83a
      Heming Zhao authored
      If the bitmap area contains invalid data, the kernel will crash and
      mdadm will report a "Segmentation fault".
      This is a cluster-md specific bug. In a non-clustered environment,
      mdadm handles the broken-metadata case; in a clustered array, only
      kernel space handles the bitmap slot info. But even though this bug
      only occurs in a clustered environment, the current sanity check is
      wrong and the code should be changed.
      
      How to trigger (fault injection):
      
      dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sda
      dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sdb
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb
      mdadm -Ss
      echo aaa > magic.txt
       == the commands below modify the slot 2 bitmap data ==
      dd if=magic.txt of=/dev/sda seek=16384 bs=1 count=3 <== destroy magic
      dd if=/dev/zero of=/dev/sda seek=16436 bs=1 count=4 <== ZERO chunksize
      mdadm -A /dev/md0 /dev/sda /dev/sdb
       == kernel crashes. mdadm outputs "Segmentation fault" ==
      
      Reason for the kernel crash:
      
      In md_bitmap_read_sb() (called by md_bitmap_create()), a bad bitmap
      magic did not block the chunksize assignment, and the zero value made
      DIV_ROUND_UP_SECTOR_T() trigger a "divide error".
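      
      For illustration, a minimal sketch of the intended ordering in
      md_bitmap_read_sb(), assuming the usual read-side layout (the
      'reason'/'out' error path and exact field names are simplified):
      
      	/* Sanity-check on-disk values *before* committing them to the
      	 * in-core bitmap: a zero chunksize must never reach
      	 * DIV_ROUND_UP_SECTOR_T(), or the CPU takes a divide error. */
      	chunksize = le32_to_cpu(sb->chunksize);
      	if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) {
      		reason = "bad magic";
      		goto out;	/* do not fall through to the assignments */
      	}
      	if (!chunksize || !is_power_of_2(chunksize)) {
      		reason = "bitmap chunksize not a power of 2";
      		goto out;
      	}
      	/* only now is it safe to commit the value */
      	bitmap->mddev->bitmap_info.chunksize = chunksize;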
      
      Crash log:
      
      kernel: md: md0 stopped.
      kernel: md/raid1:md0: not clean -- starting background reconstruction
      kernel: md/raid1:md0: active with 2 out of 2 mirrors
      kernel: dlm: ... ...
      kernel: md-cluster: Joined cluster 44810aba-38bb-e6b8-daca-bc97a0b254aa slot 1
      kernel: md0: invalid bitmap file superblock: bad magic
      kernel: md_bitmap_copy_from_slot can't get bitmap from slot 2
      kernel: md-cluster: Could not gather bitmaps from slot 2
      kernel: divide error: 0000 [#1] SMP NOPTI
      kernel: CPU: 0 PID: 1603 Comm: mdadm Not tainted 5.14.6-1-default
      kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
      kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod]
      kernel: RSP: 0018:ffffc22ac0843ba0 EFLAGS: 00010246
      kernel: ... ...
      kernel: Call Trace:
      kernel:  ? dlm_lock_sync+0xd0/0xd0 [md_cluster 77fe..7a0]
      kernel:  md_bitmap_copy_from_slot+0x2c/0x290 [md_mod 24ea..d3a]
      kernel:  load_bitmaps+0xec/0x210 [md_cluster 77fe..7a0]
      kernel:  md_bitmap_load+0x81/0x1e0 [md_mod 24ea..d3a]
      kernel:  do_md_run+0x30/0x100 [md_mod 24ea..d3a]
      kernel:  md_ioctl+0x1290/0x15a0 [md_mod 24ea....d3a]
      kernel:  ? mddev_unlock+0xaa/0x130 [md_mod 24ea..d3a]
      kernel:  ? blkdev_ioctl+0xb1/0x2b0
      kernel:  block_ioctl+0x3b/0x40
      kernel:  __x64_sys_ioctl+0x7f/0xb0
      kernel:  do_syscall_64+0x59/0x80
      kernel:  ? exit_to_user_mode_prepare+0x1ab/0x230
      kernel:  ? syscall_exit_to_user_mode+0x18/0x40
      kernel:  ? do_syscall_64+0x69/0x80
      kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
      kernel: RIP: 0033:0x7f4a15fa722b
      kernel: ... ...
      kernel: ---[ end trace 8afa7612f559c868 ]---
      kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod]
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: fix an incorrect NULL check in md_reload_sb · 64c54d92
      Xiaomeng Tong authored
      The bug is here:
      	if (!rdev || rdev->desc_nr != nr) {
      
      The list iterator value 'rdev' will *always* be set and non-NULL
      by rdev_for_each_rcu(), so it is incorrect to assume that the
      iterator value will be NULL if the list is empty or no element is
      found (in fact, it will be a bogus pointer computed from the list
      HEAD). The check can therefore be bypassed, leading to an invalid
      memory access.
      
      To fix the bug, use a new variable 'iter' as the list iterator,
      while using the original variable 'rdev' as a dedicated pointer to
      the found element, as sketched below.
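      
      A minimal sketch of that scheme (surrounding md_reload_sb() context
      elided):
      
      	struct md_rdev *rdev = NULL, *iter;
      
      	/* walk the list with 'iter'; remember the match in 'rdev'
      	 * instead of trusting the iterator after the loop ends */
      	rdev_for_each_rcu(iter, mddev) {
      		if (iter->desc_nr == nr) {
      			rdev = iter;
      			break;
      		}
      	}
      
      	/* now NULL really does mean "not found" */
      	if (!rdev)
      		return;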
      
      Cc: stable@vger.kernel.org
      Fixes: 70bcecdb ("md-cluster: Improve md_reload_sb to be less error prone")
      Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: fix an incorrect NULL check in does_sb_need_changing · fc873834
      Xiaomeng Tong authored
      The bug is here:
      	if (!rdev)
      
      The list iterator value 'rdev' will *always* be set and non-NULL
      by rdev_for_each(), so it is incorrect to assume that the iterator
      value will be NULL if the list is empty or no element is found. The
      NULL check can therefore be bypassed, leading to an invalid memory
      access.
      
      To fix the bug, use a new variable 'iter' as the list iterator,
      while using the original variable 'rdev' as a dedicated pointer to
      the found element.
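      
      To see why the iterator can never be NULL after the loop, here is a
      simplified expansion of rdev_for_each() (the real macro goes through
      list_for_each_entry() and adds type-checking helpers):
      
      	/* rdev_for_each(rdev, mddev) expands to roughly: */
      	for (rdev = list_first_entry(&mddev->disks, struct md_rdev, same_set);
      	     &rdev->same_set != &mddev->disks;
      	     rdev = list_next_entry(rdev, same_set)) {
      		/* loop body */
      	}
      	/* after a full (break-less) walk, rdev ends up as
      	 * container_of(&mddev->disks, struct md_rdev, same_set):
      	 * non-NULL, but not a valid rdev */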
      
      Cc: stable@vger.kernel.org
      Fixes: 2aa82191 ("md-cluster: Perform a lazy update")
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
      Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • raid5: introduce MD_BROKEN · 57668f0a
      Mariusz Tkaczyk authored
      The raid456 module used to allow the array to reach the failed state.
      That behaviour was removed by fb73b357 ("raid5: block failing device
      if raid will be failed"), but the change introduced a bug: now, if
      raid5 fails during IO, it may end up with a hung task that never
      completes. The Faulty flag on the device is necessary to process all
      requests and is checked many times, mainly in analyse_stripe().
      Allow setting Faulty on a drive again, and set MD_BROKEN if the raid
      becomes failed.
      
      As a result, this level is allowed to reach the failed state again,
      but communication with userspace (via the -EBUSY status) is
      preserved.
      
      This restores the ability to fail the array via the
      'mdadm --set-faulty' command; the remaining gap will be closed by
      additional verification on the mdadm side.
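      
      A minimal sketch of the direction taken in the error handler (locking,
      degraded-count bookkeeping and pr_crit() reporting elided):
      
      	static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
      	{
      		struct r5conf *conf = mddev->private;
      
      		/* always mark the device Faulty so in-flight stripes
      		 * can be processed to completion */
      		set_bit(Faulty, &rdev->flags);
      		clear_bit(In_sync, &rdev->flags);
      
      		/* instead of blocking the failure, record that the
      		 * array as a whole is now broken */
      		if (has_failed(conf))
      			set_bit(MD_BROKEN, &conf->mddev->flags);
      
      		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
      		set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
      	}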
      
      Reproduction steps:
       mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
       mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
       mkfs.xfs /dev/md126 -f
       mount /dev/md126 /mnt/root/
      
       fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw
      --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4
      --time_based --group_reporting --name=throughput-test-job
      --eta-newline=1 &
      
       echo 1 > /sys/block/nvme2n1/device/device/remove
       echo 1 > /sys/block/nvme1n1/device/device/remove
      
       [ 1475.787779] Call Trace:
       [ 1475.793111] __schedule+0x2a6/0x700
       [ 1475.799460] schedule+0x38/0xa0
       [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
       [ 1475.813856] ? finish_wait+0x80/0x80
       [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
       [ 1475.828281] ? finish_wait+0x80/0x80
       [ 1475.834727] ? finish_wait+0x80/0x80
       [ 1475.841127] ? finish_wait+0x80/0x80
       [ 1475.847480] md_handle_request+0x119/0x190
       [ 1475.854390] md_make_request+0x8a/0x190
       [ 1475.861041] generic_make_request+0xcf/0x310
       [ 1475.868145] submit_bio+0x3c/0x160
       [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
       [ 1475.882070] iomap_dio_bio_actor+0x175/0x390
       [ 1475.889149] iomap_apply+0xff/0x310
       [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
       [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
       [ 1475.909974] iomap_dio_rw+0x2f2/0x490
       [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
       [ 1475.923680] ? atime_needs_update+0x77/0xe0
       [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
       [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
       [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
       [ 1475.953403] aio_read+0xd5/0x180
       [ 1475.959395] ? _cond_resched+0x15/0x30
       [ 1475.965907] io_submit_one+0x20b/0x3c0
       [ 1475.972398] __x64_sys_io_submit+0xa2/0x180
       [ 1475.979335] ? do_io_getevents+0x7c/0xc0
       [ 1475.986009] do_syscall_64+0x5b/0x1a0
       [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
       [ 1476.000255] RIP: 0033:0x7f11fc27978d
       [ 1476.006631] Code: Bad RIP value.
       [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.
      
      Cc: stable@vger.kernel.org
      Fixes: fb73b357 ("raid5: block failing device if raid will be failed")
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: Set MD_BROKEN for RAID1 and RAID10 · 9631abdb
      Mariusz Tkaczyk authored
      There is no direct mechanism to determine raid failure outside the
      personality. It is done by checking rdev->flags after executing
      md_error(). If the Faulty flag is not set, then -EBUSY is returned to
      userspace. -EBUSY means that the array will be failed after the drive
      removal.
      
      Mdadm has a special routine to handle array failure, and it is
      executed if -EBUSY is returned by md.
      
      There are at least two known reasons not to consider this mechanism
      correct:
      1. The drive can be removed even if the array will be failed [1].
      2. -EBUSY seems to be the wrong status. The array is not busy, but
         the removal process cannot proceed safely.
      
      The -EBUSY expectation cannot be removed without breaking
      compatibility with userspace. In this patch the first issue is
      resolved by adding support for the MD_BROKEN flag for RAID1 and
      RAID10. Support for RAID456 is added in the next commit.
      
      The idea is to set MD_BROKEN when we are sure that the raid is now in
      the failed state. This is done in each error_handler(). The
      md_error() path then checks the MD_BROKEN flag; if it is set, -EBUSY
      is returned to userspace, as sketched below.
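      
      A minimal sketch of the md core side of that contract, as seen from
      the sysfs "faulty" write path in md.c (simplified):
      
      	} else if (cmd_match(buf, "faulty")) {
      		md_error(rdev->mddev, rdev);
      
      		/* the personality's error_handler() has run by now;
      		 * MD_BROKEN means the array itself is failed */
      		if (test_bit(MD_BROKEN, &rdev->mddev->flags))
      			err = -EBUSY;	/* preserved userspace ABI */
      		else
      			err = 0;
      	}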
      
      As in the previous commit, this means that 'mdadm --set-faulty' is
      able to fail the array. The previously proposed workaround remains
      valid if the optional functionality [1] is disabled.
      
      [1] commit 9a567843 ("md: allow last device to be forcibly removed
          from RAID1/RAID10.")
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
      Signed-off-by: Song Liu <song@kernel.org>
  2. 18 Apr, 2022 35 commits