1. 25 Apr, 2022 11 commits
    • Logan Gunthorpe's avatar
      md/raid5: Annotate rdev/replacement access when mddev_lock is held · 9aeb7f99
      Logan Gunthorpe authored
      The mddev_lock should be held during raid5_remove_disk() which is when
      the rdev/replacement pointers are modified. So any access to these
      pointers marked __rcu should be safe whenever the mddev_lock is held.
      
      There are numerous such access that currently produce sparse warnings.
      Add a helper function, rdev_mdlock_deref() that wraps
      rcu_dereference_protected() in all these instances.
      
      This annotation fixes a number of sparse warnings.
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      9aeb7f99
    • Logan Gunthorpe's avatar
      md/raid5: Annotate rdev/replacement accesses when nr_pending is elevated · e38b0432
      Logan Gunthorpe authored
      There are a number of accesses to __rcu variables that should be safe
      because nr_pending in the disk is known to be elevated.
      
      Create a wrapper around rcu_dereference_protected() to annotate these
      accesses and verify that nr_pending is non-zero.
      
      This fixes a number of sparse warnings.
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      e38b0432
    • Logan Gunthorpe's avatar
      md/raid5: Add __rcu annotation to struct disk_info · b0920ede
      Logan Gunthorpe authored
      rdev and replacement are protected in some circumstances with
      rcu_dereference and synchronize_rcu (in raid5_remove_disk()). However,
      they were not annotated with __rcu so a sparse warning is emitted for
      every rcu_dereference() call.
      
      Add the __rcu annotation and fix up the initialization with
      RCU_INIT_POINTER, all pointer modifications with rcu_assign_pointer(),
      a few cases where the pointer value is tested with rcu_access_pointer()
      and one case where READ_ONCE() is used instead of rcu_dereference(),
      a case in print_raid5_conf() that should have rcu_dereference() and
      rcu_read_[un]lock() calls.
      
      Additional sparse issues will be fixed up in further commits.
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      b0920ede
    • Logan Gunthorpe's avatar
      md/raid5: Un-nest struct raid5_percpu definition · 3d9a644c
      Logan Gunthorpe authored
      Sparse reports many warnings of the form:
        drivers/md/raid5.c:1476:16: warning: dereference of noderef expression
      
      This is because all struct raid5_percpu definitions get marked as
      __percpu when really only the pointer in r5conf should have that
      annotation.
      
      Fix this by moving the defnition of raid5_precpu out of the definition
      of struct r5conf.
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      3d9a644c
    • Logan Gunthorpe's avatar
      md/raid5: Cleanup setup_conf() error returns · 8fbcba6b
      Logan Gunthorpe authored
      Be more careful about the error returns. Most errors in this function
      are actually ENOMEM, but it forcibly returns EIO if conf has been
      allocated.
      
      Instead return ret and ensure it is set appropriately before each goto
      abort.
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      8fbcba6b
    • Heming Zhao's avatar
      md: replace deprecated strlcpy & remove duplicated line · 92d9aac9
      Heming Zhao authored
      This commit includes two topics:
      
      1> replace deprecated strlcpy
      
      change strlcpy to strscpy for strlcpy is marked as deprecated in
      Documentation/process/deprecated.rst
      
      2> remove duplicated strlcpy line
      
      in md_bitmap_read_sb@md-bitmap.c there are two duplicated strlcpy(), the
      history:
      
      - commit cf921cc1 ("Add node recovery callbacks") introduced the first
        usage of strlcpy().
      
      - commit b97e9257 ("Use separate bitmaps for each nodes in the cluster")
        introduced the second strlcpy(). this time, the two strlcpy() are same,
         we can remove anyone safely.
      
      - commit d3b178ad ("md: Skip cluster setup for dm-raid") added dm-raid
        special handling. And the "nodes" value is the key of this patch. but
        from this patch, strlcpy() which was introduced by b97e9257
        become necessary.
      
      - commit 3c462c88 ("md: Increment version for clustered bitmaps") used
        clustered major version to only handle in clustered env. this patch
        could look a polishment for clustered code logic.
      
      So cf921cc1 became useless after d3b178ad, we could remove it
      safely.
      Signed-off-by: default avatarHeming Zhao <heming.zhao@suse.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      92d9aac9
    • Heming Zhao's avatar
      md/bitmap: don't set sb values if can't pass sanity check · e68cb83a
      Heming Zhao authored
      If bitmap area contains invalid data, kernel will crash then mdadm
      triggers "Segmentation fault".
      This is cluster-md speical bug. In non-clustered env, mdadm will
      handle broken metadata case. In clustered array, only kernel space
      handles bitmap slot info. But even this bug only happened in clustered
      env, current sanity check is wrong, the code should be changed.
      
      How to trigger: (faulty injection)
      
      dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sda
      dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sdb
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb
      mdadm -Ss
      echo aaa > magic.txt
       == below modifying slot 2 bitmap data ==
      dd if=magic.txt of=/dev/sda seek=16384 bs=1 count=3 <== destroy magic
      dd if=/dev/zero of=/dev/sda seek=16436 bs=1 count=4 <== ZERO chunksize
      mdadm -A /dev/md0 /dev/sda /dev/sdb
       == kernel crashes. mdadm outputs "Segmentation fault" ==
      
      Reason of kernel crash:
      
      In md_bitmap_read_sb (called by md_bitmap_create), bad bitmap magic didn't
      block chunksize assignment, and zero value made DIV_ROUND_UP_SECTOR_T()
      trigger "divide error".
      
      Crash log:
      
      kernel: md: md0 stopped.
      kernel: md/raid1:md0: not clean -- starting background reconstruction
      kernel: md/raid1:md0: active with 2 out of 2 mirrors
      kernel: dlm: ... ...
      kernel: md-cluster: Joined cluster 44810aba-38bb-e6b8-daca-bc97a0b254aa slot 1
      kernel: md0: invalid bitmap file superblock: bad magic
      kernel: md_bitmap_copy_from_slot can't get bitmap from slot 2
      kernel: md-cluster: Could not gather bitmaps from slot 2
      kernel: divide error: 0000 [#1] SMP NOPTI
      kernel: CPU: 0 PID: 1603 Comm: mdadm Not tainted 5.14.6-1-default
      kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
      kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod]
      kernel: RSP: 0018:ffffc22ac0843ba0 EFLAGS: 00010246
      kernel: ... ...
      kernel: Call Trace:
      kernel:  ? dlm_lock_sync+0xd0/0xd0 [md_cluster 77fe..7a0]
      kernel:  md_bitmap_copy_from_slot+0x2c/0x290 [md_mod 24ea..d3a]
      kernel:  load_bitmaps+0xec/0x210 [md_cluster 77fe..7a0]
      kernel:  md_bitmap_load+0x81/0x1e0 [md_mod 24ea..d3a]
      kernel:  do_md_run+0x30/0x100 [md_mod 24ea..d3a]
      kernel:  md_ioctl+0x1290/0x15a0 [md_mod 24ea....d3a]
      kernel:  ? mddev_unlock+0xaa/0x130 [md_mod 24ea..d3a]
      kernel:  ? blkdev_ioctl+0xb1/0x2b0
      kernel:  block_ioctl+0x3b/0x40
      kernel:  __x64_sys_ioctl+0x7f/0xb0
      kernel:  do_syscall_64+0x59/0x80
      kernel:  ? exit_to_user_mode_prepare+0x1ab/0x230
      kernel:  ? syscall_exit_to_user_mode+0x18/0x40
      kernel:  ? do_syscall_64+0x69/0x80
      kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
      kernel: RIP: 0033:0x7f4a15fa722b
      kernel: ... ...
      kernel: ---[ end trace 8afa7612f559c868 ]---
      kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod]
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Acked-by: default avatarGuoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: default avatarHeming Zhao <heming.zhao@suse.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      e68cb83a
    • Xiaomeng Tong's avatar
      md: fix an incorrect NULL check in md_reload_sb · 64c54d92
      Xiaomeng Tong authored
      The bug is here:
      	if (!rdev || rdev->desc_nr != nr) {
      
      The list iterator value 'rdev' will *always* be set and non-NULL
      by rdev_for_each_rcu(), so it is incorrect to assume that the
      iterator value will be NULL if the list is empty or no element
      found (In fact, it will be a bogus pointer to an invalid struct
      object containing the HEAD). Otherwise it will bypass the check
      and lead to invalid memory access passing the check.
      
      To fix the bug, use a new variable 'iter' as the list iterator,
      while using the original variable 'pdev' as a dedicated pointer to
      point to the found element.
      
      Cc: stable@vger.kernel.org
      Fixes: 70bcecdb ("md-cluster: Improve md_reload_sb to be less error prone")
      Signed-off-by: default avatarXiaomeng Tong <xiam0nd.tong@gmail.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      64c54d92
    • Xiaomeng Tong's avatar
      md: fix an incorrect NULL check in does_sb_need_changing · fc873834
      Xiaomeng Tong authored
      The bug is here:
      	if (!rdev)
      
      The list iterator value 'rdev' will *always* be set and non-NULL
      by rdev_for_each(), so it is incorrect to assume that the iterator
      value will be NULL if the list is empty or no element found.
      Otherwise it will bypass the NULL check and lead to invalid memory
      access passing the check.
      
      To fix the bug, use a new variable 'iter' as the list iterator,
      while using the original variable 'rdev' as a dedicated pointer to
      point to the found element.
      
      Cc: stable@vger.kernel.org
      Fixes: 2aa82191 ("md-cluster: Perform a lazy update")
      Acked-by: default avatarGuoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: default avatarXiaomeng Tong <xiam0nd.tong@gmail.com>
      Acked-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      fc873834
    • Mariusz Tkaczyk's avatar
      raid5: introduce MD_BROKEN · 57668f0a
      Mariusz Tkaczyk authored
      Raid456 module had allowed to achieve failed state. It was fixed by
      fb73b357 ("raid5: block failing device if raid will be failed").
      This fix introduces a bug, now if raid5 fails during IO, it may result
      with a hung task without completion. Faulty flag on the device is
      necessary to process all requests and is checked many times, mainly in
      analyze_stripe().
      Allow to set faulty on drive again and set MD_BROKEN if raid is failed.
      
      As a result, this level is allowed to achieve failed state again, but
      communication with userspace (via -EBUSY status) will be preserved.
      
      This restores possibility to fail array via #mdadm --set-faulty command
      and will be fixed by additional verification on mdadm side.
      
      Reproduction steps:
       mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
       mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
       mkfs.xfs /dev/md126 -f
       mount /dev/md126 /mnt/root/
      
       fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw
      --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4
      --time_based --group_reporting --name=throughput-test-job
      --eta-newline=1 &
      
       echo 1 > /sys/block/nvme2n1/device/device/remove
       echo 1 > /sys/block/nvme1n1/device/device/remove
      
       [ 1475.787779] Call Trace:
       [ 1475.793111] __schedule+0x2a6/0x700
       [ 1475.799460] schedule+0x38/0xa0
       [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
       [ 1475.813856] ? finish_wait+0x80/0x80
       [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
       [ 1475.828281] ? finish_wait+0x80/0x80
       [ 1475.834727] ? finish_wait+0x80/0x80
       [ 1475.841127] ? finish_wait+0x80/0x80
       [ 1475.847480] md_handle_request+0x119/0x190
       [ 1475.854390] md_make_request+0x8a/0x190
       [ 1475.861041] generic_make_request+0xcf/0x310
       [ 1475.868145] submit_bio+0x3c/0x160
       [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
       [ 1475.882070] iomap_dio_bio_actor+0x175/0x390
       [ 1475.889149] iomap_apply+0xff/0x310
       [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
       [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
       [ 1475.909974] iomap_dio_rw+0x2f2/0x490
       [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
       [ 1475.923680] ? atime_needs_update+0x77/0xe0
       [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
       [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
       [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
       [ 1475.953403] aio_read+0xd5/0x180
       [ 1475.959395] ? _cond_resched+0x15/0x30
       [ 1475.965907] io_submit_one+0x20b/0x3c0
       [ 1475.972398] __x64_sys_io_submit+0xa2/0x180
       [ 1475.979335] ? do_io_getevents+0x7c/0xc0
       [ 1475.986009] do_syscall_64+0x5b/0x1a0
       [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
       [ 1476.000255] RIP: 0033:0x7f11fc27978d
       [ 1476.006631] Code: Bad RIP value.
       [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.
      
      Cc: stable@vger.kernel.org
      Fixes: fb73b357 ("raid5: block failing device if raid will be failed")
      Reviewd-by: default avatarXiao Ni <xni@redhat.com>
      Signed-off-by: default avatarMariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      57668f0a
    • Mariusz Tkaczyk's avatar
      md: Set MD_BROKEN for RAID1 and RAID10 · 9631abdb
      Mariusz Tkaczyk authored
      There is no direct mechanism to determine raid failure outside
      personality. It is done by checking rdev->flags after executing
      md_error(). If "faulty" flag is not set then -EBUSY is returned to
      userspace. -EBUSY means that array will be failed after drive removal.
      
      Mdadm has special routine to handle the array failure and it is executed
      if -EBUSY is returned by md.
      
      There are at least two known reasons to not consider this mechanism
      as correct:
      1. drive can be removed even if array will be failed[1].
      2. -EBUSY seems to be wrong status. Array is not busy, but removal
         process cannot proceed safe.
      
      -EBUSY expectation cannot be removed without breaking compatibility
      with userspace. In this patch first issue is resolved by adding support
      for MD_BROKEN flag for RAID1 and RAID10. Support for RAID456 is added in
      next commit.
      
      The idea is to set the MD_BROKEN if we are sure that raid is in failed
      state now. This is done in each error_handler(). In md_error() MD_BROKEN
      flag is checked. If is set, then -EBUSY is returned to userspace.
      
      As in previous commit, it causes that #mdadm --set-faulty is able to
      fail array. Previously proposed workaround is valid if optional
      functionality[1] is disabled.
      
      [1] commit 9a567843("md: allow last device to be forcibly removed from
          RAID1/RAID10.")
      Reviewd-by: default avatarXiao Ni <xni@redhat.com>
      Signed-off-by: default avatarMariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      9631abdb
  2. 18 Apr, 2022 29 commits