- 22 Sep, 2023 2 commits
-
-
Yu Kuai authored
There are no functional changes; on the one hand this makes the code cleaner, and on the other hand it prevents the following checkpatch error in the next patch, which delays choosing the sync action to md_start_sync():

ERROR: do not use assignment in if condition
+	} else if ((spares = remove_and_add_spares(mddev, NULL))) {

Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Reviewed-by:
Xiao Ni <xni@redhat.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-3-yukuai1@huaweicloud.com
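For context, the checkpatch rule only objects to the assignment living inside the if condition. A hedged before/after sketch of the style change, simplified from the quoted hunk (not the literal patch):

    /* before: assignment inside the condition, flagged by checkpatch */
    } else if ((spares = remove_and_add_spares(mddev, NULL))) {
            /* start recovery */
    }

    /* after: assign first, then test -- same behaviour, easier to read */
    } else {
            spares = remove_and_add_spares(mddev, NULL);
            if (spares) {
                    /* start recovery */
            }
    }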
-
Yu Kuai authored
It's a little weird to borrow 'del_work' for md_start_sync(), so declare a new work_struct 'sync_work' for md_start_sync(). Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Reviewed-by:
Xiao Ni <xni@redhat.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-2-yukuai1@huaweicloud.com
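A minimal sketch of the described change: mddev gains a dedicated work item for starting the sync thread instead of reusing 'del_work'. INIT_WORK()/queue_work() are the standard workqueue APIs; the field placement and call site are simplified assumptions:

    struct mddev {
            /* ... */
            struct work_struct sync_work;   /* new: used only by md_start_sync() */
            struct work_struct del_work;    /* no longer doubles as the sync starter */
    };

    /* call site, e.g. from md_check_recovery(), simplified */
    INIT_WORK(&mddev->sync_work, md_start_sync);
    queue_work(md_misc_wq, &mddev->sync_work);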
-
- 14 Sep, 2023 1 commit
-
-
Mariusz Tkaczyk authored
If there are multiple arrays in the system, one mddev is marked with MD_DELETED, and md_seq_next() is called in the middle of removal, then it _get()s the proper device but may _put() the deleted one. As a result, the active counter may never reach zero for that mddev and it can never be removed. Put the device that was _get() by the previous md_seq_next() call. Cc: stable@vger.kernel.org Fixes: 12a6caf2 ("md: only delete entries from all_mddevs when the disk is freed") Reported-by:
AceLan Kao <acelan@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217798 Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230914152416.10819-1-mariusz.tkaczyk@linux.intel.com
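The contract being restored is that ->next() must drop the reference on the element it was handed (v), not on a freshly looked-up entry that may already be deleted. A hedged sketch of the pattern (find_next_mddev() is an assumed helper, not the actual md code):

    static void *md_seq_next_sketch(struct seq_file *seq, void *v, loff_t *pos)
    {
            struct mddev *prev = v;
            struct mddev *next = find_next_mddev(prev, pos); /* assumed: takes its own reference */

            ++*pos;
            mddev_put(prev);        /* put what the previous call handed out, even if it
                                     * is already marked MD_DELETED */
            return next;
    }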
-
- 08 Sep, 2023 2 commits
-
-
Yu Kuai authored
Commit a1d76719 ("md: use mddev->external to select holder in export_rdev()") fix the problem that 'claim_rdev' is used for blkdev_get_by_dev() while 'rdev' is used for blkdev_put(). However, if mddev->external is changed from 0 to 1, then 'rdev' is used for blkdev_get_by_dev() while 'claim_rdev' is used for blkdev_put(). And this problem can be reporduced reliably by following: New file: mdadm/tests/23rdev-lifetime devname=${dev0##*/} devt=`cat /sys/block/$devname/dev` pid="" runtime=2 clean_up_test() { pill -9 $pid echo clear > /sys/block/md0/md/array_state } trap 'clean_up_test' EXIT add_by_sysfs() { while true; do echo $devt > /sys/block/md0/md/new_dev done } remove_by_sysfs(){ while true; do echo remove > /sys/block/md0/md/dev-${devname}/state done } echo md0 > /sys/module/md_mod/parameters/new_array || die "create md0 failed" add_by_sysfs & pid="$pid $!" remove_by_sysfs & pid="$pid $!" sleep $runtime exit 0 Test cmd: ./test --save-logs --logdir=/tmp/ --keep-going --dev=loop --tests=23rdev-lifetime Test result: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 960 at block/bdev.c:618 blkdev_put+0x27c/0x330 Modules linked in: multipath md_mod loop CPU: 0 PID: 960 Comm: test Not tainted 6.5.0-rc2-00121-g01e55c376936-dirty #50 RIP: 0010:blkdev_put+0x27c/0x330 Call Trace: <TASK> export_rdev.isra.23+0x50/0xa0 [md_mod] mddev_unlock+0x19d/0x300 [md_mod] rdev_attr_store+0xec/0x190 [md_mod] sysfs_kf_write+0x52/0x70 kernfs_fop_write_iter+0x19a/0x2a0 vfs_write+0x3b5/0x770 ksys_write+0x74/0x150 __x64_sys_write+0x22/0x30 do_syscall_64+0x40/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fix the problem by recording if 'rdev' is used as holder. Fixes: a1d76719 ("md: use mddev->external to select holder in export_rdev()") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825025532.1523008-3-yukuai1@huaweicloud.com
-
Yu Kuai authored
Except for the initial reference, mddev->kobject is referenced by rdev->kobject, and if the last rdev is freed there is no guarantee that mddev is still valid. Hence mddev should not be used anymore after export_rdev(). This problem can be triggered, at a very low rate, by the following mdadm test.

New file: mdadm/tests/23rdev-lifetime

devname=${dev0##*/}
devt=`cat /sys/block/$devname/dev`
pid=""
runtime=2

clean_up_test() {
	kill -9 $pid
	echo clear > /sys/block/md0/md/array_state
}

trap 'clean_up_test' EXIT

add_by_sysfs() {
	while true; do
		echo $devt > /sys/block/md0/md/new_dev
	done
}

remove_by_sysfs(){
	while true; do
		echo remove > /sys/block/md0/md/dev-${devname}/state
	done
}

echo md0 > /sys/module/md_mod/parameters/new_array || die "create md0 failed"

add_by_sysfs &
pid="$pid $!"

remove_by_sysfs &
pid="$pid $!"

sleep $runtime
exit 0

Test cmd:

./test --save-logs --logdir=/tmp/ --keep-going --dev=loop --tests=23rdev-lifetime

Test result:

general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bcb: 0000 [#4] PREEMPT SMP
CPU: 0 PID: 1292 Comm: test Tainted: G D W 6.5.0-rc2-00121-g01e55c376936 #562
RIP: 0010:md_wakeup_thread+0x9e/0x320 [md_mod]
Call Trace:
 <TASK>
 mddev_unlock+0x1b6/0x310 [md_mod]
 rdev_attr_store+0xec/0x190 [md_mod]
 sysfs_kf_write+0x52/0x70
 kernfs_fop_write_iter+0x19a/0x2a0
 vfs_write+0x3b5/0x770
 ksys_write+0x74/0x150
 __x64_sys_write+0x22/0x30
 do_syscall_64+0x40/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fix this problem by not dereferencing mddev after export_rdev(). Fixes: 3ce94ce5 ("md: fix duplicate filename for rdev") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825025532.1523008-2-yukuai1@huaweicloud.com
-
- 15 Aug, 2023 1 commit
-
-
Li Lingfeng authored
Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in action_store"") removed the scenario of calling md_unregister_thread() without holding mddev->reconfig_mutex, so add a lock holding check before acquiring mddev->sync_thread by passing mdev to md_unregister_thread(). Signed-off-by:
Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com Signed-off-by:
Song Liu <song@kernel.org>
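A simplified sketch of the idea: passing mddev in lets the helper assert the locking rule instead of trusting callers. This is not the actual function body, just the shape of the check; lockdep_assert_held() and kthread_stop() are the standard kernel APIs:

    void md_unregister_thread(struct mddev *mddev, struct md_thread **threadp)
    {
            struct md_thread *thread;

            lockdep_assert_held(&mddev->reconfig_mutex);    /* new: callers must hold the lock */

            thread = *threadp;
            if (!thread)
                    return;
            *threadp = NULL;
            kthread_stop(thread->tsk);
            kfree(thread);
    }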
-
- 27 Jul, 2023 12 commits
-
-
Yu Kuai authored
memalloc_noio_save() is called for the first mddev_suspend(), and repeated mddev_suspend() calls only increase 'suspended'. However, memalloc_noio_restore() is also called for the first mddev_resume(), which means that memory reclaim will be enabled before the last mddev_resume() is called, while the array is still suspended. Fix this problem by restoring 'noio_flag' only in the last mddev_resume(). Fixes: 78f57ef9 ("md: use memalloc scope APIs in mddev_suspend()/mddev_resume()") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230628012931.88911-3-yukuai1@huaweicloud.com Signed-off-by:
Song Liu <song@kernel.org>
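In effect 'noio_flag' has to follow the suspend nesting depth: save on the first suspend, restore on the last resume. A hedged sketch, heavily simplified (the real functions also quiesce and restart io); memalloc_noio_save()/restore() and the 'suspended'/'noio_flag' fields are taken from the description above:

    void mddev_suspend(struct mddev *mddev)
    {
            if (mddev->suspended++ == 0)
                    mddev->noio_flag = memalloc_noio_save();  /* outermost suspend only */
            /* ... wait for active_io to drain ... */
    }

    void mddev_resume(struct mddev *mddev)
    {
            if (--mddev->suspended)
                    return;                                   /* still nested: stay in NOIO scope */
            memalloc_noio_restore(mddev->noio_flag);          /* last resume re-enables reclaim */
            /* ... restart io ... */
    }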
-
Yu Kuai authored
Some levels don't implement "pers->quiesce"; for example, raid0_quiesce() is empty. Now that all levels hold 'active_io' until io is done, waiting for 'active_io' to reach 0 is enough to make sure all normal io is done, and percpu_ref_kill() on 'active_io' makes sure no new normal io can be dispatched. There is no need to call "pers->quiesce" from mddev_suspend() anymore. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230628012931.88911-2-yukuai1@huaweicloud.com Signed-off-by:
Song Liu <song@kernel.org>
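Roughly why the call can go away: suspension already provides the same guarantee through 'active_io'. A hedged sketch (the wait queue name is an assumption; percpu_ref_kill()/percpu_ref_is_zero() and wait_event() are the standard APIs):

    static void mddev_suspend_sketch(struct mddev *mddev)
    {
            percpu_ref_kill(&mddev->active_io);     /* no new normal io can grab a reference */
            wait_event(mddev->sb_wait,
                       percpu_ref_is_zero(&mddev->active_io));  /* all dispatched io is done */
            /* pers->quiesce(mddev, 1) is no longer needed here */
    }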
-
Yu Kuai authored
Currently, 'active_io' is grabbed before make_request() is called, and it's dropped as soon as make_request() returns. Hence 'active_io' actually means io is being dispatched, not io is in flight. For raid0 and raid456, where io accounting is enabled, 'active_io' is also grabbed when the bio is cloned for io accounting, and that reference is not dropped until io is done. Always clone a new bio so that 'active_io' means io is in flight; raid1 and raid10 will switch to this method in later patches. Now that the bio is cloned even when io accounting is disabled, also rename the related structures from '*_acct_*' to '*_clone_*'. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Reviewed-by:
Xiao Ni <xni@redhat.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621165110.1498313-3-yukuai1@huaweicloud.com
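A hedged sketch of the clone-always scheme: the original bio is completed from the clone's end_io, which is also where 'active_io' is dropped, so the counter tracks in-flight io. The structure and helper names below are illustrative (derived from the '*_clone_*' rename), and error propagation is simplified:

    struct md_io_clone_sketch {
            struct mddev *mddev;
            struct bio   *orig_bio;
    };

    static void md_end_clone_io_sketch(struct bio *clone)
    {
            struct md_io_clone_sketch *md_io = clone->bi_private;

            md_io->orig_bio->bi_status = clone->bi_status;  /* error propagation simplified */
            bio_endio(md_io->orig_bio);                     /* complete the caller's bio */
            percpu_ref_put(&md_io->mddev->active_io);       /* io is really done now */
            bio_put(clone);
            kfree(md_io);
    }

    static struct bio *md_clone_bio_sketch(struct mddev *mddev, struct bio *bio)
    {
            struct md_io_clone_sketch *md_io = kmalloc(sizeof(*md_io), GFP_NOIO);
            struct bio *clone;

            if (!md_io)
                    return NULL;    /* a caller could fall back to submitting 'bio' directly */

            clone = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO, &mddev->io_clone_set);
            md_io->mddev = mddev;
            md_io->orig_bio = bio;
            clone->bi_private = md_io;
            clone->bi_end_io = md_end_clone_io_sketch;
            return clone;           /* the clone, not 'bio', is what gets dispatched */
    }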
-
Yu Kuai authored
'io_acct_set' is currently only used for raid0 and raid456; prepare to use it for raid1 and raid10 as well, so that io accounting is consistent across levels. By the way, follow-up patches will also use this io clone mechanism to make sure 'active_io' represents in-flight io, not io that is being dispatched, so that mddev_suspend will wait for io to be done as designed. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Reviewed-by:
Xiao Ni <xni@redhat.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621165110.1498313-2-yukuai1@huaweicloud.com
-
Christoph Hellwig authored
The support for bitmaps on files is a very bad idea abusing various kernel APIs, and it fundamentally requires the file not to be on the actual array, without any way to check that this is actually the case. Add a deprecation warning to see if we might be able to eventually drop it. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-12-hch@lst.de
-
Christoph Hellwig authored
The support for write intent bitmaps kept in files external to md is a hot mess that abuses ->bmap to map file offsets into physical device objects, and also abuses buffer_heads in a creative way. Make this code optional so that MD can be built into future kernels without buffer_head support, and so that we can eventually deprecate it. Note this does not affect the internal bitmap support, which has none of these problems. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-11-hch@lst.de
-
Yu Kuai authored
For md_check_recovery():

1) if 'MD_RECOVERY_RUNNING' is not set, register a new sync_thread.
2) if 'MD_RECOVERY_RUNNING' is set:
   a) if 'MD_RECOVERY_DONE' is not set, don't do anything and wait for md_do_sync() to be done.
   b) if 'MD_RECOVERY_DONE' is set, unregister the sync_thread.

The current code expects that sync_thread is not NULL, otherwise a new sync_thread will be registered, which will corrupt the array. Make sure md_check_recovery() won't register a new sync_thread if 'MD_RECOVERY_RUNNING' is still set, and add a new WARN_ON_ONCE() for the above corruption. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Reviewed-by:
Xiao Ni <xni@redhat.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529132037.2124527-7-yukuai1@huaweicloud.com
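A condensed sketch of the decision flow described above (heavily simplified; the real md_check_recovery() handles many more cases):

    /* inside md_check_recovery(), with reconfig_mutex held -- sketch only */
    if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
            /* case 1: nothing running, safe to pick a sync action and
             * queue md_start_sync() to register a new sync_thread */
    } else if (!test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
            /* case 2a: running but not done, leave md_do_sync() alone */
    } else {
            /* case 2b: running and done, reap it; it must still exist */
            WARN_ON_ONCE(!mddev->sync_thread);
            md_reap_sync_thread(mddev);
    }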
-
Yu Kuai authored
md_reap_sync_thread() has just been replaced with wait_event(resync_wait, ...) in action_store(); make sure action_store() will still wait for everything in md_reap_sync_thread() to be done. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Reviewed-by:
Xiao Ni <xni@redhat.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529132037.2124527-6-yukuai1@huaweicloud.com
-
Yu Kuai authored
Our test found the following deadlock in raid10:

1) Issue a normal write, and such write failed:

   raid10_end_write_request
    set_bit(R10BIO_WriteError, &r10_bio->state)
    one_write_done
     reschedule_retry

   // later from md thread
   raid10d
    handle_write_completed
     list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

   // later from md thread
   raid10d
    if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
     list_move(conf->bio_end_io_list.prev, &tmp)
     r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
     raid_end_bio_io(r10_bio)

   Dependency chain 1: normal io is waiting for the superblock update

2) Trigger a recovery:

   raid10_sync_request
    raise_barrier

   Dependency chain 2: the sync thread is waiting for normal io

3) echo idle/frozen to sync_action:

   action_store
    mddev_lock
    md_unregister_thread
     kthread_stop

   Dependency chain 3: dropping 'reconfig_mutex' is waiting for the sync thread

4) md thread can't update the superblock:

   raid10d
    md_check_recovery
     if (mddev_trylock(mddev))
      md_update_sb

   Dependency chain 4: the superblock update is waiting for 'reconfig_mutex'

Hence a cyclic dependency exists; in order to fix the problem, we must break one of them. Dependencies 1 and 2 can't be broken because they are foundation design. Dependency 4 might be breakable if it could be guaranteed that no io is inflight, however, this requires a new mechanism which seems complex. Dependency 3 is a good choice, because idle/frozen only requires the sync thread to finish, which can be done asynchronously (this is already implemented), and 'reconfig_mutex' is not needed anymore.

This patch switches 'idle' and 'frozen' to wait for the sync thread to be done asynchronously, and it also adds a sequence counter to record how many times the sync thread has finished, so that 'idle' won't keep waiting on newly started sync threads.

Note that raid456 has a similar deadlock ([1]), and it has been verified [2] that this deadlock can be fixed by this patch as well.

[1] https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
[2] https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/

Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com
-
Yu Kuai authored
Currently, for 'idle' and 'frozen', action_store() will hold 'reconfig_mutex' and call md_reap_sync_thread() to stop the sync thread; however, this will cause a deadlock (explained in the next patch). In order to fix the problem, a following patch will release 'reconfig_mutex' and wait on 'resync_wait', like md_set_readonly() and do_md_stop() do. Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN' unconditionally, which might cause unexpected problems: for example, 'frozen' just set 'MD_RECOVERY_FROZEN' and is still in progress while 'idle' clears 'MD_RECOVERY_FROZEN' and a new sync thread is started, which might starve the in-progress 'frozen'. A mutex is added to synchronize 'idle' and 'frozen' from action_store(). Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529132037.2124527-4-yukuai1@huaweicloud.com
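Put together with the previous patch description, the 'idle'/'frozen' path ends up looking roughly like this. This is a hedged sketch: the names of the new mutex and of the direct-wakeup helper are assumptions, and the sequence counter mentioned above is omitted; 'resync_wait' and the MD_RECOVERY_* flags are from the descriptions above:

    /* action_store() handling "idle"/"frozen", simplified sketch */
    mutex_lock(&mddev->sync_mutex);                  /* assumed name of the new mutex */

    if (frozen)
            set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
    else
            clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);

    if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
            set_bit(MD_RECOVERY_INTR, &mddev->recovery);
            md_wakeup_thread_directly(mddev->sync_thread);   /* assumed helper name */
            /* wait for the sync thread without holding reconfig_mutex,
             * like md_set_readonly() and do_md_stop() do */
            wait_event(resync_wait,
                       !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
    }

    mutex_unlock(&mddev->sync_mutex);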
-
Yu Kuai authored
Prepare to handle 'idle' and 'frozen' differently to fix a deadlock; there are no functional changes except that MD_RECOVERY_RUNNING is checked again after 'reconfig_mutex' is held. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529132037.2124527-3-yukuai1@huaweicloud.com
-
Yu Kuai authored
This reverts commit 9dfbdafd, because it introduces a defect: sync_thread can be running while MD_RECOVERY_RUNNING is cleared, which causes unexpected problems, for example:

list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
Call trace:
 __list_add_valid+0xfc/0x140
 insert_work+0x78/0x1a0
 __queue_work+0x500/0xcf4
 queue_work_on+0xe8/0x12c
 md_check_recovery+0xa34/0xf30
 raid10d+0xb8/0x900 [raid10]
 md_thread+0x16c/0x2cc
 kthread+0x1a4/0x1ec
 ret_from_fork+0x10/0x18

This is because the work is requeued while it's still inside the workqueue:

t1: action_store
     mddev_lock
      if (mddev->sync_thread)
       mddev_unlock
       md_unregister_thread
       // first sync_thread is done

t2: md_check_recovery
     mddev_try_lock
     /*
      * once MD_RECOVERY_DONE is set, new sync_thread
      * can start.
      */
     set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
     INIT_WORK(&mddev->del_work, md_start_sync)
     queue_work(md_misc_wq, &mddev->del_work)
      test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
      // set pending bit
      insert_work
       list_add_tail
     mddev_unlock

t1 (continued):
       mddev_lock_nointr
       md_reap_sync_thread
       // MD_RECOVERY_RUNNING is cleared
      mddev_unlock

t3: // before the work queued from t2 has started
    md_check_recovery
    // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
    INIT_WORK(&mddev->del_work, md_start_sync)
     work->data = 0
     // work pending bit is cleared
    queue_work(md_misc_wq, &mddev->del_work)
     insert_work
      list_add_tail
      // list is corrupted

The above commit is reverted to fix the problem; the deadlock it tried to fix will be addressed in following patches. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529132037.2124527-2-yukuai1@huaweicloud.com
-
- 25 Jul, 2023 1 commit
-
-
Yu Kuai authored
__md_stop_writes() and __md_stop() will modify many fields that are protected by 'reconfig_mutex', and all the callers will grab 'reconfig_mutex' except for md_stop(). Also, update md_stop() to make certain 'reconfig_mutex' is held using lockdep_assert_held(). Fixes: 9d09e663 ("dm: raid456 basic support") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Mike Snitzer <snitzer@kernel.org>
-
- 23 Jun, 2023 2 commits
-
-
Yu Kuai authored
Commit 3ce94ce5 ("md: fix duplicate filename for rdev") introduce a new lock 'delete_mutex', and trigger a new deadlock: t1: remove rdev t2: sysfs writer rdev_attr_store rdev_attr_store mddev_lock state_store md_kick_rdev_from_array lock delete_mutex list_add mddev->deleting unlock delete_mutex mddev_unlock mddev_lock ... lock delete_mutex kobject_del // wait for sysfs writers to be done mddev_unlock lock delete_mutex // wait for delete_mutex, deadlock 'delete_mutex' is used to protect the list 'mddev->deleting', turns out that this list can be protected by 'reconfig_mutex' directly, and this lock can be removed. Fix this problem by removing the lock, and use 'reconfig_mutex' to protect the list. mddev_unlock() will move this list to a local list to be handled after 'reconfig_mutex' is dropped. Fixes: 3ce94ce5 ("md: fix duplicate filename for rdev") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621142933.1395629-1-yukuai1@huaweicloud.com
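The resulting pattern — build up 'mddev->deleting' under 'reconfig_mutex', then splice it to a local list inside mddev_unlock() so the sleeping teardown happens after the mutex is dropped — looks roughly like this (hedged sketch, not the full mddev_unlock(); list_splice_init()/list_for_each_entry_safe() are the standard list APIs):

    void mddev_unlock(struct mddev *mddev)
    {
            LIST_HEAD(delete);
            struct md_rdev *rdev, *tmp;

            /* still under reconfig_mutex: take ownership of the pending deletions */
            list_splice_init(&mddev->deleting, &delete);

            mutex_unlock(&mddev->reconfig_mutex);

            /* safe to call kobject_del()/export_rdev() without the mutex now */
            list_for_each_entry_safe(rdev, tmp, &delete, same_set) {
                    list_del_init(&rdev->same_set);
                    export_rdev(rdev, mddev);
            }
    }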
-
Song Liu authored
mdadm test "10ddf-create-fail-rebuild" triggers warnings like the following [ 215.526357] ------------[ cut here ]------------ [ 215.527243] WARNING: CPU: 18 PID: 1264 at block/bdev.c:617 blkdev_put+0x269/0x350 [ 215.528334] Modules linked in: [ 215.528806] CPU: 18 PID: 1264 Comm: mdmon Not tainted 6.4.0-rc2+ #768 [ 215.529863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS [ 215.531464] RIP: 0010:blkdev_put+0x269/0x350 [ 215.532167] Code: ff ff 49 8d 7d 10 e8 56 bf b8 ff 4d 8b 65 10 49 8d bc 24 58 05 00 00 e8 05 be b8 ff 41 83 ac 24 58 05 00 00 01 e9 44 ff ff ff <0f> 0b e9 52 fe ff ff 0f 0b e9 6b fe ff ff1 [ 215.534780] RSP: 0018:ffffc900040bfbf0 EFLAGS: 00010283 [ 215.535635] RAX: ffff888174001000 RBX: ffff88810b1c3b00 RCX: ffffffff819a4061 [ 215.536645] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88810b1c3ba0 [ 215.537657] RBP: ffff88810dbde800 R08: fffffbfff0fca983 R09: fffffbfff0fca983 [ 215.538674] R10: ffffc900040bfbf0 R11: fffffbfff0fca982 R12: ffff88810b1c3b38 [ 215.539687] R13: ffff88810b1c3b10 R14: ffff88810dbdecb8 R15: ffff88810b1c3b00 [ 215.540833] FS: 00007f2aabdff700(0000) GS:ffff888dfb400000(0000) knlGS:0000000000000000 [ 215.541961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 215.542775] CR2: 00007fa19a85d934 CR3: 000000010c076006 CR4: 0000000000370ee0 [ 215.543814] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 215.544840] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 215.545885] Call Trace: [ 215.546257] <TASK> [ 215.546608] export_rdev.isra.63+0x71/0xe0 [ 215.547338] mddev_unlock+0x1b1/0x2d0 [ 215.547898] array_state_store+0x28d/0x450 [ 215.548519] md_attr_store+0xd7/0x150 [ 215.549059] ? __pfx_sysfs_kf_write+0x10/0x10 [ 215.549702] kernfs_fop_write_iter+0x1b9/0x260 [ 215.550351] vfs_write+0x491/0x760 [ 215.550863] ? __pfx_vfs_write+0x10/0x10 [ 215.551445] ? __fget_files+0x156/0x230 [ 215.552053] ksys_write+0xc0/0x160 [ 215.552570] ? __pfx_ksys_write+0x10/0x10 [ 215.553141] ? ktime_get_coarse_real_ts64+0xec/0x100 [ 215.553878] do_syscall_64+0x3a/0x90 [ 215.554403] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 215.555125] RIP: 0033:0x7f2aade11847 [ 215.555696] Code: c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 1b fd ff ff 4c 89 e2 48 89 ee 89 df 41 89 c0 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 448 [ 215.558398] RSP: 002b:00007f2aabdfeba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 215.559516] RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007f2aade11847 [ 215.560515] RDX: 0000000000000005 RSI: 0000000000438b8b RDI: 0000000000000010 [ 215.561512] RBP: 0000000000438b8b R08: 0000000000000000 R09: 00007f2aaecf0060 [ 215.562511] R10: 000000000e3ba40b R11: 0000000000000293 R12: 0000000000000005 [ 215.563647] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000c70750 [ 215.564693] </TASK> [ 215.565029] irq event stamp: 15979 [ 215.565584] hardirqs last enabled at (15991): [<ffffffff811a7432>] __up_console_sem+0x52/0x60 [ 215.566806] hardirqs last disabled at (16000): [<ffffffff811a7417>] __up_console_sem+0x37/0x60 [ 215.568022] softirqs last enabled at (15716): [<ffffffff8277a2db>] __do_softirq+0x3eb/0x531 [ 215.569239] softirqs last disabled at (15711): [<ffffffff810d8f45>] irq_exit_rcu+0x115/0x160 [ 215.570434] ---[ end trace 0000000000000000 ]--- This means export_rdev() calls blkdev_put with a different holder than the one used by blkdev_get_by_dev(). This is because mddev->major_version == -2 is not a good check for external metadata. 
Fix this by using mddev->external instead. Also, do not clear mddev->external in md_clean(), as the flag might be used later in export_rdev(). Fixes: 2736e8ee ("block: use the holder as indication for exclusive opens") Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by:
Song Liu <song@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230617052405.305871-1-song@kernel.org
-
- 13 Jun, 2023 11 commits
-
-
Yu Kuai authored
If the bitmap is enabled, the bitmap must be updated before submitting write io; this is why the unplug callback must move such io to 'conf->pending_io_list' if 'current->bio_list' is not empty, which causes performance degradation. A new helper md_bitmap_unplug_async() is introduced to submit bitmap io in a kworker, so that submitting bitmap io in raid10_unplug() doesn't require 'current->bio_list' to be empty. This patch prepares to limit the number of plugged bios. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-6-yukuai1@huaweicloud.com
-
Li Nan authored
Commit 1a855a06 ("md: fix bug with re-adding of partially recovered device.") only add device which is set to In_sync. But it let devices without metadata cannot be added when they should be. Commit bf572541 ("md: fix regression with re-adding devices to arrays with no metadata") fix the above issue, it set device without metadata to In_sync when add new disk. However, after commit f466722c ("md: Change handling of save_raid_disk and metadata update during recovery.") deletes changes of the first patch, setting In_sync for devcie without metadata is meanless because the flag will be cleared soon and will not be used during this period. Clean it up. Signed-off-by:
Li Nan <linan122@huawei.com> Reviewed-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230527101851.3266500-2-linan666@huaweicloud.com
-
Yu Kuai authored
Currently, there are many places where md_thread can be accessed without protection; the following are known scenarios that can cause a null-ptr-dereference or use-after-free:

1) sync_thread that is allocated and started from md_start_sync();
2) mddev->thread accessed directly from timeout_store() and md_bitmap_daemon_work();
3) md_unregister_thread() called from action_store().

Currently, a global spinlock 'pers_lock' is borrowed to protect 'mddev->thread' in some places; this problem could be fixed likewise, however, using a global lock for all the cases is not good. Fix this problem by protecting every md_thread with RCU. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
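A hedged sketch of the RCU scheme: readers look the thread pointer up under rcu_read_lock(), and the unregister path publishes NULL and waits for readers before stopping and freeing. Field and flag names follow struct md_thread but should be treated as assumptions here; the RCU primitives are the standard kernel APIs:

    void md_wakeup_thread(struct md_thread __rcu *thread)
    {
            struct md_thread *t;

            rcu_read_lock();
            t = rcu_dereference(thread);
            if (t) {
                    set_bit(THREAD_WAKEUP, &t->flags);
                    wake_up(&t->wqueue);
            }
            rcu_read_unlock();
    }

    void md_unregister_thread_sketch(struct mddev *mddev, struct md_thread __rcu **threadp)
    {
            struct md_thread *thread = rcu_dereference_protected(*threadp,
                                            lockdep_is_held(&mddev->reconfig_mutex));

            if (!thread)
                    return;
            rcu_assign_pointer(*threadp, NULL);
            synchronize_rcu();              /* readers are done with the old pointer */
            kthread_stop(thread->tsk);
            kfree(thread);
    }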
-
Yu Kuai authored
md_wakeup_thread() can't wake up md_thread->tsk if md_thread->run is still in progress, and in some cases md_thread->tsk needs to be woken up directly, as in md_set_readonly() and do_md_stop(). Commit 9dfbdafd ("md: unlock mddev before reap sync_thread in action_store") introduced a new scenario where unregistering sync_thread is not protected by 'reconfig_mutex', which can cause a null-ptr-dereference in theory:

t2: action_store
     md_unregister_thread          // 'reconfig_mutex' is not held
t1: md_set_readonly               // 'reconfig_mutex' is held by caller
     if (mddev->sync_thread)
t2:   thread = *threadp
      *threadp = NULL
t1:  wake_up_process(mddev->sync_thread->tsk)
     // null-ptr-dereference

Fix this problem by factoring out a helper to wake up md_thread directly, so that 'sync_thread' won't be accessed multiple times from the reader side. This helper also prepares to protect md_thread with RCU. Note that later patches will fix the fact that unregistering sync_thread is not protected by 'reconfig_mutex' in action_store(). Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-2-yukuai1@huaweicloud.com
-
Yu Kuai authored
Commit 5792a285 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs") delays the deletion of rdev, however, this introduces a window that rdev can be added again while the deletion is not done yet, and sysfs will complain about duplicate filename. Follow up patches try to fix this problem by flushing workqueue, however, flush_rdev_wq() is just dead code, the progress in md_kick_rdev_from_array(): 1) list_del_rcu(&rdev->same_set); 2) synchronize_rcu(); 3) queue_work(md_rdev_misc_wq, &rdev->del_work); So in flush_rdev_wq(), if rdev is found in the list, work_pending() can never pass, in the meantime, if work is queued, then rdev can never be found in the list. flush_rdev_wq() can be replaced by flush_workqueue() directly, however, this approach is not good: - the workqueue is global, this synchronization for all raid disks is not necessary. - flush_workqueue can't be called under 'reconfig_mutex', there is still a small window between flush_workqueue() and mddev_lock() that other contexts can queue new work, hence the problem is not solved completely. sysfs already has apis to support delete itself through writer, and these apis, specifically sysfs_break/unbreak_active_protection(), is used to support deleting rdev synchronously. Therefore, the above commit can be reverted, and sysfs duplicate filename can be avoided. A new mdadm regression test is proposed as well([1]). [1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/ Fixes: 5792a285 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523012727.3042247-1-yukuai1@huaweicloud.com
-
Li Nan authored
There is no input check when echoing to md/max_read_errors, and overflow might occur. Add a check on the input number. Fixes: 1e50915f ("raid: improve MD/raid10 handling of correctable read errors.") Signed-off-by:
Li Nan <linan122@huawei.com> Reviewed-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230522072535.1523740-3-linan666@huaweicloud.com
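The shape of the fix is a bounded kstrtouint() in the sysfs store handler. A hedged sketch (the exact upper bound used upstream may differ; kstrtouint() and atomic_set() are the standard APIs):

    static ssize_t
    max_corrected_read_errors_store(struct mddev *mddev, const char *buf, size_t len)
    {
            unsigned int n;
            int rv;

            rv = kstrtouint(buf, 10, &n);   /* rejects junk and values over UINT_MAX */
            if (rv < 0)
                    return rv;
            if (n > INT_MAX)                /* keep the value representable in the atomic_t */
                    return -EINVAL;
            atomic_set(&mddev->max_corr_read_errors, n);
            return len;
    }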
-
Li Nan authored
There is no input check when echoing to md/safe_mode_delay in safe_delay_store(), and msec might also overflow when HZ < 1000 in safe_delay_show(). Fix it by checking for overflow in safe_delay_store() and using an unsigned long conversion in safe_delay_show(). Fixes: 72e02075 ("md: factor out parsing of fixed-point numbers") Signed-off-by:
Li Nan <linan122@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230522072535.1523740-2-linan666@huaweicloud.com
-
Yu Kuai authored
If reshape is in progress and io across reshape_position is issued, such io will wait for reshape to make progress (see details in the case where make_stripe_request() returns STRIPE_SCHEDULE_AND_RETRY). It has been reported several times that if the system reboots while growing raid5 to raid6, array assembly will hang infinitely ([1], [2]). This is because the following deadlock is triggered:

1) a normal io is waiting for reshape to make progress; this io can be from systemd-udevd or mdadm.
2) while assembling, mdadm tries to suspend the array, hence 'reconfig_mutex' is held and mddev_suspend() must wait for normal io to be done.
3) the daemon thread can't start reshape because 'reconfig_mutex' can't be held.

1) and 3) are unbreakable because they're foundation design. In order to break 2), the following are possible solutions that I can think of:

a) Letting mddev_suspend() fail is not a good option, because this will break many scenarios, since mddev_suspend() never failed before.
b) Fail the io that is waiting for reshape to make progress from mddev_suspend().
c) Return false for the io that is waiting for reshape to make progress from raid5_make_request(); such io will then wait for suspend to be done in md_handle_request(), where 'active_io' is not grabbed.

c) sounds better than b), however, b) is used because it's easy and straightforward, and it's verified that mdadm can assemble in this case. On the other hand, c) breaks the logic that mddev_suspend() will wait for submitted io to be completely handled. Fix the problem by checking reshape in mddev_suspend(): if reshape can't make progress and there is still io waiting for reshape, fail that io.

[1] https://lore.kernel.org/all/CAFig2csUV2QiomUhj_t3dPOgV300dbQ6XtM9ygKPdXJFSH__Nw@mail.gmail.com/
[2] https://lore.kernel.org/all/CAO2ABipzbw6QL5eNa44CQHjiVa-LTvS696Mh9QaTw+qsUKFUCw@mail.gmail.com/

Reported-by:
Jove <jovetoo@gmail.com> Reported-by:
David Gilmour <dgilmour76@gmail.com> Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230512015610.821290-6-yukuai1@huaweicloud.com
-
Yu Kuai authored
There are no functional changes; the new api will be used later to do special handling for raid456 in md_suspend(). Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230512015610.821290-5-yukuai1@huaweicloud.com
-
Yu Kuai authored
The two apis will be used later to fix a deadlock in raid456; there are no functional changes. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230512015610.821290-4-yukuai1@huaweicloud.com
-
Yu Kuai authored
Currently, if reshape is interrupted, echoing "reshape" to sync_action will restart reshape from scratch, for example:

echo frozen > sync_action
echo reshape > sync_action

This will corrupt data before reshape_position if the array is growing; fix the problem by continuing reshape from reshape_position instead. Reported-by:
Peter Neuwirth <reddunur@online.de> Link: https://lore.kernel.org/linux-raid/e2f96772-bfbc-f43b-6da1-f520e5164536@online.de/ Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230512015610.821290-3-yukuai1@huaweicloud.com
-
- 12 Jun, 2023 5 commits
-
-
Christoph Hellwig authored
The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by:
Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
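A hedged usage sketch of the new convention: a non-NULL holder is what makes the open exclusive, and the same holder must be passed back at put time. Signatures are as introduced by this series; 'dev' and 'holder' are placeholders:

    int open_and_release_sketch(dev_t dev, void *holder)
    {
            struct block_device *bdev;

            /* a non-NULL holder is what used to be FMODE_EXCL */
            bdev = blkdev_get_by_dev(dev, BLK_OPEN_READ | BLK_OPEN_WRITE, holder, NULL);
            if (IS_ERR(bdev))
                    return PTR_ERR(bdev);

            /* ... use the device ... */

            blkdev_put(bdev, holder);       /* must match the holder used at open time */
            return 0;
    }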
-
Christoph Hellwig authored
The mode argument to the ->release block_device_operation is never used, so remove it. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
->open is only called on the whole device. Make that explicit by passing a gendisk instead of the block_device. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
bdev_check_media_change should only ever be called for the whole device. Pass a gendisk to make that explicit and rename the function to disk_check_media_change. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-8-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 05 Jun, 2023 1 commit
-
-
Christoph Hellwig authored
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for things like notification of a removed device or a device resize. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jan Kara <jack@suse.cz> Acked-by:
Dave Chinner <dchinner@redhat.com> Reviewed-by:
Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
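A hedged sketch of what a holder can now register so the block layer can call back into it on events like device removal; mark_dead is the kind of callback described above, and the md-flavoured names are hypothetical:

    static void md_member_mark_dead(struct block_device *bdev)     /* hypothetical callback */
    {
            /* react to the underlying device disappearing, e.g. fail the member */
    }

    static const struct blk_holder_ops md_member_holder_ops = {
            .mark_dead = md_member_mark_dead,
    };

    /* at open time the ops are handed over along with the holder, e.g.:
     *   blkdev_get_by_dev(dev, mode, rdev, &md_member_holder_ops);
     */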
-
- 31 May, 2023 1 commit
-
-
Johannes Thumshirn authored
The md-raid superblock writing code uses bio_add_page() to add a page to a newly created bio. bio_add_page() can fail, but the return value is never checked. Use __bio_add_page() as adding a single page to a newly created bio is guaranteed to succeed. This brings us a step closer to marking bio_add_page() as __must_check. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Acked-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/ca196f5e650e318106dbb4496eb6cbac4bc800bd.1685532726.git.johannes.thumshirn@wdc.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
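The distinction being leaned on: bio_add_page() can fail and must be checked, while __bio_add_page() is only legal on a freshly created bio that is guaranteed to have room — exactly the superblock-write pattern. A hedged sketch (variable names like sb_page/sb_sector are assumptions, not the md code):

    static void write_sb_sketch(struct md_rdev *rdev, struct page *sb_page, sector_t sb_sector)
    {
            struct bio bio;
            struct bio_vec bvec;

            bio_init(&bio, rdev->bdev, &bvec, 1, REQ_OP_WRITE | REQ_SYNC | REQ_FUA);
            bio.bi_iter.bi_sector = sb_sector;
            __bio_add_page(&bio, sb_page, PAGE_SIZE, 0);  /* cannot fail: fresh bio, reserved bvec */
            submit_bio_wait(&bio);
    }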
-
- 14 Apr, 2023 1 commit
-
-
Yu Kuai authored
status_resync() will calculate 'curr_resync - recovery_active' to show the user a progress bar like the following:

[============>........] resync = 61.4%

'curr_resync' and 'recovery_active' are updated in md_do_sync(), and status_resync() can read them concurrently; hence it's possible for 'curr_resync - recovery_active' to overflow to a huge number. In this case status_resync() will be stuck in the loop printing a large amount of '=', which ends up in a soft lockup. Fix the problem by setting 'resync' to MD_RESYNC_ACTIVE in this case; this way a resync in progress will be reported to the user. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Signed-off-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230310073855.1337560-3-yukuai1@huaweicloud.com
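The arithmetic issue in one picture: both counters are read without locking while md_do_sync() updates them, so the unsigned subtraction can wrap and the '=' loop never terminates. A hedged sketch of the guard described (types simplified):

    /* inside status_resync(), simplified sketch */
    unsigned long long curr   = mddev->curr_resync;
    unsigned long long active = atomic_read(&mddev->recovery_active);
    unsigned long long resync;

    if (curr <= active) {
            /* torn snapshot: report "resync in progress" instead of letting the
             * subtraction wrap and print '=' until the CPU soft-locks up */
            resync = MD_RESYNC_ACTIVE;
    } else {
            resync = curr - active;
    }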
-