1. 28 Sep, 2022 1 commit
  2. 27 Sep, 2022 20 commits
  3. 24 Sep, 2022 9 commits
  4. 23 Sep, 2022 1 commit
    • Merge branch 'md-next' of... · 4324796e
      Jens Axboe authored
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.1/block
      
      Pull MD updates and fixes from Song:
      
      "1. Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan.
       2. Raid10 performance optimization, by Yu Kuai."
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md: Fix spelling mistake in comments of r5l_log
        md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d
        md/raid10: convert resync_lock to use seqlock
        md/raid10: fix improper BUG_ON() in raise_barrier()
        md/raid10: prevent unnecessary calls to wake_up() in fast path
        md/raid10: don't modify 'nr_waitng' in wait_barrier() for the case nowait
        md/raid10: factor out code from wait_barrier() to stop_waiting_barrier()
        md: Remove extra mddev_get() in md_seq_start()
        md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk()
        md/raid5: Ensure stripe_fill happens on non-read IO with journal
        md/raid5: Don't read ->active_stripes if it's not needed
        md/raid5: Cleanup prototype of raid5_get_active_stripe()
        md/raid5: Drop extern on function declarations in raid5.h
        md/raid5: Refactor raid5_get_active_stripe()
        md: Replace snprintf with scnprintf
        md/raid10: fix compile warning
        md/raid5: Fix spelling mistakes in comments
      4324796e
  5. 22 Sep, 2022 9 commits
    • md: Fix spelling mistake in comments of r5l_log · 65b94b52
      Zhou nan authored
      Fix spelling of dones't in comments.
      Signed-off-by: Zhou nan <zhounan@nfschina.com>
      Signed-off-by: Song Liu <song@kernel.org>
      65b94b52
    • md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d · 5e2cf333
      Logan Gunthorpe authored
      A complicated deadlock exists when using the journal and an elevated
      group_thread_cnt. It was found with loop devices, but it's not clear
      whether it can be seen with real disks. The deadlock can occur simply
      by writing data with an fio script.
      
      When the deadlock occurs, multiple threads will hang in different ways:
      
       1) The group threads will hang in the blk-wbt code with bios waiting to
          be submitted to the block layer:
      
              io_schedule+0x70/0xb0
              rq_qos_wait+0x153/0x210
              wbt_wait+0x115/0x1b0
              __rq_qos_throttle+0x38/0x60
              blk_mq_submit_bio+0x589/0xcd0
              __submit_bio+0xe6/0x100
              submit_bio_noacct_nocheck+0x42e/0x470
              submit_bio_noacct+0x4c2/0xbb0
              ops_run_io+0x46b/0x1a30
              handle_stripe+0xcd3/0x36b0
              handle_active_stripes.constprop.0+0x6f6/0xa60
              raid5_do_work+0x177/0x330
      
          Or:
              io_schedule+0x70/0xb0
              rq_qos_wait+0x153/0x210
              wbt_wait+0x115/0x1b0
              __rq_qos_throttle+0x38/0x60
              blk_mq_submit_bio+0x589/0xcd0
              __submit_bio+0xe6/0x100
              submit_bio_noacct_nocheck+0x42e/0x470
              submit_bio_noacct+0x4c2/0xbb0
              flush_deferred_bios+0x136/0x170
              raid5_do_work+0x262/0x330
      
       2) The r5l_reclaim thread will hang in the same way, submitting a
          bio to the block layer:
      
              io_schedule+0x70/0xb0
              rq_qos_wait+0x153/0x210
              wbt_wait+0x115/0x1b0
              __rq_qos_throttle+0x38/0x60
              blk_mq_submit_bio+0x589/0xcd0
              __submit_bio+0xe6/0x100
              submit_bio_noacct_nocheck+0x42e/0x470
              submit_bio_noacct+0x4c2/0xbb0
              submit_bio+0x3f/0xf0
              md_super_write+0x12f/0x1b0
              md_update_sb.part.0+0x7c6/0xff0
              md_update_sb+0x30/0x60
              r5l_do_reclaim+0x4f9/0x5e0
              r5l_reclaim_thread+0x69/0x30b
      
          However, before hanging, the MD_SB_CHANGE_PENDING flag will be
          set for sb_flags in r5l_write_super_and_discard_space(). This
          flag will never be cleared because the submit_bio() call never
          returns.
      
       3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
          will do no processing on any pending stripes and re-set
          STRIPE_HANDLE. This will cause the raid5d thread to enter an
          infinite loop, constantly trying to handle the same stripes
          stuck in the queue.
      
          The raid5d thread has a blk_plug that holds a number of bios
          that are also stuck waiting, because the thread is in a loop
          that never schedules. These bios have already been accounted
          for by blk-wbt, thus preventing the other threads above from
          continuing when they try to submit bios. --Deadlock.
      
      To fix this, add the same wait_event() that is used in raid5_do_work()
      to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will
      schedule and wait until the flag is cleared. The schedule action will
      flush the plug which will allow the r5l_reclaim thread to continue,
      thus preventing the deadlock.
      
      However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING
      from the same thread and can thus deadlock if the thread is put to
      sleep. So avoid waiting if md_check_recovery() is being called in the
      loop.
      
      It's not clear when the deadlock was introduced, but the similar
      wait_event() call in raid5_do_work() was added in 2017 by this
      commit:
      
          16d997b7 ("md/raid5: simplfy delaying of writes while metadata
                         is updated.")
      
      Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Song Liu <song@kernel.org>
      5e2cf333
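
      For illustration only, here is a minimal userspace sketch (pthreads,
      hypothetical names, not the kernel patch itself) of the pattern the
      fix above applies to raid5d(): instead of looping while a
      "superblock change pending" flag is set, the thread sleeps until
      the flag is cleared, which lets its plugged I/O drain:

          #include <pthread.h>
          #include <stdbool.h>

          /* Hypothetical stand-ins for mddev->sb_flags and mddev->sb_wait. */
          static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
          static pthread_cond_t  sb_wait = PTHREAD_COND_INITIALIZER;
          static bool sb_change_pending;

          /* Daemon loop: rather than spinning on stripes it cannot handle
           * (and never flushing its plugged bios), sleep until the pending
           * flag is cleared, as raid5_do_work() already does. */
          static void daemon_loop(void)
          {
                  for (;;) {
                          pthread_mutex_lock(&lock);
                          while (sb_change_pending)
                                  pthread_cond_wait(&sb_wait, &lock);
                          pthread_mutex_unlock(&lock);
                          /* ... handle pending stripes ... */
                  }
          }

          /* Whoever clears the flag wakes the sleeping daemon. */
          static void clear_sb_change_pending(void)
          {
                  pthread_mutex_lock(&lock);
                  sb_change_pending = false;
                  pthread_cond_broadcast(&sb_wait);
                  pthread_mutex_unlock(&lock);
          }

      Note the caveat from the commit message: the real raid5d() must skip
      this wait when md_check_recovery() runs from the same loop, since
      that call is what clears the flag.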
    • Merge branch 'md-next-raid10-optimize' into md-next · 74173ff4
      Song Liu authored
      This patchset tries to avoid holding two locks unconditionally in the
      hot path.
      
      Test environment:
      
      Architecture:
      aarch64 Huawei KUNPENG 920
      x86 Intel(R) Xeon(R) Platinum 8380
      
      Raid10 initialize:
      mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 \
          /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
      
      Test cmd:
      (taskset -c 0-15) fio -name=0 -ioengine=libaio -direct=1 -\
          group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers \
          -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 \
          -rw=randread
      
      Test result:
      
      aarch64:
      before this patchset:           3.2 GiB/s
      bind node before this patchset: 6.9 GiB/s
      after this patchset:            7.9 GiB/s
      bind node after this patchset:  8.0 GiB/s

      x86 (bind node is not tested yet):
      before this patchset: 7.0 GiB/s
      after this patchset:  9.3 GiB/s
      
      Please note that on the aarch64 test machine, cross-node memory access
      latency is much worse than local-node access, which is why bandwidth
      improves so much when the workload is bound to a node.
      74173ff4
    • md/raid10: convert resync_lock to use seqlock · b9b083f9
      Yu Kuai authored
      Currently, wait_barrier() will hold 'resync_lock' to read 'conf->barrier',
      and io can't be dispatched until 'barrier' is dropped.
      
      Since holding the 'barrier' is not common, convert 'resync_lock' to use
      seqlock so that holding the lock can be avoided in the fast path.
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-and-Tested-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Song Liu <song@kernel.org>
      b9b083f9
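
      As a rough illustration of the seqlock idea (userspace sketch with
      hypothetical names, not the raid10 code): readers take no lock and
      simply retry if a writer raced with them, so the common read-mostly
      path stays lock-free:

          #include <stdatomic.h>

          /* Hypothetical stand-ins for conf->resync_lock / conf->barrier. */
          static atomic_uint seq;        /* even: no writer active */
          static int barrier_cnt;        /* data guarded by the sequence counter */

          /* Fast path: lockless read, retried only if a writer raced us. */
          static int read_barrier(void)
          {
                  unsigned int start;
                  int val;

                  do {
                          while ((start = atomic_load(&seq)) & 1)
                                  ;                      /* writer in progress */
                          val = barrier_cnt;
                  } while (atomic_load(&seq) != start);  /* retry if it changed */
                  return val;
          }

          /* Slow path (raise/lower the barrier): bump the sequence around
           * the update so concurrent readers know to retry. */
          static void write_barrier(int delta)
          {
                  atomic_fetch_add(&seq, 1);             /* odd: write in progress */
                  barrier_cnt += delta;
                  atomic_fetch_add(&seq, 1);             /* even again: stable */
          }

      The kernel's seqlock_t additionally serializes writers against each
      other and provides the required memory barriers; this sketch only
      shows why the read side no longer needs to take 'resync_lock'.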
    • md/raid10: fix improper BUG_ON() in raise_barrier() · 4f350284
      Yu Kuai authored
      'conf->barrier' is protected by 'conf->resync_lock'; reading
      'conf->barrier' without holding the lock is wrong.
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      4f350284
    • md/raid10: prevent unnecessary calls to wake_up() in fast path · 0c0be98b
      Yu Kuai authored
      Currently, wake_up() is called unconditionally in fast paths such as
      raid10_make_request(), which causes lock contention under high
      concurrency:
      
      raid10_make_request
       wake_up
        __wake_up_common_lock
         spin_lock_irqsave
      
      Improve performance by only calling wake_up() if the waitqueue is not
      empty in allow_barrier() and raid10_make_request().
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      0c0be98b
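
      A minimal sketch of the idea (hypothetical names, userspace
      pthreads/atomics, not the raid10 code): keep a count of sleepers and
      skip the lock-protected wake-up entirely when nobody is waiting:

          #include <pthread.h>
          #include <stdatomic.h>

          static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
          static pthread_cond_t  waitq = PTHREAD_COND_INITIALIZER;
          static atomic_int barrier;      /* condition the waiters sleep on */
          static atomic_int nr_waiting;   /* how many sleepers there are    */

          /* Slow path: register as a waiter before re-checking the condition,
           * so a concurrent waker either sees us or we see its update. */
          static void wait_barrier(void)
          {
                  pthread_mutex_lock(&lock);
                  atomic_fetch_add(&nr_waiting, 1);
                  while (atomic_load(&barrier))
                          pthread_cond_wait(&waitq, &lock);
                  atomic_fetch_sub(&nr_waiting, 1);
                  pthread_mutex_unlock(&lock);
          }

          /* Fast path: clear the condition first, then only pay for the
           * wake-up (and its internal lock) when someone is actually waiting. */
          static void allow_barrier(void)
          {
                  atomic_store(&barrier, 0);
                  if (atomic_load(&nr_waiting)) {
                          pthread_mutex_lock(&lock);
                          pthread_cond_broadcast(&waitq);
                          pthread_mutex_unlock(&lock);
                  }
          }

      The "is anyone sleeping?" check is what avoids the
      spin_lock_irqsave() taken inside __wake_up_common_lock() on every
      request in the trace above.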
    • md/raid10: don't modify 'nr_waitng' in wait_barrier() for the case nowait · 0de57e54
      Yu Kuai authored
      For the nowait case in wait_barrier(), there is no point in increasing
      nr_waiting only to decrease it again.
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      0de57e54
    • md/raid10: factor out code from wait_barrier() to stop_waiting_barrier() · ed2e063f
      Yu Kuai authored
      Currently the nasty condition in wait_barrier() is hard to read. This
      patch factors out the condition into a function.
      
      There are no functional changes.
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      ed2e063f
    • md: Remove extra mddev_get() in md_seq_start() · 3bfc3bcd
      Logan Gunthorpe authored
      A regression was seen where mddev devices persist after they are
      stopped, due to an elevated reference count.
      
      This was tracked down to an extra mddev_get() in md_seq_start().
      
      It only happened rarely because most of the time md_seq_start() is
      called with a zero offset. The path with the extra mddev_get() is only
      taken when it starts with a non-zero offset.
      
      The commit noted below changed an mddev_get() to check its success
      but inadvertently left the original call in. Remove the extra call.
      
      Fixes: 12a6caf2 ("md: only delete entries from all_mddevs when the disk is freed")
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      3bfc3bcd
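
      Illustrative only (hypothetical names, not the md code): the failure
      mode is a plain get/put imbalance, where one path takes two
      references but the matching teardown only drops one, so the count
      never reaches zero and the object is never freed:

          #include <stdatomic.h>
          #include <stdlib.h>

          struct obj { atomic_int ref; };

          static struct obj *obj_get(struct obj *o)
          {
                  atomic_fetch_add(&o->ref, 1);
                  return o;
          }

          static void obj_put(struct obj *o)
          {
                  /* fetch_sub returns the old value: 1 means last reference */
                  if (atomic_fetch_sub(&o->ref, 1) == 1)
                          free(o);
          }

          /* Iteration start: exactly one get here is matched by one put in
           * the corresponding stop handler.  The extra get on the non-zero
           * offset path pins the object forever. */
          static struct obj *iter_start(struct obj *o, long offset)
          {
                  struct obj *ref = obj_get(o);

                  if (offset)
                          obj_get(o);     /* BUG: second reference, never dropped */
                  return ref;
          }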