1. 14 Apr, 2023 9 commits
    • Yu Kuai's avatar
      md: fix soft lockup in status_resync · 6efddf1e
      Yu Kuai authored
      status_resync() will calculate 'curr_resync - recovery_active' to show
      user a progress bar like following:
      
      [============>........]  resync = 61.4%
      
      'curr_resync' and 'recovery_active' is updated in md_do_sync(), and
      status_resync() can read them concurrently, hence it's possible that
      'curr_resync - recovery_active' can overflow to a huge number. In this
      case status_resync() will be stuck in the loop to print a large amount
      of '=', which will end up soft lockup.
      
      Fix the problem by setting 'resync' to MD_RESYNC_ACTIVE in this case,
      this way resync in progress will be reported to user.
      Signed-off-by: default avatarYu Kuai <yukuai3@huawei.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230310073855.1337560-3-yukuai1@huaweicloud.com
      6efddf1e
    • Mariusz Tkaczyk's avatar
      md: add error_handlers for raid0 and linear · c31fea2f
      Mariusz Tkaczyk authored
      After the commit 9631abdb("md: Set MD_BROKEN for RAID1 and RAID10")
      MD_BROKEN must be set if array is failed because state_store() checks it.
      If it is set then -EBUSY is returned to userspace.
      
      For raid0 and linear MD_BROKEN is not set by error_handler(). As a result
      mdadm is unable to trigger clean-up actions. It is a regression.
      
      This patch adds appropriate error_handler for raid0 and linear. The
      error handler sets MD_BROKEN for this device.
      Reviewed-by: default avatarXiao Ni <xni@redhat.com>
      Signed-off-by: default avatarMariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230306130317.3418-1-mariusz.tkaczyk@linux.intel.com
      c31fea2f
    • Jon Derrick's avatar
      md: Use optimal I/O size for last bitmap page · 8745faa9
      Jon Derrick authored
      If the bitmap space has enough room, size the I/O for the last bitmap
      page write to the optimal I/O size for the storage device. The expanded
      write is checked that it won't overrun the data or metadata.
      
      The drive this was tested against has higher latencies when there are
      sub-4k writes due to device-side read-mod-writes of its atomic 4k write
      unit. This change helps increase performance by sizing the last bitmap
      page I/O for the device's preferred write unit, if it is given.
      
      Example Intel/Solidigm P5520
      Raid10, Chunk-size 64M, bitmap-size 57228 bits
      
      $ mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme{0,1,2,3}n1
              --assume-clean --bitmap=internal --bitmap-chunk=64M
      $ fio --name=test --direct=1 --filename=/dev/md0 --rw=randwrite --bs=4k --runtime=60
      
      Without patch:
        write: IOPS=1676, BW=6708KiB/s (6869kB/s)(393MiB/60001msec); 0 zone resets
      
      With patch:
        write: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(3683MiB/60001msec); 0 zone resets
      
      Biosnoop:
      Without patch:
      Time        Process        PID     Device      LBA        Size      Lat
      1.410377    md0_raid10     6900    nvme0n1   W 16         4096      0.02
      1.410387    md0_raid10     6900    nvme2n1   W 16         4096      0.02
      1.410374    md0_raid10     6900    nvme3n1   W 16         4096      0.01
      1.410381    md0_raid10     6900    nvme1n1   W 16         4096      0.02
      1.410411    md0_raid10     6900    nvme1n1   W 115346512  4096      0.01
      1.410418    md0_raid10     6900    nvme0n1   W 115346512  4096      0.02
      1.410915    md0_raid10     6900    nvme2n1   W 24         3584      0.43 <--
      1.410935    md0_raid10     6900    nvme3n1   W 24         3584      0.45 <--
      1.411124    md0_raid10     6900    nvme1n1   W 24         3584      0.64 <--
      1.411147    md0_raid10     6900    nvme0n1   W 24         3584      0.66 <--
      1.411176    md0_raid10     6900    nvme3n1   W 2019022184 4096      0.01
      1.411189    md0_raid10     6900    nvme2n1   W 2019022184 4096      0.02
      
      With patch:
      Time        Process        PID     Device      LBA        Size      Lat
      5.747193    md0_raid10     727     nvme0n1   W 16         4096      0.01
      5.747192    md0_raid10     727     nvme1n1   W 16         4096      0.02
      5.747195    md0_raid10     727     nvme3n1   W 16         4096      0.01
      5.747202    md0_raid10     727     nvme2n1   W 16         4096      0.02
      5.747229    md0_raid10     727     nvme3n1   W 1196223704 4096      0.02
      5.747224    md0_raid10     727     nvme0n1   W 1196223704 4096      0.01
      5.747279    md0_raid10     727     nvme0n1   W 24         4096      0.01 <--
      5.747279    md0_raid10     727     nvme1n1   W 24         4096      0.02 <--
      5.747284    md0_raid10     727     nvme3n1   W 24         4096      0.02 <--
      5.747291    md0_raid10     727     nvme2n1   W 24         4096      0.02 <--
      5.747314    md0_raid10     727     nvme2n1   W 2234636712 4096      0.01
      5.747317    md0_raid10     727     nvme1n1   W 2234636712 4096      0.02
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJon Derrick <jonathan.derrick@linux.dev>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230224183323.638-4-jonathan.derrick@linux.dev
      8745faa9
    • Jon Derrick's avatar
      md: Fix types in sb writer · 10172f20
      Jon Derrick authored
      Page->index is a pgoff_t and multiplying could cause overflows on a
      32-bit architecture. In the sb writer, this is used to calculate and
      verify the sector being used, and is multiplied by a sector value. Using
      sector_t will cast it to a u64 type and is the more appropriate type for
      the unit. Additionally, the integer size unit is converted to a sector
      unit in later calculations, and is now corrected to be an unsigned type.
      
      Finally, clean up the calculations using variable aliases to improve
      readabiliy.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJon Derrick <jonathan.derrick@linux.dev>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230224183323.638-3-jonathan.derrick@linux.dev
      10172f20
    • Jon Derrick's avatar
      md: Move sb writer loop to its own function · 328e17d8
      Jon Derrick authored
      Preparatory patch for optimal I/O size calculation. Move the sb writer
      loop routine into its own function for clarity.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJon Derrick <jonathan.derrick@linux.dev>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230224183323.638-2-jonathan.derrick@linux.dev
      328e17d8
    • Jiangshan Yi's avatar
    • Thomas Weißschuh's avatar
      md: make kobj_type structures constant · 4d72a9de
      Thomas Weißschuh authored
      Since commit ee6d3dd4 ("driver core: make kobj_type constant.")
      the driver core allows the usage of const struct kobj_type.
      
      Take advantage of this to constify the structure definitions to prevent
      modification at runtime.
      Signed-off-by: default avatarThomas Weißschuh <linux@weissschuh.net>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230214-kobj_type-md-v1-1-d6853f707f11@weissschuh.net
      4d72a9de
    • Li Nan's avatar
      md/raid10: fix null-ptr-deref in raid10_sync_request · a405c6f0
      Li Nan authored
      init_resync() inits mempool and sets conf->have_replacemnt at the beginning
      of sync, close_sync() frees the mempool when sync is completed.
      
      After [1] recovery might be skipped and init_resync() is called but
      close_sync() is not. null-ptr-deref occurs with r10bio->dev[i].repl_bio.
      
      The following is one way to reproduce the issue.
      
        1) create a array, wait for resync to complete, mddev->recovery_cp is set
           to MaxSector.
        2) recovery is woken and it is skipped. conf->have_replacement is set to
           0 in init_resync(). close_sync() not called.
        3) some io errors and rdev A is set to WantReplacement.
        4) a new device is added and set to A's replacement.
        5) recovery is woken, A have replacement, but conf->have_replacemnt is
           0. r10bio->dev[i].repl_bio will not be alloced and null-ptr-deref
           occurs.
      
      Fix it by not calling init_resync() if recovery skipped.
      
      [1] commit 7e83ccbe ("md/raid10: Allow skipping recovery when clean arrays are assembled")
      Fixes: 7e83ccbe ("md/raid10: Allow skipping recovery when clean arrays are assembled")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLi Nan <linan122@huawei.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230222041000.3341651-3-linan666@huaweicloud.com
      a405c6f0
    • Li Nan's avatar
      md/raid10: fix task hung in raid10d · 72c215ed
      Li Nan authored
      commit fe630de0 ("md/raid10: avoid deadlock on recovery.") allowed
      normal io and sync io to exist at the same time. Task hung will occur as
      below:
      
      T1                      T2		T3		T4
      raid10d
       handle_read_error
        allow_barrier
         conf->nr_pending--
          -> 0
                              //submit sync io
                              raid10_sync_request
                               raise_barrier
      			  ->will not be blocked
      			  ...
      			//submit to drivers
        raid10_read_request
         wait_barrier
          conf->nr_pending++
           -> 1
      					//retry read fail
      					raid10_end_read_request
      					 reschedule_retry
      					  add to retry_list
      					  conf->nr_queued++
      					   -> 1
      							//sync io fail
      							end_sync_read
      							 __end_sync_read
      							  reschedule_retry
      							   add to retry_list
      					                    conf->nr_queued++
      							     -> 2
       ...
       handle_read_error
       get form retry_list
       conf->nr_queued--
        freeze_array
         wait nr_pending == nr_queued+1
              ->1	      ->2
         //task hung
      
      retry read and sync io will be added to retry_list(nr_queued->2) if they
      fails. raid10d() called handle_read_error() and hung in freeze_array().
      nr_queued will not decrease because raid10d is blocked, nr_pending will
      not increase because conf->barrier is not released.
      
      Fix it by moving allow_barrier() after raid10_read_request().
      raise_barrier() will wait for nr_waiting to become 0. Therefore, sync io
      and regular io will not be issued at the same time.
      
      Also remove the check of nr_queued in stop_waiting_barrier. It can be 0
      but don't need to be blocking. Remove the check for MD_RECOVERY_RUNNING as
      the check is redundent.
      
      Fixes: fe630de0 ("md/raid10: avoid deadlock on recovery.")
      Signed-off-by: default avatarLi Nan <linan122@huawei.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230222041000.3341651-2-linan666@huaweicloud.com
      72c215ed
  2. 13 Apr, 2023 31 commits