1. 12 Jul, 2024 2 commits
    • md-cluster: fix no recovery job when adding/re-adding a disk · 35a0a409
      Heming Zhao authored
      The commit db5e653d ("md: delay choosing sync action to
      md_start_sync()") delays the start of the sync action. In a
      clustered environment, this will cause another node to first
      activate the spare disk and skip recovery. As a result, no
      nodes will perform recovery when a disk is added or re-added.
      
      Before db5e653d:
      
      ```
         node1                                node2
      ----------------------------------------------------------------
      md_check_recovery
       + md_update_sb
       |  sendmsg: METADATA_UPDATED
       + md_choose_sync_action           process_metadata_update
       |  remove_and_add_spares           //node1 has not finished adding
       + call mddev->sync_work            //the spare disk:do nothing
      
      md_start_sync
       starts md_do_sync
      
      md_do_sync
       + grabbed resync_lockres:DLM_LOCK_EX
       + do syncing job
      
      md_check_recovery
       sendmsg: METADATA_UPDATED
                                       process_metadata_update
                                         //activate spare disk
      
                                       ... ...
      
                                       md_do_sync
                                        waiting to grab resync_lockres:EX
      ```
      
      After db5e653d:
      
      (note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery
      starts md_start_sync, the INTR action further delays node1's call
      to md_do_sync.)
      
      ```
         node1                                node2
      ----------------------------------------------------------------
      md_check_recovery
       + md_update_sb
       |  sendmsg: METADATA_UPDATED
       + calls mddev->sync_work         process_metadata_update
                                         //node1 has not finished adding
                                         //the spare disk:do nothing
      
      md_start_sync
       + md_choose_sync_action
       |  remove_and_add_spares
       + calls md_do_sync
      
      md_check_recovery
       md_update_sb
        sendmsg: METADATA_UPDATED
                                        process_metadata_update
                                          //activate spare disk
      
        ... ...                         ... ...
      
                                        md_do_sync
                                         + grabbed resync_lockres:EX
                                         + raid1_sync_request skip sync under
                                           conf->fullsync:0
      md_do_sync
       1. waiting to grab resync_lockres:EX
       2. when node1 could grab EX lock,
          node1 will skip resync under recovery_offset:MaxSector
      ```
      
      How to trigger:
      
      ```
      # commands @node1
       # to easily watch the recovery status
      echo 2000 > /proc/sys/dev/raid/speed_limit_max
      ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max"
      
      mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean
      ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb
      mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
      mdadm --manage /dev/md0 --add /dev/sdc
      
      === "cat /proc/mdstat" on both nodes: there is no recovery action. ===
      ```
      
      How to fix:
      
      Because the md layer's code flow is hard to restructure to speed up
      the sync job on the local node, add a new cluster msg that makes the
      other node defer activating the spare disk.
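
      The ordering problem and the fix can be sketched as a toy model. The
      message name RESYNC_PENDING and the handler below are hypothetical
      stand-ins for the new cluster message this patch introduces; the real
      protocol lives in drivers/md/md-cluster.c.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* Hypothetical message types modeling the md-cluster protocol;
       * RESYNC_PENDING is an illustrative name, not the kernel's enum. */
      enum cluster_msg { METADATA_UPDATED, RESYNC_PENDING };

      struct node_state {
          bool defer_activate;  /* set when a resync-pending msg arrives */
          bool spare_activated;
      };

      /* node2's handler: without the fix it activates the spare as soon
       * as METADATA_UPDATED shows the new disk; with the fix it holds off
       * while node1's resync is still pending. */
      static void process_msg(struct node_state *n, enum cluster_msg t)
      {
          switch (t) {
          case RESYNC_PENDING:
              n->defer_activate = true;
              break;
          case METADATA_UPDATED:
              if (!n->defer_activate)
                  n->spare_activated = true;
              break;
          }
      }

      int main(void)
      {
          struct node_state old = {0}, fixed = {0};

          /* Old flow: node2 activates the spare first, so node1 later
           * finds recovery_offset:MaxSector and skips resync. */
          process_msg(&old, METADATA_UPDATED);
          assert(old.spare_activated);

          /* Fixed flow: node1 announces the pending resync first, so
           * node2 defers and node1's md_do_sync performs real recovery. */
          process_msg(&fixed, RESYNC_PENDING);
          process_msg(&fixed, METADATA_UPDATED);
          assert(!fixed.spare_activated);

          printf("ok\n");
          return 0;
      }
      ```
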
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Reviewed-by: Su Yue <glass.su@suse.com>
      Acked-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com
    • md-cluster: fix hanging issue while a new disk adding · fff42f21
      Heming Zhao authored
      The commit 1bbe254e ("md-cluster: check for timeout while a
      new disk adding") is correct in terms of code syntax, but it does
      not fit the real clustered code logic.
      
      When a timeout occurs while adding a new disk, if recv_daemon()
      bypasses the unlock for ack_lockres:CR, another node will be waiting
      to grab EX lock. This will cause the cluster to hang indefinitely.
      
      How to fix:
      
      1. In dlm_lock_sync(), change the wait behaviour from waiting forever
         to waiting with a timeout. This avoids the hang when another node
         fails to handle a cluster msg. Another result of this change is
         that if another node receives an unknown msg (e.g. a new msg_type),
         the old code would hang forever, whereas the new code times out and
         fails. This helps cluster_md handle a new msg_type coming from
         nodes with different kernel/module versions (e.g. the user updates
         only one leg's kernel and monitors the stability of the new
         kernel).
      2. Under the old design, __sendmsg() always returned 0 (success)
         because it must successfully unlock ->message_lockres. This commit
         makes the function return an error number when an error occurs.
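
      The two changes follow a common pattern that can be shown in user
      space: a bounded wait whose timeout error is propagated instead of
      swallowed. Below, pthread_cond_timedwait() stands in for the kernel's
      timed completion wait; the function names mirror the commit message,
      but the bodies are illustrative, not the kernel implementation.

      ```c
      #include <errno.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <time.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t granted = PTHREAD_COND_INITIALIZER;
      static int lock_granted; /* would be set by the DLM grant callback */

      /* Analog of the fixed dlm_lock_sync(): wait with a deadline rather
       * than forever, so a peer that never answers cannot hang us. */
      static int dlm_lock_sync_timeout(int timeout_ms)
      {
          struct timespec ts;
          int rc = 0;

          clock_gettime(CLOCK_REALTIME, &ts);
          ts.tv_sec += timeout_ms / 1000;
          ts.tv_nsec += (timeout_ms % 1000) * 1000000L;
          if (ts.tv_nsec >= 1000000000L) {
              ts.tv_sec++;
              ts.tv_nsec -= 1000000000L;
          }

          pthread_mutex_lock(&lock);
          while (!lock_granted && rc == 0)
              rc = pthread_cond_timedwait(&granted, &lock, &ts);
          pthread_mutex_unlock(&lock);

          return rc == ETIMEDOUT ? -ETIMEDOUT : -rc;
      }

      /* Analog of the fixed __sendmsg(): propagate the error number
       * instead of unconditionally returning 0. */
      static int __sendmsg(void)
      {
          int ret = dlm_lock_sync_timeout(50); /* 50 ms for the demo */
          if (ret)
              fprintf(stderr, "lock not granted: %d\n", ret);
          return ret; /* the old code returned 0 here */
      }

      int main(void)
      {
          /* No peer ever grants the lock: the wait times out and the
           * caller sees the failure instead of the cluster hanging. */
          printf("%d\n", __sendmsg() == -ETIMEDOUT ? 1 : 0);
          return 0;
      }
      ```
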
      
      Fixes: 1bbe254e ("md-cluster: check for timeout while a new disk adding")
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Reviewed-by: Su Yue <glass.su@suse.com>
      Acked-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@suse.com