Commits · 0c0cbd4ebc375ceebc75c89df04b74f215fab23a · Kirill Smelkov / linux

27 Jul, 2023 2 commits

ublk: fail to recover device if queue setup is interrupted · 0c0cbd4e

Ming Lei authored Jul 26, 2023

In ublk_ctrl_end_recovery(), if wait_for_completion_interruptible() is
interrupted by signal, queues aren't setup successfully yet, so we
have to fail UBLK_CMD_END_USER_RECOVERY, otherwise kernel oops can be
triggered.

Fixes: c732a852 ("ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support")
Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20230726144502.566785-3-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

0c0cbd4e

ublk: fail to start device if queue setup is interrupted · 53e7d08f

Ming Lei authored Jul 26, 2023

In ublk_ctrl_start_dev(), if wait_for_completion_interruptible() is
interrupted by signal, queues aren't setup successfully yet, so we
have to fail UBLK_CMD_START_DEV, otherwise kernel oops can be triggered.

Reported by German when working on qemu-storage-deamon which requires
single thread ublk daemon.

Fixes: 71f28f31 ("ublk_drv: add io_uring based userspace block driver")
Reported-by: German Maglione <gmaglione@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230726144502.566785-2-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

53e7d08f

25 Jul, 2023 1 commit

block: Fix a source code comment in include/uapi/linux/blkzoned.h · e0933b52

Bart Van Assche authored Jul 06, 2023

Fix the symbolic names for zone conditions in the blkzoned.h header
file.

Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Fixes: 6a0cb1bc ("block: Implement support for zoned block devices")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230706201422.3987341-1-bvanassche@acm.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

e0933b52

24 Jul, 2023 4 commits

s390/dasd: print copy pair message only for the correct error · 856d8e3c

Stefan Haberland authored Jul 21, 2023

The DASD driver has certain types of requests that might be rejected by
the storage server or z/VM because they are not supported. Since the
missing support of the command is not a real issue there is no user
visible kernel error message for this.

For copy pair setups  there is a specific error that IO is not allowed on
secondary devices. This error case is explicitly handled and an error
message is printed.

The code checking for the error did use a bitwise 'and' that is used to
check for specific bits. But in this case the whole sense byte has to
match.

This leads to the problem that the copy pair related error message is
erroneously printed for other error cases that are usually not reported.
This might heavily confuse users and lead to follow on actions that might
disrupt application processing.

Fix by checking the sense byte for the exact value and not single bits.

Cc: stable@vger.kernel.org # 6.1+
Fixes: 1fca631a ("s390/dasd: suppress generic error messages for PPRC secondary devices")
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20230721193647.3889634-5-sth@linux.ibm.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

856d8e3c

s390/dasd: fix hanging device after request requeue · 8a2278ce

Stefan Haberland authored Jul 21, 2023

The DASD device driver has a function to requeue requests to the
blocklayer.
This function is used in various cases when basic settings for the device
have to be changed like High Performance Ficon related parameters or copy
pair settings.

The functions iterates over the device->ccw_queue and also removes the
requests from the block->ccw_queue.
In case the device is started on an alias device instead of the base
device it might be removed from the block->ccw_queue without having it
canceled properly before. This might lead to a hanging device since the
request is no longer on a queue and can not be handled properly.

Fix by iterating over the block->ccw_queue instead of the
device->ccw_queue. This will take care of all blocklayer related requests
and handle them on all associated DASD devices.
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20230721193647.3889634-4-sth@linux.ibm.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

8a2278ce

s390/dasd: use correct number of retries for ERP requests · acea28a6

Stefan Haberland authored Jul 21, 2023

If a DASD request fails an error recovery procedure (ERP) request might
be built as a copy of the original request to do error recovery.

The ERP request gets a number of retries assigned.
This number is always 256 no matter what other value might have been set
for the original request. This is not what is expected when a user
specifies a certain amount of retries for the device via sysfs.

Correctly use the number of retries of the original request for ERP
requests.
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20230721193647.3889634-3-sth@linux.ibm.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

acea28a6

s390/dasd: fix hanging device after quiesce/resume · 05f1d8ed

Stefan Haberland authored Jul 21, 2023

Quiesce and resume are functions that tell the DASD driver to stop/resume
issuing I/Os to a specific DASD.

On resume dasd_schedule_block_bh() is called to kick handling of IO
requests again. This does unfortunately not cover internal requests which
are used for path verification for example.

This could lead to a hanging device when a path event or anything else
that triggers internal requests occurs on a quiesced device.

Fix by also calling dasd_schedule_device_bh() which triggers handling of
internal requests on resume.

Fixes: 8e09f215 ("[S390] dasd: add hyper PAV support to DASD device driver, part 1")

Cc: stable@vger.kernel.org
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20230721193647.3889634-2-sth@linux.ibm.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

05f1d8ed

21 Jul, 2023 3 commits

loop: do not enforce max_loop hard limit by (new) default · bb5faa99

Mauricio Faria de Oliveira authored Jul 20, 2023

Problem:

The max_loop parameter is used for 2 different purposes:

1) initial number of loop devices to pre-create on init
2) maximum number of loop devices to add on access/open()

Historically, its default value (zero) caused 1) to create non-zero
number of devices (CONFIG_BLK_DEV_LOOP_MIN_COUNT), and no hard limit on
2) to add devices with autoloading.

However, the default value changed in commit 85c50197 ("loop: Fix
the max_loop commandline argument treatment when it is set to 0") to
CONFIG_BLK_DEV_LOOP_MIN_COUNT, for max_loop=0 not to pre-create devices.

That does improve 1), but unfortunately it breaks 2), as the default
behavior changed from no-limit to hard-limit.

Example:

For example, this userspace code broke for N >= CONFIG, if the user
relied on the default value 0 for max_loop:

    mknod("/dev/loopN");
    open("/dev/loopN");  // now fails with ENXIO

Though affected users may "fix" it with (loop.)max_loop=0, this means to
require a kernel parameter change on stable kernel update (that commit
Fixes: an old commit in stable).

Solution:

The original semantics for the default value in 2) can be applied if the
parameter is not set (ie, default behavior).

This still keeps the intended function in 1) and 2) if set, and that
commit's intended improvement in 1) if max_loop=0.

Before 85c50197:
  - default:     1) CONFIG devices   2) no limit
  - max_loop=0:  1) CONFIG devices   2) no limit
  - max_loop=X:  1) X devices        2) X limit

After 85c50197:
  - default:     1) CONFIG devices   2) CONFIG limit (*)
  - max_loop=0:  1) 0 devices (*)    2) no limit
  - max_loop=X:  1) X devices        2) X limit

This commit:
  - default:     1) CONFIG devices   2) no limit (*)
  - max_loop=0:  1) 0 devices        2) no limit
  - max_loop=X:  1) X devices        2) X limit

Future:

The issue/regression from that commit only affects code under the
CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard, thus the fix too is
contained under it.

Once that deprecated functionality/code is removed, the purpose 2) of
max_loop (hard limit) is no longer in use, so the module parameter
description can be changed then.

Tests:

Linux 6.4-rc7
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
CONFIG_BLOCK_LEGACY_AUTOLOAD=y

- default (original)

	# ls -1 /dev/loop*
	/dev/loop-control
	/dev/loop0
	...
	/dev/loop7

	# ./test-loop
	open: /dev/loop8: No such device or address

- default (patched)

	# ls -1 /dev/loop*
	/dev/loop-control
	/dev/loop0
	...
	/dev/loop7

	# ./test-loop
	#

- max_loop=0 (original & patched):

	# ls -1 /dev/loop*
	/dev/loop-control

	# ./test-loop
	#

- max_loop=8 (original & patched):

	# ls -1 /dev/loop*
	/dev/loop-control
	/dev/loop0
	...
	/dev/loop7

	# ./test-loop
	open: /dev/loop8: No such device or address

- max_loop=0 (patched; CONFIG_BLOCK_LEGACY_AUTOLOAD is not set)

	# ls -1 /dev/loop*
	/dev/loop-control

	# ./test-loop
	open: /dev/loop8: No such device or address

Fixes: 85c50197 ("loop: Fix the max_loop commandline argument treatment when it is set to 0")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230720143033.841001-3-mfo@canonical.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

bb5faa99

loop: deprecate autoloading callback loop_probe() · 23881aec

Mauricio Faria de Oliveira authored Jul 20, 2023

The 'probe' callback in __register_blkdev() is only used under the
CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard.

The loop_probe() function is only used for that callback, so guard it
too, accordingly.

See commit fbdee71b ("block: deprecate autoloading based on dev_t").
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230720143033.841001-2-mfo@canonical.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

23881aec

sbitmap: fix batching wakeup · 10639737

David Jeffery authored Jul 21, 2023

Current code supposes that it is enough to provide forward progress by
just waking up one wait queue after one completion batch is done.

Unfortunately this way isn't enough, cause waiter can be added to wait
queue just after it is woken up.

Follows one example(64 depth, wake_batch is 8)

1) all 64 tags are active

2) in each wait queue, there is only one single waiter

3) each time one completion batch(8 completions) wakes up just one
   waiter in each wait queue, then immediately one new sleeper is added
   to this wait queue

4) after 64 completions, 8 waiters are wakeup, and there are still 8
   waiters in each wait queue

5) after another 8 active tags are completed, only one waiter can be
   wakeup, and the other 7 can't be waken up anymore.

Turns out it isn't easy to fix this problem, so simply wakeup enough
waiters for single batch.

Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20230721095715.232728-1-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

10639737

20 Jul, 2023 2 commits

blk-iocost: skip empty flush bio in iocost · 013adcbe

Chengming Zhou authored Jul 20, 2023

The flush bio may have data, may have no data (empty flush), we couldn't
calculate cost for empty flush bio. So we'd better just skip it for now.

Another side effect is that empty flush bio's bio_end_sector() is 0, cause
iocg->cursor reset to 0, may break the cost calculation of other bios.

This isn't good enough, since flush bio still consume the device bandwidth,
but flush request is special, can be merged randomly in the flush state
machine, we don't know how to calculate cost for it for now.

Its completion time also has flaws, which may include the pre-flush or
post-flush completion time, but I don't know if we need to fix that and
how to fix it.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230720121441.1408522-1-chengming.zhou@linux.devSigned-off-by: Jens Axboe <axboe@kernel.dk>

013adcbe

blk-mq: delete dead struct blk_mq_hw_ctx->queued field · 3641c90c

Chengming Zhou authored Jul 20, 2023

This counter is not used anywhere, so delete it.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230720095512.1403123-1-chengming.zhou@linux.devSigned-off-by: Jens Axboe <axboe@kernel.dk>

3641c90c

14 Jul, 2023 2 commits

blk-mq: Fix stall due to recursive flush plug · 70904263

Ross Lagerwall authored Jul 14, 2023

We have seen rare IO stalls as follows:

* blk_mq_plug_issue_direct() is entered with an mq_list containing two
requests.
* For the first request, it sets last == false and enters the driver's
queue_rq callback.
* The driver queue_rq callback indirectly calls schedule() which calls
blk_flush_plug(). This may happen if the driver has the
BLK_MQ_F_BLOCKING flag set and is allowed to sleep in ->queue_rq.
* blk_flush_plug() handles the remaining request in the mq_list. mq_list
is now empty.
* The original call to queue_rq resumes (with last == false).
* The loop in blk_mq_plug_issue_direct() terminates because there are no
remaining requests in mq_list.

The IO is now stalled because the last request submitted to the driver
had last == false and there was no subsequent call to commit_rqs().

Fix this by returning early in blk_mq_flush_plug_list() if rq_count is 0
which it will be in the recursive case, rather than checking if the
mq_list is empty. At the same time, adjust one of the callers to skip
the mq_list empty check as it is not necessary.

Fixes: dc5fc361 ("block: attempt direct issue of plug list")
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230714101106.3635611-1-ross.lagerwall@citrix.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

70904263

block: queue data commands from the flush state machine at the head · 9f87fc4d

Christoph Hellwig authored Jul 14, 2023

We used to insert the data commands following a pre-flush to the head
of the queue until commit 1e82fadf ("blk-mq: do not do head insertions
post-pre-flush commands"). Not doing this seems to cause hangs of
such commands on NFS workloads when exported from file systems with
SATA SSDs. I have no idea why this would starve these workloads,
but doing a semantic revert of this patch (which looks quite different
due to various other changes) fixes the hangs.

Fixes: 1e82fadf ("blk-mq: do not do head insertions post-pre-flush commands")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20230714143014.11879-1-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

9f87fc4d

13 Jul, 2023 4 commits

Merge tag 'nvme-6.5-2023-07-13' of git://git.infradead.org/nvme into block-6.5 · 90b46229

Jens Axboe authored Jul 13, 2023

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.5

 - Don't require quirk to use duplicate namespace identifiers
   (Christoph, Sagi)
 - One more BOGUS_NID quirk (Pankaj)
 - IO timeout and error hanlding fixes for PCI (Keith)
 - Enhanced metadata format mask fix (Ankit)
 - Association race condition fix for fibre channel (Michael)
 - Correct debugfs error checks (Minjie)
 - Use PAGE_SECTORS_SHIFT where needed (Damien)
 - Reduce kernel logs for legacy nguid attribute (Keith)
 - Use correct dma direction when unmapping metadata (Ming)"

* tag 'nvme-6.5-2023-07-13' of git://git.infradead.org/nvme:
  nvme-pci: fix DMA direction of unmapping integrity data
  nvme: don't reject probe due to duplicate IDs for single-ported PCIe devices
  nvme: ensure disabling pairs with unquiesce
  nvme-fc: fix race between error recovery and creating association
  nvme-fc: return non-zero status code when fails to create association
  nvme: fix parameter check in nvme_fault_inject_init()
  nvme: warn only once for legacy uuid attribute
  nvme: fix the NVME_ID_NS_NVM_STS_MASK definition
  nvmet: use PAGE_SECTORS_SHIFT
  nvme: add BOGUS_NID quirk for Samsung SM953

90b46229

blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq · 5c17f45e

Chengming Zhou authored Jul 10, 2023

The iocost rely on rq start_time_ns and alloc_time_ns to tell saturation
state of the block device. Most of the time request is allocated after
rq_qos_throttle() and its alloc_time_ns or start_time_ns won't be affected.

But for plug batched allocation introduced by the commit 47c122e3
("block: pre-allocate requests if plug is started and is a batch"), we can
rq_qos_throttle() after the allocation of the request. This is what the
blk_mq_get_cached_request() does.

In this case, the cached request alloc_time_ns or start_time_ns is much
ahead if blocked in any qos ->throttle().

Fix it by setting alloc_time_ns and start_time_ns to now when the allocated
request is actually used.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230710105516.2053478-1-chengming.zhou@linux.devSigned-off-by: Jens Axboe <axboe@kernel.dk>

5c17f45e

nvme-pci: fix DMA direction of unmapping integrity data · b8f6446b

Ming Lei authored Jul 13, 2023

DMA direction should be taken in dma_unmap_page() for unmapping integrity
data.

Fix this DMA direction, and reported in Guangwu's test.
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Fixes: 4aedb705 ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

b8f6446b

nvme: don't reject probe due to duplicate IDs for single-ported PCIe devices · ac522fc6

Christoph Hellwig authored Jul 13, 2023

While duplicate IDs are still very harmful, including the potential to easily
see changing devices in /dev/disk/by-id, it turn out they are extremely
common for cheap end user NVMe devices.

Relax our check for them for so that it doesn't reject the probe on
single-ported PCIe devices, but prints a big warning instead.  In doubt
we'd still like to see quirk entries to disable the potential for
changing supposed stable device identifier links, but this will at least
allow users how have two (or more) of these devices to use them without
having to manually add a new PCI ID entry with the quirk through sysfs or
by patching the kernel.

Fixes: 2079f41e ("nvme: check that EUI/GUID/UUID are globally unique")
Cc: stable@vger.kernel.org # 6.0+
Co-developed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

ac522fc6

12 Jul, 2023 6 commits

block/mq-deadline: Fix a bug in deadline_from_pos() · f673b4f5

Bart Van Assche authored Jul 12, 2023

A bug was introduced in deadline_from_pos() while implementing the
suggestion to use round_down() in the following code:

pos -= bdev_offset_from_zone_start(rq->q->disk->part0, pos);

This patch makes deadline_from_pos() use round_down() such that 'pos' is
rounded down.
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/all/5zthzi3lppvcdp4nemum6qck4gpqbdhvgy4k3qwguhgzxc4quj@amulvgycq67h/
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Fixes: 0effb390 ("block: mq-deadline: Handle requeued requests correctly")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230712173344.2994513-1-bvanassche@acm.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

f673b4f5

nvme: ensure disabling pairs with unquiesce · 71a5bb15

Keith Busch authored Jul 12, 2023

If any error handling that disables the controller fails to queue the
reset work, like if the state changed to disconnected inbetween, then
the failed teardown needs to unquiesce the queues since it's no longer
paired with reset_work. Just make sure that the controller can be put
into a resetting state prior to starting the disable so that no other
handling can change the queue states while recovery is happening.
Reported-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

71a5bb15

nvme-fc: fix race between error recovery and creating association · ee6fdc50

Michael Liang authored Jul 07, 2023

There is a small race window between nvme-fc association creation and error
recovery. Fix this race condition by protecting accessing to controller
state and ASSOC_FAILED flag under nvme-fc controller lock.
Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Caleb Sander <csander@purestorage.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

ee6fdc50

nvme-fc: return non-zero status code when fails to create association · 60e445bd

Michael Liang authored Jul 07, 2023

Return non-zero status code(-EIO) when needed, so re-connecting or
deleting controller will be triggered properly.
Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Caleb Sander <csander@purestorage.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

60e445bd

nvme: fix parameter check in nvme_fault_inject_init() · 733ba346

Minjie Du authored Jul 12, 2023

Make IS_ERR() judge the debugfs_create_dir() function return.
Signed-off-by: Minjie Du <duminjie@vivo.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

733ba346

nvme: warn only once for legacy uuid attribute · b718ae83

Keith Busch authored Jul 12, 2023

Report the legacy fallback behavior for uuid attributes just once
instead of logging repeated warnings for the same condition every time
the attribute is read. The old behavior is too spamy on the kernel logs.

Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reported-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

b718ae83

10 Jul, 2023 4 commits

block: remove dead struc request->completion_data field · dc8cbb65

Jens Axboe authored Jul 10, 2023

It's no longer used. While in there, also update the comment as to why
it can coexist with the rb_node.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dc8cbb65

nvme: fix the NVME_ID_NS_NVM_STS_MASK definition · b938e660

Ankit Kumar authored Jun 23, 2023

As per NVMe command set specification 1.0c Storage tag size is 7 bits.

Fixes: 4020aad8 ("nvme: add support for enhanced metadata")
Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

b938e660

nvmet: use PAGE_SECTORS_SHIFT · 9bcf156f

Damien Le Moal authored Jul 07, 2023

Replace occurences of the pattern "PAGE_SHIFT - 9" in the passthru and
loop targets with PAGE_SECTORS_SHIFT.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

9bcf156f

nvme: add BOGUS_NID quirk for Samsung SM953 · e5bb0988

Pankaj Raghav authored Jul 04, 2023

Add the quirk as SM953 is reporting bogus namespace ID.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217593Reported-by: Clemens Springsguth <cspringsguth@gmail.com>
Tested-by: Clemens Springsguth <cspringsguth@gmail.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

e5bb0988

05 Jul, 2023 3 commits

blk-crypto: use dynamic lock class for blk_crypto_profile::lock · 2fb48d88

Eric Biggers authored Jun 09, 2023

When a device-mapper device is passing through the inline encryption
support of an underlying device, calls to blk_crypto_evict_key() take
the blk_crypto_profile::lock of the device-mapper device, then take the
blk_crypto_profile::lock of the underlying device (nested).  This isn't
a real deadlock, but it causes a lockdep report because there is only
one lock class for all instances of this lock.

Lockdep subclasses don't really work here because the hierarchy of block
devices is dynamic and could have more than 2 levels.

Instead, register a dynamic lock class for each blk_crypto_profile, and
associate that with the lock.

This avoids false-positive lockdep reports like the following:

    ============================================
    WARNING: possible recursive locking detected
    6.4.0-rc5 #2 Not tainted
    --------------------------------------------
    fscryptctl/1421 is trying to acquire lock:
    ffffff80829ca418 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0x44/0x1c0

                   but task is already holding lock:
    ffffff8086b68ca8 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0xc8/0x1c0

                   other info that might help us debug this:
     Possible unsafe locking scenario:

           CPU0
           ----
      lock(&profile->lock);
      lock(&profile->lock);

                    *** DEADLOCK ***

     May be due to missing lock nesting notation

Fixes: 1b262839 ("block: Keyslot Manager for Inline Encryption")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230610061139.212085-1-ebiggers@kernel.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

2fb48d88

block/partition: fix signedness issue for Amiga partitions · 7eb1e476

Michael Schmitz authored Jul 05, 2023

Making 'blk' sector_t (i.e. 64 bit if LBD support is active) fails the
'blk>0' test in the partition block loop if a value of (signed int) -1 is
used to mark the end of the partition block list.

Explicitly cast 'blk' to signed int to allow use of -1 to terminate the
partition block linked list.

Fixes: b6f3f28f ("block: add overflow checks for Amiga partition support")
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Link: https://lore.kernel.org/r/024ce4fa-cc6d-50a2-9aae-3701d0ebf668@xenosoft.deSigned-off-by: Michael Schmitz <schmitzmic@gmail.com>
Reviewed-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7eb1e476

gup: make the stack expansion warning a bit more targeted · 6cd06ab1

Linus Torvalds authored Jul 05, 2023

I added a warning about about GUP no longer expanding the stack in
commit a425ac53 ("gup: add warning if some caller would seem to want
stack expansion"), but didn't really expect anybody to hit it.

And it's true that nobody seems to have hit a _real_ case yet, but we
certainly have a number of reports of false positives.  Which not only
causes extra noise in itself, but might also end up hiding any real
cases if they do exist.

So let's tighten up the warning condition, and replace the simplistic

	vma = find_vma(mm, start);
	if (vma && (start < vma->vm_start)) {
		WARN_ON_ONCE(vma->vm_flags & VM_GROWSDOWN);

with a

	vma = gup_vma_lookup(mm, start);

helper function which works otherwise like just "vma_lookup()", but with
some heuristics for when to warn about gup no longer causing stack
expansion.

In particular, don't just warn for "below the stack", but warn if it's
_just_ below the stack (with "just below" arbitrarily defined as 64kB,
because why not?).  And rate-limit it to at most once per hour, which
means that any false positives shouldn't completely hide subsequent
reports, but we won't be flooding the logs about it either.

The previous code triggered when some GUP user (chromium crashpad)
accessing past the end of the previous vma, for example.  That has never
expanded the stack, it just causes GUP to return early, and as such we
shouldn't be warning about it.

This is still going trigger the randomized testers, but to mitigate the
noise from that, use "dump_stack()" instead of "WARN_ON_ONCE()" to get
the kernel call chain.  We'll get the relevant information, but syzbot
shouldn't get too upset about it.

Also, don't even bother with the GROWSUP case, which would be using
different heuristics entirely, but only happens on parisc.
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: John Hubbard <jhubbard@nvidia.com>
Reported-by: syzbot+6cf44e127903fdf9d929@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6cd06ab1

04 Jul, 2023 9 commits

Revert ".gitignore: ignore *.cover and *.mbx" · d5280145

Linus Torvalds authored Jul 04, 2023

This reverts commit 534066a9.

It's actively detrimental in that it hides files that shouldn't be
hidden.

If I have some b4 mbx file in my git directory, it either was already
applied with "git am" and is now stale, or maybe it's waiting for that
to happen.  In neither case is "ignore it" the right option.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d5280145

Merge tag 'core_guards_for_6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue · 04f2933d

Linus Torvalds authored Jul 04, 2023

Pull scope-based resource management infrastructure from Peter Zijlstra:
 "These are the first few patches in the Scope-based Resource Management
  series that introduce the infrastructure but not any conversions as of
  yet.

  Adding the infrastructure now allows multiple people to start using
  them.

  Of note is that Sparse will need some work since it doesn't yet
  understand this attribute and might have decl-after-stmt issues"

* tag 'core_guards_for_6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue:
  kbuild: Drop -Wdeclaration-after-statement
  locking: Introduce __cleanup() based infrastructure
  apparmor: Free up __cleanup() name
  dmaengine: ioat: Free up __cleanup() name

04f2933d

afs: Fix accidental truncation when storing data · 03275585

David Howells authored Jul 04, 2023

When an AFS FS.StoreData RPC call is made, amongst other things it is
given the resultant file size to be.  On the server, this is processed
by truncating the file to new size and then writing the data.

Now, kafs has a lock (vnode->io_lock) that serves to serialise
operations against a specific vnode (ie.  inode), but the parameters for
the op are set before the lock is taken.  This allows two writebacks
(say sync and kswapd) to race - and if writes are ongoing the writeback
for a later write could occur before the writeback for an earlier one if
the latter gets interrupted.

Note that afs_writepages() cannot take i_mutex and only takes a shared
lock on vnode->validate_lock.

Also note that the server does the truncation and the write inside a
lock, so there's no problem at that end.

Fix this by moving the calculation for the proposed new i_size inside
the vnode->io_lock.  Also reset the iterator (which we might have read
from) and update the mtime setting there.

Fixes: bd80d8a8 ("afs: Use ITER_XARRAY for writing")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/3526895.1687960024@warthog.procyon.org.uk/Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

03275585

Merge tag 'ovl-update-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs · 538140ca

Linus Torvalds authored Jul 04, 2023

Pull more overlayfs updates from Amir Goldstein:
 "This is a small 'move code around' followup by Christian to his work
  on porting overlayfs to the new mount api for 6.5. It makes things a
  bit cleaner and simpler for the next development cycle when I hand
  overlayfs back over to Miklos"

* tag 'ovl-update-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
  ovl: move all parameter handling into params.{c,h}

538140ca

Merge tag 'gfs2-v6.4-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 · 94c76955

Linus Torvalds authored Jul 04, 2023

Pull gfs2 updates from Andreas Gruenbacher:

 - Move the freeze/thaw logic from glock callback context to process /
   worker thread context to prevent deadlocks

 - Fix a quota reference couting bug in do_qc()

 - Carry on deallocating inodes even when gfs2_rindex_update() fails

 - Retry filesystem-internal reads when they are interruped by a signal

 - Eliminate kmap_atomic() in favor of kmap_local_page() /
   memcpy_{from,to}_page()

 - Get rid of noop_direct_IO

 - And a few more minor fixes and cleanups

* tag 'gfs2-v6.4-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (23 commits)
  gfs2: Add quota_change type
  gfs2: Use memcpy_{from,to}_page where appropriate
  gfs2: Convert remaining kmap_atomic calls to kmap_local_page
  gfs2: Replace deprecated kmap_atomic with kmap_local_page
  gfs: Get rid of unnucessary locking in inode_go_dump
  gfs2: gfs2_freeze_lock_shared cleanup
  gfs2: Replace sd_freeze_state with SDF_FROZEN flag
  gfs2: Rework freeze / thaw logic
  gfs2: Rename SDF_{FS_FROZEN => FREEZE_INITIATOR}
  gfs2: Reconfiguring frozen filesystem already rejected
  gfs2: Rename gfs2_freeze_lock{ => _shared }
  gfs2: Rename the {freeze,thaw}_super callbacks
  gfs2: Rename remaining "transaction" glock references
  gfs2: retry interrupted internal reads
  gfs2: Fix possible data races in gfs2_show_options()
  gfs2: Fix duplicate should_fault_in_pages() call
  gfs2: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
  gfs2: Don't remember delete unless it's successful
  gfs2: Update rl_unlinked before releasing rgrp lock
  gfs2: Fix gfs2_qa_get imbalance in gfs2_quota_hold
  ...

94c76955

Merge tag 'pm-6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · ccf46d85

Linus Torvalds authored Jul 04, 2023

Pull more power management updates from Rafael Wysocki:
 "These add support for new hardware (ap807 and AM62A7), fix several
  issues in cpufreq drivers and in the operating performance points
  (OPP) framework, fix up intel_idle after recent changes and add
  documentation.

  Specifics:

   - Add missing __init annotation to one function in the intel_idle
     drvier (Rafael Wysocki)

   - Make intel_pstate use a correct scaling factor when mapping HWP
     performance levels to frequency values on hybrid-capable systems
     with disabled E-cores (Srinivas Pandruvada)

   - Fix Kconfig dependencies of the cpufreq-dt-platform driver (Viresh
     Kumar)

   - Add support to build cpufreq-dt-platdev as a module (Zhipeng Wang)

   - Don't allocate Sparc's cpufreq_driver dynamically (Viresh Kumar)

   - Add support for TI's AM62A7 platform (Vibhore Vardhan)

   - Add support for Armada's ap807 platform (Russell King (Oracle))

   - Add support for StarFive JH7110 SoC (Mason Huo)

   - Fix voltage selection for Mediatek Socs (Daniel Golle)

   - Fix error handling in Tegra's cpufreq driver (Christophe JAILLET)

   - Document Qualcomm's IPQ8074 in DT bindings (Robert Marko)

   - Don't warn for disabling a non-existing frequency for imx6q cpufreq
     driver (Christoph Niedermaier)

   - Use dev_err_probe() in Qualcomm's cpufreq driver (Andrew Halaney)

   - Simplify performance state related logic in the OPP core (Viresh
     Kumar)

   - Fix use-after-free and improve locking around lazy_opp_tables
     (Viresh Kumar, Stephan Gerhold)

   - Minor cleanups - using dev_err_probe() and rate-limiting debug
     messages (Andrew Halaney, Adrián Larumbe)"

* tag 'pm-6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
  cpufreq: intel_pstate: Fix scaling for hybrid-capable systems with disabled E-cores
  cpufreq: Make CONFIG_CPUFREQ_DT_PLATDEV depend on OF
  intel_idle: Add __init annotation to matchup_vm_state_with_baremetal()
  OPP: Properly propagate error along when failing to get icc_path
  OPP: Use dev_err_probe() when failing to get icc_path
  cpufreq: qcom-cpufreq-hw: Use dev_err_probe() when failing to get icc paths
  cpufreq: mediatek: correct voltages for MT7622 and MT7623
  cpufreq: armada-8k: add ap807 support
  OPP: Simplify the over-designed pstate <-> level dance
  OPP: pstate is only valid for genpd OPP tables
  OPP: don't drop performance constraint on OPP table removal
  OPP: Protect `lazy_opp_tables` list with `opp_table_lock`
  OPP: Staticize `lazy_opp_tables` in of.c
  cpufreq: dt-platdev: Support building as module
  opp: Fix use-after-free in lazy_opp_tables after probe deferral
  dt-bindings: cpufreq: qcom-cpufreq-nvmem: document IPQ8074
  cpufreq: dt-platdev: Blacklist ti,am62a7 SoC
  cpufreq: ti-cpufreq: Add support for AM62A7
  OPP: rate-limit debug messages when no change in OPP is required
  cpufreq: imx6q: don't warn for disabling a non-existing frequency
  ...

ccf46d85

Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · b869e9f4

Linus Torvalds authored Jul 04, 2023

Pull more clk updates from Stephen Boyd:
 "Another set of clk driver updates and fixes for the merge window. The
  driver updates needed more time to bake in linux-next.

  Updates:
   - Support for more clk controllers in Qualcomm SoCs such as SM8350,
     SM8450, SDX75, SC8280XP, and IPQ9574
   - Runtime PM enablement of some more Qualcomm clk controllers
   - Various fixes to Qualcomm clk driver data to use correct clk_ops
     and to check halt bits properly
   - AT91 updates to modernize with clk_parent_data structures

  Fixes:
   - Remove 'syscon' from dt binding fix for ti,j721e-system-controller
   - Fix determine rate in the Tegra driver that got wrecked by the
     refactorting of muxes this merge window"

* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (69 commits)
  clk: tegra: Avoid calling an uninitialized function
  dt-bindings: mfd: ti,j721e-system-controller: Remove syscon from example
  clk: at91: sama7g5: s/ep_chg_chg_id/ep_chg_id
  clk: at91: sama7g5: switch to parent_hw and parent_data
  clk: at91: sckc: switch to parent_data/parent_hw
  clk: at91: clk-sam9x60-pll: add support for parent_hw
  clk: at91: clk-utmi: add support for parent_hw
  clk: at91: clk-system: add support for parent_hw
  clk: at91: clk-programmable: add support for parent_hw
  clk: at91: clk-peripheral: add support for parent_hw
  clk: at91: clk-master: add support for parent_hw
  clk: at91: clk-generated: add support for parent_hw
  clk: at91: clk-main: add support for parent_data/parent_hw
  clk: qcom: gcc-sc8280xp: Add runtime PM
  clk: qcom: gpucc-sc8280xp: Add runtime PM
  clk: qcom: mmcc-msm8974: fix MDSS_GDSC power flags
  clk: qcom: gpucc-sm6375: Enable runtime pm
  dt-bindings: clock: sm6375-gpucc: Add VDD_GX
  clk: qcom: gcc-sm6115: Add missing PLL config properties
  clk: qcom: clk-alpha-pll: Add a way to update some bits of test_ctl(_hi)
  ...

b869e9f4

Merge tag 'firewire-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394 · 406fb9eb

Linus Torvalds authored Jul 04, 2023

Pull firewire updates from Takashi Sakamoto:
 "This consist of three parts; UAPI update, OHCI driver update, and
  several bug fixes.

  Firstly, the 1394 OHCI specification defines method to retrieve
  hardware time stamps for asynchronous communication, which was
  previously unavailable in user space. This adds new events to the
  UAPI, allowing applications to retrieve the time when asynchronous
  packet are received and sent. The new events are tested in the
  bleeding edge of libhinawa and look to work well. The new version of
  libhinawa will be released after current merge window is closed:

    https://git.kernel.org/pub/scm/libs/ieee1394/libhinawa.git/

  Secondly, the FireWire stack includes a PCM device driver for 1394
  OHCI hardware, This change modernizes the driver by managed resource
  (devres) framework.

  Lastly, bug fixes for firewire-net and firewire-core"

* tag 'firewire-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394: (25 commits)
  firewire: net: fix use after free in fwnet_finish_incoming_packet()
  firewire: core: obsolete usage of GFP_ATOMIC at building node tree
  firewire: ohci: release buffer for AR req/resp contexts when managed resource is released
  firewire: ohci: use devres for content of configuration ROM
  firewire: ohci: use devres for IT, IR, AT/receive, and AT/request contexts
  firewire: ohci: use devres for list of isochronous contexts
  firewire: ohci: use devres for requested IRQ
  firewire: ohci: use devres for misc DMA buffer
  firewire: ohci: use devres for MMIO region mapping
  firewire: ohci: use devres for PCI-related resources
  firewire: ohci: use devres for memory object of ohci structure
  firewire: fix warnings to generate UAPI documentation
  firewire: fix build failure due to missing module license
  firewire: cdev: implement new event relevant to phy packet with time stamp
  firewire: cdev: add new event to notify phy packet with time stamp
  firewire: cdev: code refactoring to dispatch event for phy packet
  firewire: cdev: implement new event to notify response subaction with time stamp
  firewire: cdev: add new event to notify response subaction with time stamp
  firewire: cdev: code refactoring to operate event of response
  firewire: core: implement variations to send request and wait for response with time stamp
  ...

406fb9eb

module: fix init_module_from_file() error handling · f1962207

Linus Torvalds authored Jul 04, 2023

Vegard Nossum pointed out two different problems with the error handling
in init_module_from_file():

 (a) the idempotent loading code didn't clean up properly in some error
     cases, leaving the on-stack 'struct idempotent' element still in
     the hash table

 (b) failure to read the module file would nonsensically update the
     'invalid_kread_bytes' stat counter with the error value

The first error is quite nasty, in that it can then cause subsequent
idempotent loads of that same file to access stale stack contents of the
previous failure.  The case may not happen in any normal situation
(explaining all the "Tested-by's on the original change), and requires
admin privileges, but syzkaller triggers random bad behavior as a
result:

    BUG: soft lockup in sys_finit_module
    BUG: unable to handle kernel paging request in init_module_from_file
    general protection fault in init_module_from_file
    INFO: task hung in init_module_from_file
    KASAN: out-of-bounds Read in init_module_from_file
    KASAN: slab-out-of-bounds Read in init_module_from_file
    ...

The second error is fairly benign and just leads to nonsensical stats
(and has been around since the debug stats were added).

Vegard also provided a patch for the idempotent loading issue, but I'd
rather re-organize the code and make it more legible using another level
of helper functions than add the usual "goto out" error handling.

Link: https://lore.kernel.org/lkml/20230704100852.23452-1-vegard.nossum@oracle.com/
Fixes: 9b9879fc ("modules: catch concurrent module loads, treat them as idempotent")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Reported-by: syzbot+9c2bdc9d24e4a7abe741@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f1962207