Commits · 9c2adfa6ba134efaf0bdeb15f76f99e9bcffb903 · Kirill Smelkov / linux

20 Aug, 2021 3 commits

dm ima: prefix ima event name related to device mapper with dm_ · 9c2adfa6

Tushar Sugandhi authored Aug 13, 2021

The event names for the DM events recorded in the ima log do not contain
any information to indicate the events are part of the DM devices/targets.

Prefix the event names for DM events with "dm_" to indicate that they
are part of device-mapper.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Suggested-by: Thore Sommer <public@thson.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

9c2adfa6

dm ima: add version info to dm related events in ima log · dc7b79cc

Tushar Sugandhi authored Aug 13, 2021

The DM events present in the ima log contain various attributes in the
key=value format.  The attributes' names/values may change in future,
and new attributes may also get added.  The attestation server needs
some versioning to determine which attributes are supported and are
expected in the ima log.

Add version information to the DM events present in the ima log to
help attestation servers to correctly process the attributes across
different versions.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Suggested-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dc7b79cc

dm ima: prefix dm table hashes in ima log with hash algorithm · 8f509fd4

Tushar Sugandhi authored Aug 13, 2021

The active/inactive table hashes measured in the ima log do not contain
the information about hash algorithm.  This information is useful for the
attestation servers to recreate the hashes and compare them with the ones
present in the ima log to verify the table contents.

Prefix the table hashes in various DM events in ima log with the hash
algorithm used to compute those hashes.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Suggested-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

8f509fd4

18 Aug, 2021 1 commit

dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc() · 528b16bf

Arne Welzel authored Aug 14, 2021

On systems with many cores using dm-crypt, heavy spinlock contention in
percpu_counter_compare() can be observed when the page allocation limit
for a given device is reached or close to be reached. This is due
to percpu_counter_compare() taking a spinlock to compute an exact
result on potentially many CPUs at the same time.

Switch to non-exact comparison of allocated and allowed pages by using
the value returned by percpu_counter_read_positive() to avoid taking
the percpu_counter spinlock.

This may over/under estimate the actual number of allocated pages by at
most (batch-1) * num_online_cpus().

Currently, batch is bounded by 32. The system on which this issue was
first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt
allocations, this seems an acceptable error. Certainly preferred over
running into the spinlock contention.

This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
and 192GB RAM as follows, but can be provoked on systems with less CPUs
as well.

 * Disable swap
 * Tune vm settings to promote regular writeback
     $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
     $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
     $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes

 * Create 8 dmcrypt devices based on files on a tmpfs
 * Create and mount an ext4 filesystem on each crypt devices
 * Run stress-ng --hdd 8 within one of above filesystems

Total %system usage collected from sysstat goes to ~35%. Write throughput
on the underlying loop device is ~2GB/s. perf profiling an individual
kworker kcryptd thread shows the following profile, indicating spinlock
contention in percpu_counter_compare():

    99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
      |
      --ret_from_fork
        kthread
        worker_thread
        |
         --99.92%--process_one_work
            |
            |--80.52%--kcryptd_crypt
            |    |
            |    |--62.58%--mempool_alloc
            |    |  |
            |    |   --62.24%--crypt_page_alloc
            |    |     |
            |    |      --61.51%--__percpu_counter_compare
            |    |        |
            |    |         --61.34%--__percpu_counter_sum
            |    |           |
            |    |           |--58.68%--_raw_spin_lock_irqsave
            |    |           |  |
            |    |           |   --58.30%--native_queued_spin_lock_slowpath
            |    |           |
            |    |            --0.69%--cpumask_next
            |    |                |
            |    |                 --0.51%--_find_next_bit
            |    |
            |    |--10.61%--crypt_convert
            |    |          |
            |    |          |--6.05%--xts_crypt
            ...

After applying this patch and running the same test, %system usage is
lowered to ~7% and write throughput on the loop device increases
to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62%
in the profile and not hitting the percpu_counter() spinlock anymore.

    |--8.15%--mempool_alloc
    |    |
    |    |--3.93%--crypt_page_alloc
    |    |    |
    |    |     --3.75%--__alloc_pages
    |    |         |
    |    |          --3.62%--get_page_from_freelist
    |    |              |
    |    |               --3.22%--rmqueue_bulk
    |    |                   |
    |    |                    --2.59%--_raw_spin_lock
    |    |                      |
    |    |                       --2.57%--native_queued_spin_lock_slowpath
    |    |
    |     --3.05%--_raw_spin_lock_irqsave
    |               |
    |                --2.49%--native_queued_spin_lock_slowpath
Suggested-by: DJ Gregor <dj@corelight.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
Fixes: 5059353d ("dm crypt: limit the number of allocated pages")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

528b16bf

10 Aug, 2021 13 commits

dm: add documentation for IMA measurement support · 00d43995

Tushar Sugandhi authored Jul 12, 2021

To interpret various DM target measurement data in IMA logs,
a separate documentation page is needed under
Documentation/admin-guide/device-mapper.

Add documentation to help system administrators and attestation
client/server component owners to interpret the measurement
data generated by various DM targets, on various device/table state
changes.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

00d43995

dm: update target status functions to support IMA measurement · 8ec45662

Tushar Sugandhi authored Jul 12, 2021

For device mapper targets to take advantage of IMA's measurement
capabilities, the status functions for the individual targets need to be
updated to handle the status_type_t case for value STATUSTYPE_IMA.

Update status functions for the following target types, to log their
respective attributes to be measured using IMA.
 01. cache
 02. crypt
 03. integrity
 04. linear
 05. mirror
 06. multipath
 07. raid
 08. snapshot
 09. striped
 10. verity

For rest of the targets, handle the STATUSTYPE_IMA case by setting the
measurement buffer to NULL.

For IMA to measure the data on a given system, the IMA policy on the
system needs to be updated to have the following line, and the system
needs to be restarted for the measurements to take effect.

/etc/ima/ima-policy
 measure func=CRITICAL_DATA label=device-mapper template=ima-buf

The measurements will be reflected in the IMA logs, which are located at:

/sys/kernel/security/integrity/ima/ascii_runtime_measurements
/sys/kernel/security/integrity/ima/binary_runtime_measurements

These IMA logs can later be consumed by various attestation clients
running on the system, and send them to external services for attesting
the system.

The DM target data measured by IMA subsystem can alternatively
be queried from userspace by setting DM_IMA_MEASUREMENT_FLAG with
DM_TABLE_STATUS_CMD.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

8ec45662

dm ima: measure data on device rename · 7d1d1df8

Tushar Sugandhi authored Jul 12, 2021

A given block device is identified by it's name and UUID.  However, both
these parameters can be renamed.  For an external attestation service to
correctly attest a given device, it needs to keep track of these rename
events.

Update the device data with the new values for IMA measurements.  Measure
both old and new device name/UUID parameters in the same IMA measurement
event, so that the old and the new values can be connected later.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

7d1d1df8

dm ima: measure data on table clear · 99169b93

Tushar Sugandhi authored Jul 12, 2021

For a given block device, an inactive table slot contains the parameters
to configure the device with.  The inactive table can be cleared
multiple times, accidentally or maliciously, which may impact the
functionality of the device, and compromise the system.  Therefore it is
important to measure and log the event when a table is cleared.

Measure device parameters, and table hashes when the inactive table slot
is cleared.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

99169b93

dm ima: measure data on device remove · 84010e51

Tushar Sugandhi authored Jul 12, 2021

Presence of an active block-device, configured with expected parameters,
is important for an external attestation service to determine if a system
meets the attestation requirements.  Therefore it is important for DM to
measure the device remove events.

Measure device parameters and table hashes when the device is removed,
using either remove or remove_all.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

84010e51

dm ima: measure data on device resume · 8eb6fab4

Tushar Sugandhi authored Jul 12, 2021

A given block device can load a table multiple times, with different
input parameters, before eventually resuming it.  Further, a device may
be suspended and then resumed.  The device may never resume after a
table-load.  Because of the above valid scenarios for a given device,
it is important to measure and log the device resume event using IMA.

Also, if the table is large, measuring it in clear-text each time the
device changes state, will unnecessarily increase the size of IMA log.
Since the table clear-text is already measured during table-load event,
measuring the hash during resume should be sufficient to validate the
table contents.

Measure the device parameters, and hash of the active table, when the
device is resumed.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

8eb6fab4

dm ima: measure data on table load · 91ccbbac

Tushar Sugandhi authored Jul 12, 2021

DM configures a block device with various target specific attributes
passed to it as a table. DM loads the table, and calls each target’s
respective constructors with the attributes as input parameters.
Some of these attributes are critical to ensure the device meets
certain security bar. Thus, IMA should measure these attributes, to
ensure they are not tampered with, during the lifetime of the device.
So that the external services can have high confidence in the
configuration of the block-devices on a given system.

Some devices may have large tables. And a given device may change its
state (table-load, suspend, resume, rename, remove, table-clear etc.)
many times. Measuring these attributes each time when the device
changes its state will significantly increase the size of the IMA logs.
Further, once configured, these attributes are not expected to change
unless a new table is loaded, or a device is removed and recreated.
Therefore the clear-text of the attributes should only be measured
during table load, and the hash of the active/inactive table should be
measured for the remaining device state changes.

Export IMA function ima_measure_critical_data() to allow measurement
of DM device parameters, as well as target specific attributes, during
table load. Compute the hash of the inactive table and store it for
measurements during future state change. If a load is called multiple
times, update the inactive table hash with the hash of the latest
populated table. So that the correct inactive table hash is measured
when the device transitions to different states like resume, remove,
rename, etc.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # leak fix
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

91ccbbac

dm writecache: add event counters · e3a35d03

Mikulas Patocka authored Jul 27, 2021

Add 10 counters for various events (hit, miss, etc) and export them in
the status line (accessed from userspace with "dmsetup status"). Also
add a message "clear_stats" that resets these counters.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

e3a35d03

dm writecache: report invalid return from writecache_map helpers · df699cc1

Mikulas Patocka authored Jul 27, 2021

If some "writecache_map_*" function returns invalid state, it is a bug.
So, we should report it and not fail silently.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

df699cc1

dm writecache: further writecache_map() cleanup · 15cb6f39

Mike Snitzer authored Jul 12, 2021

Factor out writecache_map_flush() and writecache_map_discard() from
writecache_map(). Also eliminate the various goto labels in
writecache_map().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

15cb6f39

dm writecache: factor out writecache_map_remap_origin() · 4d020b3a
Mike Snitzer authored Jul 12, 2021
```
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
```
4d020b3a

dm writecache: split up writecache_map() to improve code readability · cdd4d783

Mike Snitzer authored Jul 12, 2021

writecache_map() has grown too large and can be confusing to read given
all the goto statements.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

cdd4d783

writeback: make the laptop_mode prototypes available unconditionally · 99d26de2

Christoph Hellwig authored Aug 10, 2021

Fix the !CONFIG_BLOCK build after the recent cleanup.

Fixes: 5ed964f8 ("mm: hide laptop_mode_wb_timer entirely behind the BDI API")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

99d26de2

09 Aug, 2021 14 commits

block: return ELEVATOR_DISCARD_MERGE if possible · 866663b7

Ming Lei authored Jul 29, 2021

When merging one bio to request, if they are discard IO and the queue
supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE
because both block core and related drivers(nvme, virtio-blk) doesn't
handle mixed discard io merge(traditional IO merge together with
discard merge) well.

Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation,
so both blk-mq and drivers just need to handle multi-range discard.
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Fixes: 2705dfb2 ("block: fix discard request merge")
Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

866663b7

block: remove the bd_bdi in struct block_device · a11d7fc2

Christoph Hellwig authored Aug 09, 2021

Just retrieve the bdi from the disk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-6-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

a11d7fc2

block: move the bdi from the request_queue to the gendisk · edb0872f

Christoph Hellwig authored Aug 09, 2021

The backing device information only makes sense for file system I/O,
and thus belongs into the gendisk and not the lower level request_queue
structure. Move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

edb0872f

block: add a queue_has_disk helper · 1008162b

Christoph Hellwig authored Aug 09, 2021

Add a helper to check if a gendisk is associated with a request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-4-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

1008162b

block: pass a gendisk to blk_queue_update_readahead · 471aa704

Christoph Hellwig authored Aug 09, 2021

.. and rename the function to disk_update_readahead.  This is in
preparation for moving the BDI from the request_queue to the gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-3-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

471aa704

mm: hide laptop_mode_wb_timer entirely behind the BDI API · 5ed964f8

Christoph Hellwig authored Aug 09, 2021

Don't leak the detaіls of the timer into the block layer, instead
initialize the timer in bdi_alloc and delete it in bdi_unregister.
Note that this means the timer is initialized (but not armed) for
non-block queues as well now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-2-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

5ed964f8

block: remove support for delayed queue registrations · d1254a87

Christoph Hellwig authored Aug 04, 2021

Now that device mapper has been changed to register the disk once
it is fully ready all this code is unused.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-9-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

d1254a87

dm: delay registering the gendisk · 89f871af

Christoph Hellwig authored Aug 04, 2021

device mapper is currently the only outlier that tries to call
register_disk after add_disk, leading to fairly inconsistent state
of these block layer data structures.  Instead change device-mapper
to just register the gendisk later now that the holder mechanism
can cope with that.

Note that this introduces a user visible change: the dm kobject is
now only visible after the initial table has been loaded.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-8-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

89f871af

dm: move setting md->type into dm_setup_md_queue · ba305859

Christoph Hellwig authored Aug 04, 2021

Move setting md->type from both callers into dm_setup_md_queue.
This ensures that md->type is only set to a valid value after the queue
has been fully setup, something we'll rely on future changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-7-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

ba305859

dm: cleanup cleanup_mapped_device · 74a2b6ec

Christoph Hellwig authored Aug 04, 2021

md->queue is now always set when md->disk is set, so simplify the
conditionals a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-6-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

74a2b6ec

block: support delayed holder registration · d6263387

Christoph Hellwig authored Aug 04, 2021

device mapper needs to register holders before it is ready to do I/O.
Currently it does so by registering the disk early, which can leave
the disk and queue in a weird half state where the queue is registered
with the disk, except for sysfs and the elevator. And this state has
been a bit promlematic before, and will get more so when sorting out
the responsibilities between the queue and the disk.

Support registering holders on an initialized but not registered disk
instead by delaying the sysfs registration until the disk is registered.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-5-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

d6263387

block: look up holders by bdev · 0dbcfe24

Christoph Hellwig authored Aug 04, 2021

Invert they way the holder relations are tracked. This very
slightly reduces the memory overhead for partitioned devices.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210804094147.459763-4-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

0dbcfe24

block: remove the extra kobject reference in bd_link_disk_holder · fbd9a395

Christoph Hellwig authored Aug 04, 2021

Since commit 0d02129e ("block: merge struct block_device and struct
hd_struct") there is no way for the bdev to go away as long as there is
a holder, so remove the extra references.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-3-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

fbd9a395

block: make the block holder code optional · c66fd019

Christoph Hellwig authored Aug 04, 2021

Move the block holder code into a separate file as it is not in any way
related to the other block_dev.c code, and add a new selectable config
option for it so that we don't have to build it without any remapped
drivers selected.

The Kconfig symbol contains a _DEPRECATED suffix to match the comments
added in commit 49731baa
("block: restore multiple bd_link_disk_holder() support").
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-2-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

c66fd019

05 Aug, 2021 2 commits

loop: Select I/O scheduler 'none' from inside add_disk() · 2112f5c1

Bart Van Assche authored Aug 05, 2021

We noticed that the user interface of Android devices becomes very slow
under memory pressure. This is because Android uses the zram driver on top
of the loop driver for swapping, because under memory pressure the swap
code alternates reads and writes quickly, because mq-deadline is the
default scheduler for loop devices and because mq-deadline delays writes by
five seconds for such a workload with default settings. Fix this by making
the kernel select I/O scheduler 'none' from inside add_disk() for loop
devices. This default can be overridden at any time from user space,
e.g. via a udev rule. This approach has an advantage compared to changing
the I/O scheduler from userspace from 'mq-deadline' into 'none', namely
that synchronize_rcu() does not get called.

This patch changes the default I/O scheduler for loop devices from
'mq-deadline' into 'none'.

Additionally, this patch reduces the Android boot time on my test setup
with 0.5 seconds compared to configuring the loop I/O scheduler from user
space.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Martijn Coenen <maco@android.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210805174200.3250718-3-bvanassche@acm.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

2112f5c1

blk-mq: Introduce the BLK_MQ_F_NO_SCHED_BY_DEFAULT flag · 90b71980

Bart Van Assche authored Aug 05, 2021

elevator_get_default() uses the following algorithm to select an I/O
scheduler from inside add_disk():
- In case of a single hardware queue or if sharing hardware queues across
  multiple request queues (BLK_MQ_F_TAG_HCTX_SHARED), use mq-deadline.
- Otherwise, use 'none'.

This is a good choice for most but not for all block drivers. Make it
possible to override the selection of mq-deadline with a new flag,
namely BLK_MQ_F_NO_SCHED_BY_DEFAULT.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Martijn Coenen <maco@android.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210805174200.3250718-2-bvanassche@acm.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

90b71980

02 Aug, 2021 7 commits

block: remove blk-mq-sysfs dead code · 2bc1f6e4

Damien Le Moal authored Jul 13, 2021

In block/blk-mq-sysfs.c, struct blk_mq_ctx_sysfs_entry is not used to
define any attribute since the "mq" sysfs directory contains only
sub-directories (no attribute files). As a result, blk_mq_sysfs_show(),
blk_mq_sysfs_store(), and struct sysfs_ops blk_mq_sysfs_ops are all
unused and unnecessary. Remove all this unused code.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210713081837.524422-1-damien.lemoal@wdc.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

2bc1f6e4

loop: raise media_change event · 9f65c489

Matteo Croce authored Jul 13, 2021

Make the loop device raise a DISK_MEDIA_CHANGE event on attach or detach.

	# udevadm monitor -up |grep -e DISK_MEDIA_CHANGE -e DEVNAME &

	# losetup -f zero
	[    7.454235] loop0: detected capacity change from 0 to 16384
	DISK_MEDIA_CHANGE=1
	DEVNAME=/dev/loop0
	DEVNAME=/dev/loop0
	DEVNAME=/dev/loop0

	# losetup -f zero
	[   10.205245] loop1: detected capacity change from 0 to 16384
	DISK_MEDIA_CHANGE=1
	DEVNAME=/dev/loop1
	DEVNAME=/dev/loop1
	DEVNAME=/dev/loop1

	# losetup -f zero2
	[   13.532368] loop2: detected capacity change from 0 to 40960
	DISK_MEDIA_CHANGE=1
	DEVNAME=/dev/loop2
	DEVNAME=/dev/loop2

	# losetup -D
	DEVNAME=/dev/loop1
	DISK_MEDIA_CHANGE=1
	DEVNAME=/dev/loop1
	DEVNAME=/dev/loop2
	DISK_MEDIA_CHANGE=1
	DEVNAME=/dev/loop2
	DEVNAME=/dev/loop0
	DISK_MEDIA_CHANGE=1
	DEVNAME=/dev/loop0
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-7-mcroce@linux.microsoft.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

9f65c489

block: add a helper to raise a media changed event · e6138dc1

Matteo Croce authored Jul 13, 2021

Refactor disk_check_events() and move some code into disk_event_uevent().
Then add disk_force_media_change(), a helper which will be used by
devices to force issuing a DISK_EVENT_MEDIA_CHANGE event.
Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-6-mcroce@linux.microsoft.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

e6138dc1

block: export diskseq in sysfs · 13927b31

Matteo Croce authored Jul 13, 2021

Add a new sysfs handle to export the new diskseq value.
Place it in <sysfs>/block/<disk>/diskseq and document it.

    $ grep . /sys/class/block/*/diskseq
    /sys/class/block/loop0/diskseq:13
    /sys/class/block/loop1/diskseq:14
    /sys/class/block/loop2/diskseq:5
    /sys/class/block/loop3/diskseq:6
    /sys/class/block/ram0/diskseq:1
    /sys/class/block/ram1/diskseq:2
    /sys/class/block/vda/diskseq:7
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-5-mcroce@linux.microsoft.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

13927b31

block: add ioctl to read the disk sequence number · 7957d93b

Matteo Croce authored Jul 13, 2021

Add a new BLKGETDISKSEQ ioctl which retrieves the disk sequence number
from the genhd structure.

    # ./getdiskseq /dev/loop*
    /dev/loop0:     13
    /dev/loop0p1:   13
    /dev/loop0p2:   13
    /dev/loop0p3:   13
    /dev/loop1:     14
    /dev/loop1p1:   14
    /dev/loop1p2:   14
    /dev/loop2:     5
    /dev/loop3:     6
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-4-mcroce@linux.microsoft.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

7957d93b

block: export the diskseq in uevents · 87eb7107

Matteo Croce authored Jul 13, 2021

Export the newly introduced diskseq in uevents:

    $ udevadm info /sys/class/block/* |grep -e DEVNAME -e DISKSEQ
    E: DEVNAME=/dev/loop0
    E: DISKSEQ=1
    E: DEVNAME=/dev/loop1
    E: DISKSEQ=2
    E: DEVNAME=/dev/loop2
    E: DISKSEQ=3
    E: DEVNAME=/dev/loop3
    E: DISKSEQ=4
    E: DEVNAME=/dev/loop4
    E: DISKSEQ=5
    E: DEVNAME=/dev/loop5
    E: DISKSEQ=6
    E: DEVNAME=/dev/loop6
    E: DISKSEQ=7
    E: DEVNAME=/dev/loop7
    E: DISKSEQ=8
    E: DEVNAME=/dev/nvme0n1
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p1
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p2
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p3
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p4
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p5
    E: DISKSEQ=9
    E: DEVNAME=/dev/sda
    E: DISKSEQ=10
    E: DEVNAME=/dev/sda1
    E: DISKSEQ=10
    E: DEVNAME=/dev/sda2
    E: DISKSEQ=10
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-3-mcroce@linux.microsoft.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

87eb7107

block: add disk sequence number · cf179948

Matteo Croce authored Jul 13, 2021

Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems
has a very high latency.
Block devices do not have exclusive owners in userspace, any process can
set one up (e.g. loop devices). Moreover, device names can be reused
(e.g. loop0 can be reused again and again). A userspace process setting
up a block device and watching for its events cannot thus reliably tell
whether an event relates to the device it just set up or another earlier
instance with the same name.

Being able to set a UUID on a loop device would solve the race conditions.
But it does not allow to derive orderings from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for,
you cannot tell whether it's because the right uevent has not arrived yet,
or it was already sent and you missed it. So you cannot tell whether you
should wait for it or not.

Associating a unique, monotonically increasing sequential number to the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, allows to solve the race conditions with
uevents, and also allows userspace processes to know whether they should
wait for the uevent they need or if it was dropped and thus they should
move on.

Additionally, increment the disk sequence number when the media change,
i.e. on DISK_EVENT_MEDIA_CHANGE event.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-2-mcroce@linux.microsoft.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

cf179948