Commit 7ad67ca5 authored by Linus Torvalds's avatar Linus Torvalds

Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Two NVMe pull requests:
     - ana log parse fix from Anton
     - nvme quirks support for Apple devices from Ben
     - fix missing bio completion tracing for multipath stack devices
       from Hannes and Mikhail
     - IP TOS settings for nvme rdma and tcp transports from Israel
     - rq_dma_dir cleanups from Israel
     - tracing for Get LBA Status command from Minwoo
     - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
     - Some consolidation between the fabrics transports for handling
       the CAP register
     - reset race with ns scanning fix for fabrics (move fabrics
       commands to a dedicated request queue with a different lifetime
       from the admin request queue)."
     - controller reset and namespace scan races fixes
     - nvme discovery log change uevent support
     - naming improvements from Keith
     - multiple discovery controllers reject fix from James
     - some regular cleanups from various people

 - Series fixing (and re-fixing) null_blk debug printing and nr_devices
   checks (André)

 - A few pull requests from Song, with fixes from Andy, Guoqing,
   Guilherme, Neil, Nigel, and Yufen.

 - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

 - Bio merge handling unification (Christoph)

 - Pick default elevator correctly for devices with special needs
   (Damien)

 - Block stats fixes (Hou)

 - Timeout and support devices nbd fixes (Mike)

 - Series fixing races around elevator switching and device add/remove
   (Ming)

 - sed-opal cleanups (Revanth)

 - Per device weight support for BFQ (Fam)

 - Support for blk-iocost, a new model that can properly account cost of
   IO workloads. (Tejun)

 - blk-cgroup writeback fixes (Tejun)

 - paride queue init fixes (zhengbin)

 - blk_set_runtime_active() cleanup (Stanley)

 - Block segment mapping optimizations (Bart)

 - lightnvm fixes (Hans/Minwoo/YueHaibing)

 - Various little fixes and cleanups

* tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
  null_blk: format pr_* logs with pr_fmt
  null_blk: match the type of parameter nr_devices
  null_blk: do not fail the module load with zero devices
  block: also check RQF_STATS in blk_mq_need_time_stamp()
  block: make rq sector size accessible for block stats
  bfq: Fix bfq linkage error
  raid5: use bio_end_sector in r5_next_bio
  raid5: remove STRIPE_OPS_REQ_PENDING
  md: add feature flag MD_FEATURE_RAID0_LAYOUT
  md/raid0: avoid RAID0 data corruption due to layout confusion.
  raid5: don't set STRIPE_HANDLE to stripe which is in batch list
  raid5: don't increment read_errors on EILSEQ return
  nvmet: fix a wrong error status returned in error log page
  nvme: send discovery log page change events to userspace
  nvme: add uevent variables for controller devices
  nvme: enable aen regardless of the presence of I/O queues
  nvme-fabrics: allow discovery subsystems accept a kato
  nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
  nvme: Remove redundant assignment of cq vector
  nvme: Assign subsys instance from first ctrl
  ...
parents 5260c2b8 9c7eddf1
......@@ -1469,6 +1469,103 @@ IO Interface Files
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
io.cost.qos
A read-write nested-keyed file with exists only on the root
cgroup.
This file configures the Quality of Service of the IO cost
model based controller (CONFIG_BLK_CGROUP_IOCOST) which
currently implements "io.weight" proportional control. Lines
are keyed by $MAJ:$MIN device numbers and not ordered. The
line for a given device is populated on the first write for
the device on "io.cost.qos" or "io.cost.model". The following
nested keys are defined.
====== =====================================
enable Weight-based control enable
ctrl "auto" or "user"
rpct Read latency percentile [0, 100]
rlat Read latency threshold
wpct Write latency percentile [0, 100]
wlat Write latency threshold
min Minimum scaling percentage [1, 10000]
max Maximum scaling percentage [1, 10000]
====== =====================================
The controller is disabled by default and can be enabled by
setting "enable" to 1. "rpct" and "wpct" parameters default
to zero and the controller uses internal device saturation
state to adjust the overall IO rate between "min" and "max".
When a better control quality is needed, latency QoS
parameters can be configured. For example::
8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0
shows that on sdb, the controller is enabled, will consider
the device saturated if the 95th percentile of read completion
latencies is above 75ms or write 150ms, and adjust the overall
IO issue rate between 50% and 150% accordingly.
The lower the saturation point, the better the latency QoS at
the cost of aggregate bandwidth. The narrower the allowed
adjustment range between "min" and "max", the more conformant
to the cost model the IO behavior. Note that the IO issue
base rate may be far off from 100% and setting "min" and "max"
blindly can lead to a significant loss of device capacity or
control quality. "min" and "max" are useful for regulating
devices which show wide temporary behavior changes - e.g. a
ssd which accepts writes at the line speed for a while and
then completely stalls for multiple seconds.
When "ctrl" is "auto", the parameters are controlled by the
kernel and may change automatically. Setting "ctrl" to "user"
or setting any of the percentile and latency parameters puts
it into "user" mode and disables the automatic changes. The
automatic mode can be restored by setting "ctrl" to "auto".
io.cost.model
A read-write nested-keyed file with exists only on the root
cgroup.
This file configures the cost model of the IO cost model based
controller (CONFIG_BLK_CGROUP_IOCOST) which currently
implements "io.weight" proportional control. Lines are keyed
by $MAJ:$MIN device numbers and not ordered. The line for a
given device is populated on the first write for the device on
"io.cost.qos" or "io.cost.model". The following nested keys
are defined.
===== ================================
ctrl "auto" or "user"
model The cost model in use - "linear"
===== ================================
When "ctrl" is "auto", the kernel may change all parameters
dynamically. When "ctrl" is set to "user" or any other
parameters are written to, "ctrl" become "user" and the
automatic changes are disabled.
When "model" is "linear", the following model parameters are
defined.
============= ========================================
[r|w]bps The maximum sequential IO throughput
[r|w]seqiops The maximum 4k sequential IOs per second
[r|w]randiops The maximum 4k random IOs per second
============= ========================================
From the above, the builtin linear model determines the base
costs of a sequential and random IO and the cost coefficient
for the IO size. While simple, this model can cover most
common device classes acceptably.
The IO cost model isn't expected to be accurate in absolute
sense and is scaled to the device behavior dynamically.
If needed, tools/cgroup/iocost_coef_gen.py can be used to
generate device-specific coefficients.
io.weight
A read-write flat-keyed file which exists on non-root cgroups.
The default is "default 100".
......
......@@ -1201,12 +1201,6 @@
See comment before function elanfreq_setup() in
arch/x86/kernel/cpu/cpufreq/elanfreq.c.
elevator= [IOSCHED]
Format: { "mq-deadline" | "kyber" | "bfq" }
See Documentation/block/deadline-iosched.rst,
Documentation/block/kyber-iosched.rst and
Documentation/block/bfq-iosched.rst for details.
elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390]
Specifies physical address of start of kernel core
image elf header and optionally the size. Generally
......
......@@ -274,9 +274,7 @@ To reduce its OS jitter, do any of the following:
(based on an earlier one from Gilad Ben-Yossef) that
reduces or even eliminates vmstat overhead for some
workloads at https://lkml.org/lkml/2013/9/4/379.
e. Boot with "elevator=noop" to avoid workqueue use by
the block layer.
f. If running on high-end powerpc servers, build with
e. If running on high-end powerpc servers, build with
CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
daemon from running on each CPU every second or so.
(This will require editing Kconfig files and will defeat
......@@ -284,12 +282,12 @@ To reduce its OS jitter, do any of the following:
due to the rtas_event_scan() function.
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
g. If running on Cell Processor, build your kernel with
f. If running on Cell Processor, build your kernel with
CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
spu_gov_work().
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
h. If running on PowerMAC, build your kernel with
g. If running on PowerMAC, build your kernel with
CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
avoiding OS jitter from rackmeter_do_timer().
......
.. SPDX-License-Identifier: GPL-2.0
========================
Null block device driver
========================
1. Overview
===========
Overview
========
The null block device (/dev/nullb*) is used for benchmarking the various
The null block device (``/dev/nullb*``) is used for benchmarking the various
block-layer implementations. It emulates a block device of X gigabytes in size.
The following instances are possible:
Single-queue block-layer
- Request-based.
- Single submission queue per device.
- Implements IO scheduling algorithms (CFQ, Deadline, noop).
It does not execute any read/write operation, just mark them as complete in
the request queue. The following instances are possible:
Multi-queue block-layer
......@@ -27,15 +24,15 @@ The following instances are possible:
All of them have a completion queue for each core in the system.
2. Module parameters applicable for all instances
=================================================
Module parameters
=================
queue_mode=[0-2]: Default: 2-Multi-queue
Selects which block-layer the module should instantiate with.
= ============
0 Bio-based
1 Single-queue
1 Single-queue (deprecated)
2 Multi-queue
= ============
......@@ -67,7 +64,7 @@ irqmode=[0-2]: Default: 1-Soft-irq
completion_nsec=[ns]: Default: 10,000ns
Combined with irqmode=2 (timer). The time each completion event must wait.
submit_queues=[1..nr_cpus]:
submit_queues=[1..nr_cpus]: Default: 1
The number of submission queues attached to the device driver. If unset, it
defaults to 1. For multi-queue, it is ignored when use_per_node_hctx module
parameter is 1.
......@@ -75,9 +72,11 @@ submit_queues=[1..nr_cpus]:
hw_queue_depth=[0..qdepth]: Default: 64
The hardware queue depth of the device.
III: Multi-queue specific parameters
Multi-queue specific parameters
-------------------------------
use_per_node_hctx=[0/1]: Default: 0
Number of hardware context queues.
= =====================================================================
0 The number of submit queues are set to the value of the submit_queues
......@@ -87,6 +86,7 @@ use_per_node_hctx=[0/1]: Default: 0
= =====================================================================
no_sched=[0/1]: Default: 0
Enable/disable the io scheduler.
= ======================================
0 nullb* use default blk-mq io scheduler
......@@ -94,6 +94,7 @@ no_sched=[0/1]: Default: 0
= ======================================
blocking=[0/1]: Default: 0
Blocking behavior of the request queue.
= ===============================================================
0 Register as a non-blocking blk-mq driver device.
......@@ -103,6 +104,7 @@ blocking=[0/1]: Default: 0
= ===============================================================
shared_tags=[0/1]: Default: 0
Sharing tags between devices.
= ================================================================
0 Tag set is not shared.
......@@ -111,6 +113,7 @@ shared_tags=[0/1]: Default: 0
= ================================================================
zoned=[0/1]: Default: 0
Device is a random-access or a zoned block device.
= ======================================================================
0 Block device is exposed as a random-access block device.
......
......@@ -2,10 +2,6 @@
Switching Scheduler
===================
To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
'noop' and 'cfq' (the default) are also available. IO schedulers are assigned
globally at boot time only presently.
Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in::
......
......@@ -26,6 +26,9 @@ menuconfig BLOCK
if BLOCK
config BLK_RQ_ALLOC_TIME
bool
config BLK_SCSI_REQUEST
bool
......@@ -132,6 +135,16 @@ config BLK_CGROUP_IOLATENCY
Note, this is an experimental interface and could be changed someday.
config BLK_CGROUP_IOCOST
bool "Enable support for cost model based cgroup IO controller"
depends on BLK_CGROUP=y
select BLK_RQ_ALLOC_TIME
---help---
Enabling this option enables the .weight interface for cost
model based proportional IO control. The IO controller
distributes IO capacity between different groups based on
their share of the overall weight distribution.
config BLK_WBT_MQ
bool "Multiqueue writeback throttling"
default y
......
......@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
......
......@@ -501,11 +501,12 @@ static void bfq_cpd_free(struct blkcg_policy_data *cpd)
kfree(cpd_to_bfqgd(cpd));
}
static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, struct request_queue *q,
struct blkcg *blkcg)
{
struct bfq_group *bfqg;
bfqg = kzalloc_node(sizeof(*bfqg), gfp, node);
bfqg = kzalloc_node(sizeof(*bfqg), gfp, q->node);
if (!bfqg)
return NULL;
......@@ -904,7 +905,7 @@ void bfq_end_wr_async(struct bfq_data *bfqd)
bfq_end_wr_async_queues(bfqd, bfqd->root_group);
}
static int bfq_io_show_weight(struct seq_file *sf, void *v)
static int bfq_io_show_weight_legacy(struct seq_file *sf, void *v)
{
struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
......@@ -918,6 +919,60 @@ static int bfq_io_show_weight(struct seq_file *sf, void *v)
return 0;
}
static u64 bfqg_prfill_weight_device(struct seq_file *sf,
struct blkg_policy_data *pd, int off)
{
struct bfq_group *bfqg = pd_to_bfqg(pd);
if (!bfqg->entity.dev_weight)
return 0;
return __blkg_prfill_u64(sf, pd, bfqg->entity.dev_weight);
}
static int bfq_io_show_weight(struct seq_file *sf, void *v)
{
struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
seq_printf(sf, "default %u\n", bfqgd->weight);
blkcg_print_blkgs(sf, blkcg, bfqg_prfill_weight_device,
&blkcg_policy_bfq, 0, false);
return 0;
}
static void bfq_group_set_weight(struct bfq_group *bfqg, u64 weight, u64 dev_weight)
{
weight = dev_weight ?: weight;
bfqg->entity.dev_weight = dev_weight;
/*
* Setting the prio_changed flag of the entity
* to 1 with new_weight == weight would re-set
* the value of the weight to its ioprio mapping.
* Set the flag only if necessary.
*/
if ((unsigned short)weight != bfqg->entity.new_weight) {
bfqg->entity.new_weight = (unsigned short)weight;
/*
* Make sure that the above new value has been
* stored in bfqg->entity.new_weight before
* setting the prio_changed flag. In fact,
* this flag may be read asynchronously (in
* critical sections protected by a different
* lock than that held here), and finding this
* flag set may cause the execution of the code
* for updating parameters whose value may
* depend also on bfqg->entity.new_weight (in
* __bfq_entity_update_weight_prio).
* This barrier makes sure that the new value
* of bfqg->entity.new_weight is correctly
* seen in that code.
*/
smp_wmb();
bfqg->entity.prio_changed = 1;
}
}
static int bfq_io_set_weight_legacy(struct cgroup_subsys_state *css,
struct cftype *cftype,
u64 val)
......@@ -936,55 +991,72 @@ static int bfq_io_set_weight_legacy(struct cgroup_subsys_state *css,
hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
struct bfq_group *bfqg = blkg_to_bfqg(blkg);
if (!bfqg)
continue;
/*
* Setting the prio_changed flag of the entity
* to 1 with new_weight == weight would re-set
* the value of the weight to its ioprio mapping.
* Set the flag only if necessary.
*/
if ((unsigned short)val != bfqg->entity.new_weight) {
bfqg->entity.new_weight = (unsigned short)val;
/*
* Make sure that the above new value has been
* stored in bfqg->entity.new_weight before
* setting the prio_changed flag. In fact,
* this flag may be read asynchronously (in
* critical sections protected by a different
* lock than that held here), and finding this
* flag set may cause the execution of the code
* for updating parameters whose value may
* depend also on bfqg->entity.new_weight (in
* __bfq_entity_update_weight_prio).
* This barrier makes sure that the new value
* of bfqg->entity.new_weight is correctly
* seen in that code.
*/
smp_wmb();
bfqg->entity.prio_changed = 1;
}
if (bfqg)
bfq_group_set_weight(bfqg, val, 0);
}
spin_unlock_irq(&blkcg->lock);
return ret;
}
static ssize_t bfq_io_set_weight(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
static ssize_t bfq_io_set_device_weight(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
u64 weight;
/* First unsigned long found in the file is used */
int ret = kstrtoull(strim(buf), 0, &weight);
int ret;
struct blkg_conf_ctx ctx;
struct blkcg *blkcg = css_to_blkcg(of_css(of));
struct bfq_group *bfqg;
u64 v;
ret = blkg_conf_prep(blkcg, &blkcg_policy_bfq, buf, &ctx);
if (ret)
return ret;
ret = bfq_io_set_weight_legacy(of_css(of), NULL, weight);
if (sscanf(ctx.body, "%llu", &v) == 1) {
/* require "default" on dfl */
ret = -ERANGE;
if (!v)
goto out;
} else if (!strcmp(strim(ctx.body), "default")) {
v = 0;
} else {
ret = -EINVAL;
goto out;
}
bfqg = blkg_to_bfqg(ctx.blkg);
ret = -ERANGE;
if (!v || (v >= BFQ_MIN_WEIGHT && v <= BFQ_MAX_WEIGHT)) {
bfq_group_set_weight(bfqg, bfqg->entity.weight, v);
ret = 0;
}
out:
blkg_conf_finish(&ctx);
return ret ?: nbytes;
}
static ssize_t bfq_io_set_weight(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
char *endp;
int ret;
u64 v;
buf = strim(buf);
/* "WEIGHT" or "default WEIGHT" sets the default weight */
v = simple_strtoull(buf, &endp, 0);
if (*endp == '\0' || sscanf(buf, "default %llu", &v) == 1) {
ret = bfq_io_set_weight_legacy(of_css(of), NULL, v);
return ret ?: nbytes;
}
return bfq_io_set_device_weight(of, buf, nbytes, off);
}
#ifdef CONFIG_BFQ_CGROUP_DEBUG
static int bfqg_print_stat(struct seq_file *sf, void *v)
{
......@@ -1141,9 +1213,15 @@ struct cftype bfq_blkcg_legacy_files[] = {
{
.name = "bfq.weight",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = bfq_io_show_weight,
.seq_show = bfq_io_show_weight_legacy,
.write_u64 = bfq_io_set_weight_legacy,
},
{
.name = "bfq.weight_device",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = bfq_io_show_weight,
.write = bfq_io_set_weight,
},
/* statistics, covers only the tasks in the bfqg */
{
......
......@@ -168,6 +168,9 @@ struct bfq_entity {
/* budget, used also to calculate F_i: F_i = S_i + @budget / @weight */
int budget;
/* device weight, if non-zero, it overrides the default weight of
* bfq_group_data */
int dev_weight;
/* weight of the queue */
int weight;
/* next weight if a change is in progress */
......
......@@ -744,6 +744,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
}
#endif
/* Matches the smp_wmb() in bfq_group_set_weight. */
smp_rmb();
old_st->wsum -= entity->weight;
if (entity->new_weight != entity->orig_weight) {
......
......@@ -646,25 +646,20 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
return true;
}
/*
* Check if the @page can be added to the current segment(@bv), and make
* sure to call it only if page_is_mergeable(@bv, @page) is true
*/
static bool can_add_page_to_seg(struct request_queue *q,
struct bio_vec *bv, struct page *page, unsigned len,
unsigned offset)
static bool bio_try_merge_pc_page(struct request_queue *q, struct bio *bio,
struct page *page, unsigned len, unsigned offset,
bool *same_page)
{
struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
unsigned long mask = queue_segment_boundary(q);
phys_addr_t addr1 = page_to_phys(bv->bv_page) + bv->bv_offset;
phys_addr_t addr2 = page_to_phys(page) + offset + len - 1;
if ((addr1 | mask) != (addr2 | mask))
return false;
if (bv->bv_len + len > queue_max_segment_size(q))
return false;
return true;
return __bio_try_merge_page(bio, page, len, offset, same_page);
}
/**
......@@ -674,7 +669,7 @@ static bool can_add_page_to_seg(struct request_queue *q,
* @page: page to add
* @len: vec entry length
* @offset: vec entry offset
* @put_same_page: put the page if it is same with last added page
* @same_page: return if the merge happen inside the same page
*
* Attempt to add a page to the bio_vec maplist. This can fail for a
* number of reasons, such as the bio being full or target block device
......@@ -685,10 +680,9 @@ static bool can_add_page_to_seg(struct request_queue *q,
*/
static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
struct page *page, unsigned int len, unsigned int offset,
bool put_same_page)
bool *same_page)
{
struct bio_vec *bvec;
bool same_page = false;
/*
* cloned bio must not modify vec list
......@@ -700,28 +694,16 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
return 0;
if (bio->bi_vcnt > 0) {
bvec = &bio->bi_io_vec[bio->bi_vcnt - 1];
if (page == bvec->bv_page &&
offset == bvec->bv_offset + bvec->bv_len) {
if (put_same_page)
put_page(page);
bvec->bv_len += len;
goto done;
}
if (bio_try_merge_pc_page(q, bio, page, len, offset, same_page))
return len;
/*
* If the queue doesn't support SG gaps and adding this
* offset would create a gap, disallow it.
* If the queue doesn't support SG gaps and adding this segment
* would create a gap, disallow it.
*/
bvec = &bio->bi_io_vec[bio->bi_vcnt - 1];
if (bvec_gap_to_prev(q, bvec, offset))
return 0;
if (page_is_mergeable(bvec, page, len, offset, &same_page) &&
can_add_page_to_seg(q, bvec, page, len, offset)) {
bvec->bv_len += len;
goto done;
}
}
if (bio_full(bio, len))
......@@ -735,7 +717,6 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
bvec->bv_len = len;
bvec->bv_offset = offset;
bio->bi_vcnt++;
done:
bio->bi_iter.bi_size += len;
return len;
}
......@@ -743,7 +724,8 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
int bio_add_pc_page(struct request_queue *q, struct bio *bio,
struct page *page, unsigned int len, unsigned int offset)
{
return __bio_add_pc_page(q, bio, page, len, offset, false);
bool same_page = false;
return __bio_add_pc_page(q, bio, page, len, offset, &same_page);
}
EXPORT_SYMBOL(bio_add_pc_page);
......@@ -806,6 +788,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
bio->bi_iter.bi_size += len;
bio->bi_vcnt++;
if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
bio_set_flag(bio, BIO_WORKINGSET);
}
EXPORT_SYMBOL_GPL(__bio_add_page);
......@@ -1384,13 +1369,17 @@ struct bio *bio_map_user_iov(struct request_queue *q,
for (j = 0; j < npages; j++) {
struct page *page = pages[j];
unsigned int n = PAGE_SIZE - offs;
bool same_page = false;
if (n > bytes)
n = bytes;
if (!__bio_add_pc_page(q, bio, page, n, offs,
true))
&same_page)) {
if (same_page)
put_page(page);
break;
}
added += n;
bytes -= n;
......@@ -1521,7 +1510,6 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
bio->bi_end_io = bio_map_kern_endio;
return bio;
}
EXPORT_SYMBOL(bio_map_kern);
static void bio_copy_kern_endio(struct bio *bio)
{
......@@ -1842,8 +1830,8 @@ EXPORT_SYMBOL(bio_endio);
* @bio, and updates @bio to represent the remaining sectors.
*
* Unless this is a discard request the newly allocated bio will point
* to @bio's bi_io_vec; it is the caller's responsibility to ensure that
* @bio is not freed before the split.
* to @bio's bi_io_vec. It is the caller's responsibility to ensure that
* neither @bio nor @bs are freed before the split bio.
*/
struct bio *bio_split(struct bio *bio, int sectors,
gfp_t gfp, struct bio_set *bs)
......
......@@ -175,7 +175,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
continue;
/* alloc per-policy data and attach it to blkg */
pd = pol->pd_alloc_fn(gfp_mask, q->node);
pd = pol->pd_alloc_fn(gfp_mask, q, blkcg);
if (!pd)
goto err_free;
......@@ -753,6 +753,44 @@ static struct blkcg_gq *blkg_lookup_check(struct blkcg *blkcg,
return __blkg_lookup(blkcg, q, true /* update_hint */);
}
/**
* blkg_conf_prep - parse and prepare for per-blkg config update
* @inputp: input string pointer
*
* Parse the device node prefix part, MAJ:MIN, of per-blkg config update
* from @input and get and return the matching gendisk. *@inputp is
* updated to point past the device node prefix. Returns an ERR_PTR()
* value on error.
*
* Use this function iff blkg_conf_prep() can't be used for some reason.
*/
struct gendisk *blkcg_conf_get_disk(char **inputp)
{
char *input = *inputp;
unsigned int major, minor;
struct gendisk *disk;
int key_len, part;
if (sscanf(input, "%u:%u%n", &major, &minor, &key_len) != 2)
return ERR_PTR(-EINVAL);
input += key_len;
if (!isspace(*input))
return ERR_PTR(-EINVAL);
input = skip_spaces(input);
disk = get_gendisk(MKDEV(major, minor), &part);
if (!disk)
return ERR_PTR(-ENODEV);
if (part) {
put_disk_and_module(disk);
return ERR_PTR(-ENODEV);
}
*inputp = input;
return disk;
}
/**
* blkg_conf_prep - parse and prepare for per-blkg config update
* @blkcg: target block cgroup
......@@ -772,25 +810,11 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
struct gendisk *disk;
struct request_queue *q;
struct blkcg_gq *blkg;
unsigned int major, minor;
int key_len, part, ret;
char *body;
if (sscanf(input, "%u:%u%n", &major, &minor, &key_len) != 2)
return -EINVAL;
body = input + key_len;
if (!isspace(*body))
return -EINVAL;
body = skip_spaces(body);
int ret;
disk = get_gendisk(MKDEV(major, minor), &part);
if (!disk)
return -ENODEV;
if (part) {
ret = -ENODEV;
goto fail;
}
disk = blkcg_conf_get_disk(&input);
if (IS_ERR(disk))
return PTR_ERR(disk);
q = disk->queue;
......@@ -856,7 +880,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
success:
ctx->disk = disk;
ctx->blkg = blkg;
ctx->body = body;
ctx->body = input;
return 0;
fail_unlock:
......@@ -876,6 +900,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
}
return ret;
}
EXPORT_SYMBOL_GPL(blkg_conf_prep);
/**
* blkg_conf_finish - finish up per-blkg config update
......@@ -891,6 +916,7 @@ void blkg_conf_finish(struct blkg_conf_ctx *ctx)
rcu_read_unlock();
put_disk_and_module(ctx->disk);
}
EXPORT_SYMBOL_GPL(blkg_conf_finish);
static int blkcg_print_stat(struct seq_file *sf, void *v)
{
......@@ -1346,7 +1372,7 @@ int blkcg_activate_policy(struct request_queue *q,
blk_mq_freeze_queue(q);
pd_prealloc:
if (!pd_prealloc) {
pd_prealloc = pol->pd_alloc_fn(GFP_KERNEL, q->node);
pd_prealloc = pol->pd_alloc_fn(GFP_KERNEL, q, &blkcg_root);
if (!pd_prealloc) {
ret = -ENOMEM;
goto out_bypass_end;
......@@ -1362,7 +1388,7 @@ int blkcg_activate_policy(struct request_queue *q,
if (blkg->pd[pol->plid])
continue;
pd = pol->pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, q->node);
pd = pol->pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, q, &blkcg_root);
if (!pd)
swap(pd, pd_prealloc);
if (!pd) {
......@@ -1475,7 +1501,8 @@ int blkcg_policy_register(struct blkcg_policy *pol)
blkcg->cpd[pol->plid] = cpd;
cpd->blkcg = blkcg;
cpd->plid = pol->plid;
pol->cpd_init_fn(cpd);
if (pol->cpd_init_fn)
pol->cpd_init_fn(cpd);
}
}
......
......@@ -36,6 +36,7 @@
#include <linux/blk-cgroup.h>
#include <linux/debugfs.h>
#include <linux/bpf.h>
#include <linux/psi.h>
#define CREATE_TRACE_POINTS
#include <trace/events/block.h>
......@@ -129,6 +130,7 @@ static const char *const blk_op_name[] = {
REQ_OP_NAME(DISCARD),
REQ_OP_NAME(SECURE_ERASE),
REQ_OP_NAME(ZONE_RESET),
REQ_OP_NAME(ZONE_RESET_ALL),
REQ_OP_NAME(WRITE_SAME),
REQ_OP_NAME(WRITE_ZEROES),
REQ_OP_NAME(SCSI_IN),
......@@ -344,7 +346,8 @@ void blk_cleanup_queue(struct request_queue *q)
/*
* Drain all requests queued before DYING marking. Set DEAD flag to
* prevent that q->request_fn() gets invoked after draining finished.
* prevent that blk_mq_run_hw_queues() accesses the hardware queues
* after draining finished.
*/
blk_freeze_queue(q);
......@@ -479,7 +482,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (!q)
return NULL;
INIT_LIST_HEAD(&q->queue_head);
q->last_merge = NULL;
q->id = ida_simple_get(&blk_queue_ida, 0, 0, gfp_mask);
......@@ -518,6 +520,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
mutex_init(&q->blk_trace_mutex);
#endif
mutex_init(&q->sysfs_lock);
mutex_init(&q->sysfs_dir_lock);
spin_lock_init(&q->queue_lock);
init_waitqueue_head(&q->mq_freeze_wq);
......@@ -601,6 +604,7 @@ bool bio_attempt_back_merge(struct request *req, struct bio *bio,
return false;
trace_block_bio_backmerge(req->q, req, bio);
rq_qos_merge(req->q, req, bio);
if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
blk_rq_set_mixed_merge(req);
......@@ -622,6 +626,7 @@ bool bio_attempt_front_merge(struct request *req, struct bio *bio,
return false;
trace_block_bio_frontmerge(req->q, req, bio);
rq_qos_merge(req->q, req, bio);
if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
blk_rq_set_mixed_merge(req);
......@@ -647,6 +652,8 @@ bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
blk_rq_get_max_sectors(req, blk_rq_pos(req)))
goto no_merge;
rq_qos_merge(q, req, bio);
req->biotail->bi_next = bio;
req->biotail = bio;
req->__data_len += bio->bi_iter.bi_size;
......@@ -931,6 +938,10 @@ generic_make_request_checks(struct bio *bio)
if (!blk_queue_is_zoned(q))
goto not_supported;
break;
case REQ_OP_ZONE_RESET_ALL:
if (!blk_queue_is_zoned(q) || !blk_queue_zone_resetall(q))
goto not_supported;
break;
case REQ_OP_WRITE_ZEROES:
if (!q->limits.max_write_zeroes_sectors)
goto not_supported;
......@@ -1128,6 +1139,10 @@ EXPORT_SYMBOL_GPL(direct_make_request);
*/
blk_qc_t submit_bio(struct bio *bio)
{
bool workingset_read = false;
unsigned long pflags;
blk_qc_t ret;
if (blkcg_punt_bio_submit(bio))
return BLK_QC_T_NONE;
......@@ -1146,6 +1161,8 @@ blk_qc_t submit_bio(struct bio *bio)
if (op_is_write(bio_op(bio))) {
count_vm_events(PGPGOUT, count);
} else {
if (bio_flagged(bio, BIO_WORKINGSET))
workingset_read = true;
task_io_account_read(bio->bi_iter.bi_size);
count_vm_events(PGPGIN, count);
}
......@@ -1160,7 +1177,21 @@ blk_qc_t submit_bio(struct bio *bio)
}
}
return generic_make_request(bio);
/*
* If we're reading data that is part of the userspace
* workingset, count submission time as memory stall. When the
* device is congested, or the submitting cgroup IO-throttled,
* submission can be a significant part of overall IO time.
*/
if (workingset_read)
psi_memstall_enter(&pflags);
ret = generic_make_request(bio);
if (workingset_read)
psi_memstall_leave(&pflags);
return ret;
}
EXPORT_SYMBOL(submit_bio);
......
This diff is collapsed.
......@@ -725,7 +725,7 @@ int blk_iolatency_init(struct request_queue *q)
return -ENOMEM;
rqos = &blkiolat->rqos;
rqos->id = RQ_QOS_CGROUP;
rqos->id = RQ_QOS_LATENCY;
rqos->ops = &blkcg_iolatency_ops;
rqos->q = q;
......@@ -934,11 +934,13 @@ static size_t iolatency_pd_stat(struct blkg_policy_data *pd, char *buf,
}
static struct blkg_policy_data *iolatency_pd_alloc(gfp_t gfp, int node)
static struct blkg_policy_data *iolatency_pd_alloc(gfp_t gfp,
struct request_queue *q,
struct blkcg *blkcg)
{
struct iolatency_grp *iolat;
iolat = kzalloc_node(sizeof(*iolat), gfp, node);
iolat = kzalloc_node(sizeof(*iolat), gfp, q->node);
if (!iolat)
return NULL;
iolat->stats = __alloc_percpu_gfp(sizeof(struct latency_stat),
......
......@@ -132,19 +132,32 @@ static struct bio *blk_bio_write_same_split(struct request_queue *q,
return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs);
}
/*
* Return the maximum number of sectors from the start of a bio that may be
* submitted as a single request to a block device. If enough sectors remain,
* align the end to the physical block size. Otherwise align the end to the
* logical block size. This approach minimizes the number of non-aligned
* requests that are submitted to a block device if the start of a bio is not
* aligned to a physical block boundary.
*/
static inline unsigned get_max_io_size(struct request_queue *q,
struct bio *bio)
{
unsigned sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
unsigned mask = queue_logical_block_size(q) - 1;
unsigned max_sectors = sectors;
unsigned pbs = queue_physical_block_size(q) >> SECTOR_SHIFT;
unsigned lbs = queue_logical_block_size(q) >> SECTOR_SHIFT;
unsigned start_offset = bio->bi_iter.bi_sector & (pbs - 1);
/* aligned to logical block size */
sectors &= ~(mask >> 9);
max_sectors += start_offset;
max_sectors &= ~(pbs - 1);
if (max_sectors > start_offset)
return max_sectors - start_offset;
return sectors;
return sectors & (lbs - 1);
}
static unsigned get_max_segment_size(struct request_queue *q,
static unsigned get_max_segment_size(const struct request_queue *q,
unsigned offset)
{
unsigned long mask = queue_segment_boundary(q);
......@@ -157,26 +170,41 @@ static unsigned get_max_segment_size(struct request_queue *q,
queue_max_segment_size(q));
}
/*
* Split the bvec @bv into segments, and update all kinds of
* variables.
/**
* bvec_split_segs - verify whether or not a bvec should be split in the middle
* @q: [in] request queue associated with the bio associated with @bv
* @bv: [in] bvec to examine
* @nsegs: [in,out] Number of segments in the bio being built. Incremented
* by the number of segments from @bv that may be appended to that
* bio without exceeding @max_segs
* @sectors: [in,out] Number of sectors in the bio being built. Incremented
* by the number of sectors from @bv that may be appended to that
* bio without exceeding @max_sectors
* @max_segs: [in] upper bound for *@nsegs
* @max_sectors: [in] upper bound for *@sectors
*
* When splitting a bio, it can happen that a bvec is encountered that is too
* big to fit in a single segment and hence that it has to be split in the
* middle. This function verifies whether or not that should happen. The value
* %true is returned if and only if appending the entire @bv to a bio with
* *@nsegs segments and *@sectors sectors would make that bio unacceptable for
* the block driver.
*/
static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
unsigned *nsegs, unsigned *sectors, unsigned max_segs)
static bool bvec_split_segs(const struct request_queue *q,
const struct bio_vec *bv, unsigned *nsegs,
unsigned *sectors, unsigned max_segs,
unsigned max_sectors)
{
unsigned len = bv->bv_len;
unsigned max_len = (min(max_sectors, UINT_MAX >> 9) - *sectors) << 9;
unsigned len = min(bv->bv_len, max_len);
unsigned total_len = 0;
unsigned new_nsegs = 0, seg_size = 0;
unsigned seg_size = 0;
/*
* Multi-page bvec may be too big to hold in one segment, so the
* current bvec has to be splitted as multiple segments.
*/
while (len && new_nsegs + *nsegs < max_segs) {
while (len && *nsegs < max_segs) {
seg_size = get_max_segment_size(q, bv->bv_offset + total_len);
seg_size = min(seg_size, len);
new_nsegs++;
(*nsegs)++;
total_len += seg_size;
len -= seg_size;
......@@ -184,16 +212,31 @@ static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
break;
}
if (new_nsegs) {
*nsegs += new_nsegs;
if (sectors)
*sectors += total_len >> 9;
}
*sectors += total_len >> 9;
/* split in the middle of the bvec if len != 0 */
return !!len;
/* tell the caller to split the bvec if it is too big to fit */
return len > 0 || bv->bv_len > max_len;
}
/**
* blk_bio_segment_split - split a bio in two bios
* @q: [in] request queue pointer
* @bio: [in] bio to be split
* @bs: [in] bio set to allocate the clone from
* @segs: [out] number of segments in the bio with the first half of the sectors
*
* Clone @bio, update the bi_iter of the clone to represent the first sectors
* of @bio and update @bio->bi_iter to represent the remaining sectors. The
* following is guaranteed for the cloned bio:
* - That it has at most get_max_io_size(@q, @bio) sectors.
* - That it has at most queue_max_segments(@q) segments.
*
* Except for discard requests the cloned bio will point at the bi_io_vec of
* the original bio. It is the responsibility of the caller to ensure that the
* original bio is not freed before the cloned bio. The caller is also
* responsible for ensuring that @bs is only destroyed after processing of the
* split bio has finished.
*/
static struct bio *blk_bio_segment_split(struct request_queue *q,
struct bio *bio,
struct bio_set *bs,
......@@ -213,34 +256,18 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
goto split;
if (sectors + (bv.bv_len >> 9) > max_sectors) {
/*
* Consider this a new segment if we're splitting in
* the middle of this vector.
*/
if (nsegs < max_segs &&
sectors < max_sectors) {
/* split in the middle of bvec */
bv.bv_len = (max_sectors - sectors) << 9;
bvec_split_segs(q, &bv, &nsegs,
&sectors, max_segs);
}
if (nsegs < max_segs &&
sectors + (bv.bv_len >> 9) <= max_sectors &&
bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
nsegs++;
sectors += bv.bv_len >> 9;
} else if (bvec_split_segs(q, &bv, &nsegs, &sectors, max_segs,
max_sectors)) {
goto split;
}
if (nsegs == max_segs)
goto split;
bvprv = bv;
bvprvp = &bvprv;
if (bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
nsegs++;
sectors += bv.bv_len >> 9;
} else if (bvec_split_segs(q, &bv, &nsegs, &sectors,
max_segs)) {
goto split;
}
}
*segs = nsegs;
......@@ -250,6 +277,19 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
return bio_split(bio, sectors, GFP_NOIO, bs);
}
/**
* __blk_queue_split - split a bio and submit the second half
* @q: [in] request queue pointer
* @bio: [in, out] bio to be split
* @nr_segs: [out] number of segments in the first bio
*
* Split a bio into two bios, chain the two bios, submit the second half and
* store a pointer to the first half in *@bio. If the second bio is still too
* big it will be split by a recursive call to this function. Since this
* function may allocate a new bio from @q->bio_split, it is the responsibility
* of the caller to ensure that @q is only released after processing of the
* split bio has finished.
*/
void __blk_queue_split(struct request_queue *q, struct bio **bio,
unsigned int *nr_segs)
{
......@@ -294,6 +334,17 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
}
}
/**
* blk_queue_split - split a bio and submit the second half
* @q: [in] request queue pointer
* @bio: [in, out] bio to be split
*
* Split a bio into two bios, chains the two bios, submit the second half and
* store a pointer to the first half in *@bio. Since this function may allocate
* a new bio from @q->bio_split, it is the responsibility of the caller to
* ensure that @q is only released after processing of the split bio has
* finished.
*/
void blk_queue_split(struct request_queue *q, struct bio **bio)
{
unsigned int nr_segs;
......@@ -305,6 +356,7 @@ EXPORT_SYMBOL(blk_queue_split);
unsigned int blk_recalc_rq_segments(struct request *rq)
{
unsigned int nr_phys_segs = 0;
unsigned int nr_sectors = 0;
struct req_iterator iter;
struct bio_vec bv;
......@@ -321,7 +373,8 @@ unsigned int blk_recalc_rq_segments(struct request *rq)
}
rq_for_each_bvec(bv, rq, iter)
bvec_split_segs(rq->q, &bv, &nr_phys_segs, NULL, UINT_MAX);
bvec_split_segs(rq->q, &bv, &nr_phys_segs, &nr_sectors,
UINT_MAX, UINT_MAX);
return nr_phys_segs;
}
......
......@@ -15,10 +15,10 @@
#include "blk.h"
#include "blk-mq.h"
static int cpu_to_queue_index(struct blk_mq_queue_map *qmap,
unsigned int nr_queues, const int cpu)
static int queue_index(struct blk_mq_queue_map *qmap,
unsigned int nr_queues, const int q)
{
return qmap->queue_offset + (cpu % nr_queues);
return qmap->queue_offset + (q % nr_queues);
}
static int get_first_sibling(unsigned int cpu)
......@@ -36,21 +36,36 @@ int blk_mq_map_queues(struct blk_mq_queue_map *qmap)
{
unsigned int *map = qmap->mq_map;
unsigned int nr_queues = qmap->nr_queues;
unsigned int cpu, first_sibling;
unsigned int cpu, first_sibling, q = 0;
for_each_possible_cpu(cpu)
map[cpu] = -1;
/*
* Spread queues among present CPUs first for minimizing
* count of dead queues which are mapped by all un-present CPUs
*/
for_each_present_cpu(cpu) {
if (q >= nr_queues)
break;
map[cpu] = queue_index(qmap, nr_queues, q++);
}
for_each_possible_cpu(cpu) {
if (map[cpu] != -1)
continue;
/*
* First do sequential mapping between CPUs and queues.
* In case we still have CPUs to map, and we have some number of
* threads per cores then map sibling threads to the same queue
* for performance optimizations.
*/
if (cpu < nr_queues) {
map[cpu] = cpu_to_queue_index(qmap, nr_queues, cpu);
if (q < nr_queues) {
map[cpu] = queue_index(qmap, nr_queues, q++);
} else {
first_sibling = get_first_sibling(cpu);
if (first_sibling == cpu)
map[cpu] = cpu_to_queue_index(qmap, nr_queues, cpu);
map[cpu] = queue_index(qmap, nr_queues, q++);
else
map[cpu] = map[first_sibling];
}
......
......@@ -270,7 +270,7 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
int i;
lockdep_assert_held(&q->sysfs_lock);
lockdep_assert_held(&q->sysfs_dir_lock);
queue_for_each_hw_ctx(q, hctx, i)
blk_mq_unregister_hctx(hctx);
......@@ -320,7 +320,7 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
int ret, i;
WARN_ON_ONCE(!q->kobj.parent);
lockdep_assert_held(&q->sysfs_lock);
lockdep_assert_held(&q->sysfs_dir_lock);
ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
if (ret < 0)
......@@ -349,23 +349,12 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
return ret;
}
int blk_mq_register_dev(struct device *dev, struct request_queue *q)
{
int ret;
mutex_lock(&q->sysfs_lock);
ret = __blk_mq_register_dev(dev, q);
mutex_unlock(&q->sysfs_lock);
return ret;
}
void blk_mq_sysfs_unregister(struct request_queue *q)
{
struct blk_mq_hw_ctx *hctx;
int i;
mutex_lock(&q->sysfs_lock);
mutex_lock(&q->sysfs_dir_lock);
if (!q->mq_sysfs_init_done)
goto unlock;
......@@ -373,7 +362,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
blk_mq_unregister_hctx(hctx);
unlock:
mutex_unlock(&q->sysfs_lock);
mutex_unlock(&q->sysfs_dir_lock);
}
int blk_mq_sysfs_register(struct request_queue *q)
......@@ -381,7 +370,7 @@ int blk_mq_sysfs_register(struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
int i, ret = 0;
mutex_lock(&q->sysfs_lock);
mutex_lock(&q->sysfs_dir_lock);
if (!q->mq_sysfs_init_done)
goto unlock;
......@@ -392,7 +381,7 @@ int blk_mq_sysfs_register(struct request_queue *q)
}
unlock:
mutex_unlock(&q->sysfs_lock);
mutex_unlock(&q->sysfs_dir_lock);
return ret;
}
......@@ -10,6 +10,7 @@
#include <linux/module.h>
#include <linux/blk-mq.h>
#include <linux/delay.h>
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-tag.h"
......@@ -354,6 +355,37 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
}
EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
static bool blk_mq_tagset_count_completed_rqs(struct request *rq,
void *data, bool reserved)
{
unsigned *count = data;
if (blk_mq_request_completed(rq))
(*count)++;
return true;
}
/**
* blk_mq_tagset_wait_completed_request - wait until all completed req's
* complete funtion is run
* @tagset: Tag set to drain completed request
*
* Note: This function has to be run after all IO queues are shutdown
*/
void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset)
{
while (true) {
unsigned count = 0;
blk_mq_tagset_busy_iter(tagset,
blk_mq_tagset_count_completed_rqs, &count);
if (!count)
break;
msleep(5);
}
}
EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
/**
* blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag
* @q: Request queue to examine.
......
......@@ -44,12 +44,12 @@ static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
static int blk_mq_poll_stats_bkt(const struct request *rq)
{
int ddir, bytes, bucket;
int ddir, sectors, bucket;
ddir = rq_data_dir(rq);
bytes = blk_rq_bytes(rq);
sectors = blk_rq_stats_sectors(rq);
bucket = ddir + 2*(ilog2(bytes) - 9);
bucket = ddir + 2 * ilog2(sectors);
if (bucket < 0)
return -1;
......@@ -282,16 +282,16 @@ bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
EXPORT_SYMBOL(blk_mq_can_queue);
/*
* Only need start/end time stamping if we have stats enabled, or using
* an IO scheduler.
* Only need start/end time stamping if we have iostat or
* blk stats enabled, or using an IO scheduler.
*/
static inline bool blk_mq_need_time_stamp(struct request *rq)
{
return (rq->rq_flags & RQF_IO_STAT) || rq->q->elevator;
return (rq->rq_flags & (RQF_IO_STAT | RQF_STATS)) || rq->q->elevator;
}
static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
unsigned int tag, unsigned int op)
unsigned int tag, unsigned int op, u64 alloc_time_ns)
{
struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
struct request *rq = tags->static_rqs[tag];
......@@ -325,11 +325,15 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
RB_CLEAR_NODE(&rq->rb_node);
rq->rq_disk = NULL;
rq->part = NULL;
#ifdef CONFIG_BLK_RQ_ALLOC_TIME
rq->alloc_time_ns = alloc_time_ns;
#endif
if (blk_mq_need_time_stamp(rq))
rq->start_time_ns = ktime_get_ns();
else
rq->start_time_ns = 0;
rq->io_start_time_ns = 0;
rq->stats_sectors = 0;
rq->nr_phys_segments = 0;
#if defined(CONFIG_BLK_DEV_INTEGRITY)
rq->nr_integrity_segments = 0;
......@@ -356,8 +360,14 @@ static struct request *blk_mq_get_request(struct request_queue *q,
struct request *rq;
unsigned int tag;
bool clear_ctx_on_error = false;
u64 alloc_time_ns = 0;
blk_queue_enter_live(q);
/* alloc_time includes depth and tag waits */
if (blk_queue_rq_alloc_time(q))
alloc_time_ns = ktime_get_ns();
data->q = q;
if (likely(!data->ctx)) {
data->ctx = blk_mq_get_ctx(q);
......@@ -393,7 +403,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
return NULL;
}
rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags);
rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns);
if (!op_is_flush(data->cmd_flags)) {
rq->elv.icq = NULL;
if (e && e->type->ops.prepare_request) {
......@@ -652,19 +662,18 @@ bool blk_mq_complete_request(struct request *rq)
}
EXPORT_SYMBOL(blk_mq_complete_request);
void blk_mq_complete_request_sync(struct request *rq)
{
WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
rq->q->mq_ops->complete(rq);
}
EXPORT_SYMBOL_GPL(blk_mq_complete_request_sync);
int blk_mq_request_started(struct request *rq)
{
return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
}
EXPORT_SYMBOL_GPL(blk_mq_request_started);
int blk_mq_request_completed(struct request *rq)
{
return blk_mq_rq_state(rq) == MQ_RQ_COMPLETE;
}
EXPORT_SYMBOL_GPL(blk_mq_request_completed);
void blk_mq_start_request(struct request *rq)
{
struct request_queue *q = rq->q;
......@@ -673,9 +682,7 @@ void blk_mq_start_request(struct request *rq)
if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
rq->io_start_time_ns = ktime_get_ns();
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
rq->throtl_size = blk_rq_sectors(rq);
#endif
rq->stats_sectors = blk_rq_sectors(rq);
rq->rq_flags |= RQF_STATS;
rq_qos_issue(q, rq);
}
......@@ -2453,11 +2460,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
struct blk_mq_ctx *ctx;
struct blk_mq_tag_set *set = q->tag_set;
/*
* Avoid others reading imcomplete hctx->cpumask through sysfs
*/
mutex_lock(&q->sysfs_lock);
queue_for_each_hw_ctx(q, hctx, i) {
cpumask_clear(hctx->cpumask);
hctx->nr_ctx = 0;
......@@ -2518,8 +2520,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
HCTX_TYPE_DEFAULT, i);
}
mutex_unlock(&q->sysfs_lock);
queue_for_each_hw_ctx(q, hctx, i) {
/*
* If no software queues are mapped to this hardware queue,
......@@ -2688,7 +2688,11 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
if (!uninit_q)
return ERR_PTR(-ENOMEM);
q = blk_mq_init_allocated_queue(set, uninit_q);
/*
* Initialize the queue without an elevator. device_add_disk() will do
* the initialization.
*/
q = blk_mq_init_allocated_queue(set, uninit_q, false);
if (IS_ERR(q))
blk_cleanup_queue(uninit_q);
......@@ -2839,7 +2843,8 @@ static unsigned int nr_hw_queues(struct blk_mq_tag_set *set)
}
struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
struct request_queue *q)
struct request_queue *q,
bool elevator_init)
{
/* mark the queue as mq asap */
q->mq_ops = set->ops;
......@@ -2901,18 +2906,14 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
blk_mq_add_queue_tag_set(set, q);
blk_mq_map_swqueue(q);
if (!(set->flags & BLK_MQ_F_NO_SCHED)) {
int ret;
ret = elevator_init_mq(q);
if (ret)
return ERR_PTR(ret);
}
if (elevator_init)
elevator_init_mq(q);
return q;
err_hctxs:
kfree(q->queue_hw_ctx);
q->nr_hw_queues = 0;
err_sys_init:
blk_mq_sysfs_deinit(q);
err_poll:
......
......@@ -207,10 +207,12 @@ EXPORT_SYMBOL(blk_post_runtime_resume);
*/
void blk_set_runtime_active(struct request_queue *q)
{
spin_lock_irq(&q->queue_lock);
q->rpm_status = RPM_ACTIVE;
pm_runtime_mark_last_busy(q->dev);
pm_request_autosuspend(q->dev);
spin_unlock_irq(&q->queue_lock);
if (q->dev) {
spin_lock_irq(&q->queue_lock);
q->rpm_status = RPM_ACTIVE;
pm_runtime_mark_last_busy(q->dev);
pm_request_autosuspend(q->dev);
spin_unlock_irq(&q->queue_lock);
}
}
EXPORT_SYMBOL(blk_set_runtime_active);
......@@ -83,6 +83,15 @@ void __rq_qos_track(struct rq_qos *rqos, struct request *rq, struct bio *bio)
} while (rqos);
}
void __rq_qos_merge(struct rq_qos *rqos, struct request *rq, struct bio *bio)
{
do {
if (rqos->ops->merge)
rqos->ops->merge(rqos, rq, bio);
rqos = rqos->next;
} while (rqos);
}
void __rq_qos_done_bio(struct rq_qos *rqos, struct bio *bio)
{
do {
......@@ -92,6 +101,15 @@ void __rq_qos_done_bio(struct rq_qos *rqos, struct bio *bio)
} while (rqos);
}
void __rq_qos_queue_depth_changed(struct rq_qos *rqos)
{
do {
if (rqos->ops->queue_depth_changed)
rqos->ops->queue_depth_changed(rqos);
rqos = rqos->next;
} while (rqos);
}
/*
* Return true, if we can't increase the depth further by scaling
*/
......
......@@ -14,7 +14,8 @@ struct blk_mq_debugfs_attr;
enum rq_qos_id {
RQ_QOS_WBT,
RQ_QOS_CGROUP,
RQ_QOS_LATENCY,
RQ_QOS_COST,
};
struct rq_wait {
......@@ -35,11 +36,13 @@ struct rq_qos {
struct rq_qos_ops {
void (*throttle)(struct rq_qos *, struct bio *);
void (*track)(struct rq_qos *, struct request *, struct bio *);
void (*merge)(struct rq_qos *, struct request *, struct bio *);
void (*issue)(struct rq_qos *, struct request *);
void (*requeue)(struct rq_qos *, struct request *);
void (*done)(struct rq_qos *, struct request *);
void (*done_bio)(struct rq_qos *, struct bio *);
void (*cleanup)(struct rq_qos *, struct bio *);
void (*queue_depth_changed)(struct rq_qos *);
void (*exit)(struct rq_qos *);
const struct blk_mq_debugfs_attr *debugfs_attrs;
};
......@@ -72,7 +75,7 @@ static inline struct rq_qos *wbt_rq_qos(struct request_queue *q)
static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
{
return rq_qos_id(q, RQ_QOS_CGROUP);
return rq_qos_id(q, RQ_QOS_LATENCY);
}
static inline const char *rq_qos_id_to_name(enum rq_qos_id id)
......@@ -80,8 +83,10 @@ static inline const char *rq_qos_id_to_name(enum rq_qos_id id)
switch (id) {
case RQ_QOS_WBT:
return "wbt";
case RQ_QOS_CGROUP:
return "cgroup";
case RQ_QOS_LATENCY:
return "latency";
case RQ_QOS_COST:
return "cost";
}
return "unknown";
}
......@@ -135,7 +140,9 @@ void __rq_qos_issue(struct rq_qos *rqos, struct request *rq);
void __rq_qos_requeue(struct rq_qos *rqos, struct request *rq);
void __rq_qos_throttle(struct rq_qos *rqos, struct bio *bio);
void __rq_qos_track(struct rq_qos *rqos, struct request *rq, struct bio *bio);
void __rq_qos_merge(struct rq_qos *rqos, struct request *rq, struct bio *bio);
void __rq_qos_done_bio(struct rq_qos *rqos, struct bio *bio);
void __rq_qos_queue_depth_changed(struct rq_qos *rqos);
static inline void rq_qos_cleanup(struct request_queue *q, struct bio *bio)
{
......@@ -185,6 +192,19 @@ static inline void rq_qos_track(struct request_queue *q, struct request *rq,
__rq_qos_track(q->rq_qos, rq, bio);
}
static inline void rq_qos_merge(struct request_queue *q, struct request *rq,
struct bio *bio)
{
if (q->rq_qos)
__rq_qos_merge(q->rq_qos, rq, bio);
}
static inline void rq_qos_queue_depth_changed(struct request_queue *q)
{
if (q->rq_qos)
__rq_qos_queue_depth_changed(q->rq_qos);
}
void rq_qos_exit(struct request_queue *);
#endif
......@@ -805,7 +805,7 @@ EXPORT_SYMBOL(blk_queue_update_dma_alignment);
void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
{
q->queue_depth = depth;
wbt_set_queue_depth(q, depth);
rq_qos_queue_depth_changed(q);
}
EXPORT_SYMBOL(blk_set_queue_depth);
......@@ -832,6 +832,22 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
}
EXPORT_SYMBOL_GPL(blk_queue_write_cache);
/**
* blk_queue_required_elevator_features - Set a queue required elevator features
* @q: the request queue for the target device
* @features: Required elevator features OR'ed together
*
* Tell the block layer that for the device controlled through @q, only the
* only elevators that can be used are those that implement at least the set of
* features specified by @features.
*/
void blk_queue_required_elevator_features(struct request_queue *q,
unsigned int features)
{
q->required_elevator_features = features;
}
EXPORT_SYMBOL_GPL(blk_queue_required_elevator_features);
static int __init blk_settings_init(void)
{
blk_max_low_pfn = max_low_pfn - 1;
......
......@@ -941,14 +941,14 @@ int blk_register_queue(struct gendisk *disk)
int ret;
struct device *dev = disk_to_dev(disk);
struct request_queue *q = disk->queue;
bool has_elevator = false;
if (WARN_ON(!q))
return -ENXIO;
WARN_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags),
WARN_ONCE(blk_queue_registered(q),
"%s is registering an already registered queue\n",
kobject_name(&dev->kobj));
blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
/*
* SCSI probing may synchronously create and destroy a lot of
......@@ -968,8 +968,7 @@ int blk_register_queue(struct gendisk *disk)
if (ret)
return ret;
/* Prevent changes through sysfs until registration is completed. */
mutex_lock(&q->sysfs_lock);
mutex_lock(&q->sysfs_dir_lock);
ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
if (ret < 0) {
......@@ -990,26 +989,36 @@ int blk_register_queue(struct gendisk *disk)
blk_mq_debugfs_register(q);
}
kobject_uevent(&q->kobj, KOBJ_ADD);
wbt_enable_default(q);
blk_throtl_register_queue(q);
/*
* The flag of QUEUE_FLAG_REGISTERED isn't set yet, so elevator
* switch won't happen at all.
*/
if (q->elevator) {
ret = elv_register_queue(q);
ret = elv_register_queue(q, false);
if (ret) {
mutex_unlock(&q->sysfs_lock);
kobject_uevent(&q->kobj, KOBJ_REMOVE);
mutex_unlock(&q->sysfs_dir_lock);
kobject_del(&q->kobj);
blk_trace_remove_sysfs(dev);
kobject_put(&dev->kobj);
return ret;
}
has_elevator = true;
}
mutex_lock(&q->sysfs_lock);
blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
wbt_enable_default(q);
blk_throtl_register_queue(q);
/* Now everything is ready and send out KOBJ_ADD uevent */
kobject_uevent(&q->kobj, KOBJ_ADD);
if (has_elevator)
kobject_uevent(&q->elevator->kobj, KOBJ_ADD);
mutex_unlock(&q->sysfs_lock);
ret = 0;
unlock:
mutex_unlock(&q->sysfs_lock);
mutex_unlock(&q->sysfs_dir_lock);
return ret;
}
EXPORT_SYMBOL_GPL(blk_register_queue);
......@@ -1029,7 +1038,7 @@ void blk_unregister_queue(struct gendisk *disk)
return;
/* Return early if disk->queue was never registered. */
if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
if (!blk_queue_registered(q))
return;
/*
......@@ -1038,25 +1047,28 @@ void blk_unregister_queue(struct gendisk *disk)
* concurrent elv_iosched_store() calls.
*/
mutex_lock(&q->sysfs_lock);
blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
mutex_unlock(&q->sysfs_lock);
mutex_lock(&q->sysfs_dir_lock);
/*
* Remove the sysfs attributes before unregistering the queue data
* structures that can be modified through sysfs.
*/
if (queue_is_mq(q))
blk_mq_unregister_dev(disk_to_dev(disk), q);
mutex_unlock(&q->sysfs_lock);
kobject_uevent(&q->kobj, KOBJ_REMOVE);
kobject_del(&q->kobj);
blk_trace_remove_sysfs(disk_to_dev(disk));
mutex_lock(&q->sysfs_lock);
/*
* q->kobj has been removed, so it is safe to check if elevator
* exists without holding q->sysfs_lock.
*/
if (q->elevator)
elv_unregister_queue(q);
mutex_unlock(&q->sysfs_lock);
mutex_unlock(&q->sysfs_dir_lock);
kobject_put(&disk_to_dev(disk)->kobj);
}
......@@ -478,12 +478,14 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq)
timer_setup(&sq->pending_timer, throtl_pending_timer_fn, 0);
}
static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp,
struct request_queue *q,
struct blkcg *blkcg)
{
struct throtl_grp *tg;
int rw;
tg = kzalloc_node(sizeof(*tg), gfp, node);
tg = kzalloc_node(sizeof(*tg), gfp, q->node);
if (!tg)
return NULL;
......@@ -2246,7 +2248,8 @@ void blk_throtl_stat_add(struct request *rq, u64 time_ns)
struct request_queue *q = rq->q;
struct throtl_data *td = q->td;
throtl_track_latency(td, rq->throtl_size, req_op(rq), time_ns >> 10);
throtl_track_latency(td, blk_rq_stats_sectors(rq), req_op(rq),
time_ns >> 10);
}
void blk_throtl_bio_endio(struct bio *bio)
......
......@@ -629,15 +629,6 @@ static void wbt_requeue(struct rq_qos *rqos, struct request *rq)
}
}
void wbt_set_queue_depth(struct request_queue *q, unsigned int depth)
{
struct rq_qos *rqos = wbt_rq_qos(q);
if (rqos) {
RQWB(rqos)->rq_depth.queue_depth = depth;
__wbt_update_limits(RQWB(rqos));
}
}
void wbt_set_write_cache(struct request_queue *q, bool write_cache_on)
{
struct rq_qos *rqos = wbt_rq_qos(q);
......@@ -656,7 +647,7 @@ void wbt_enable_default(struct request_queue *q)
return;
/* Queue not registered? Maybe shutting down... */
if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
if (!blk_queue_registered(q))
return;
if (queue_is_mq(q) && IS_ENABLED(CONFIG_BLK_WBT_MQ))
......@@ -689,6 +680,12 @@ static int wbt_data_dir(const struct request *rq)
return -1;
}
static void wbt_queue_depth_changed(struct rq_qos *rqos)
{
RQWB(rqos)->rq_depth.queue_depth = blk_queue_depth(rqos->q);
__wbt_update_limits(RQWB(rqos));
}
static void wbt_exit(struct rq_qos *rqos)
{
struct rq_wb *rwb = RQWB(rqos);
......@@ -811,6 +808,7 @@ static struct rq_qos_ops wbt_rqos_ops = {
.requeue = wbt_requeue,
.done = wbt_done,
.cleanup = wbt_cleanup,
.queue_depth_changed = wbt_queue_depth_changed,
.exit = wbt_exit,
#ifdef CONFIG_BLK_DEBUG_FS
.debugfs_attrs = wbt_debugfs_attrs,
......@@ -853,7 +851,7 @@ int wbt_init(struct request_queue *q)
rwb->min_lat_nsec = wbt_default_latency_nsec(q);
wbt_set_queue_depth(q, blk_queue_depth(q));
wbt_queue_depth_changed(&rwb->rqos);
wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
return 0;
......
......@@ -95,7 +95,6 @@ void wbt_enable_default(struct request_queue *);
u64 wbt_get_min_lat(struct request_queue *q);
void wbt_set_min_lat(struct request_queue *q, u64 val);
void wbt_set_queue_depth(struct request_queue *, unsigned int);
void wbt_set_write_cache(struct request_queue *, bool);
u64 wbt_default_latency_nsec(struct request_queue *);
......@@ -118,9 +117,6 @@ static inline void wbt_disable_default(struct request_queue *q)
static inline void wbt_enable_default(struct request_queue *q)
{
}
static inline void wbt_set_queue_depth(struct request_queue *q, unsigned int depth)
{
}
static inline void wbt_set_write_cache(struct request_queue *q, bool wc)
{
}
......
......@@ -202,6 +202,42 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL_GPL(blkdev_report_zones);
/*
* Special case of zone reset operation to reset all zones in one command,
* useful for applications like mkfs.
*/
static int __blkdev_reset_all_zones(struct block_device *bdev, gfp_t gfp_mask)
{
struct bio *bio = bio_alloc(gfp_mask, 0);
int ret;
/* across the zones operations, don't need any sectors */
bio_set_dev(bio, bdev);
bio_set_op_attrs(bio, REQ_OP_ZONE_RESET_ALL, 0);
ret = submit_bio_wait(bio);
bio_put(bio);
return ret;
}
static inline bool blkdev_allow_reset_all_zones(struct block_device *bdev,
sector_t nr_sectors)
{
if (!blk_queue_zone_resetall(bdev_get_queue(bdev)))
return false;
if (nr_sectors != part_nr_sects_read(bdev->bd_part))
return false;
/*
* REQ_OP_ZONE_RESET_ALL can be executed only if the block device is
* the entire disk, that is, if the blocks device start offset is 0 and
* its capacity is the same as the entire disk.
*/
return get_start_sect(bdev) == 0 &&
part_nr_sects_read(bdev->bd_part) == get_capacity(bdev->bd_disk);
}
/**
* blkdev_reset_zones - Reset zones write pointer
* @bdev: Target block device
......@@ -235,6 +271,9 @@ int blkdev_reset_zones(struct block_device *bdev,
/* Out of range */
return -EINVAL;
if (blkdev_allow_reset_all_zones(bdev, nr_sectors))
return __blkdev_reset_all_zones(bdev, gfp_mask);
/* Check alignment (handle eventual smaller last zone) */
zone_sectors = blk_queue_zone_sectors(q);
if (sector & (zone_sectors - 1))
......
......@@ -184,11 +184,11 @@ void blk_account_io_done(struct request *req, u64 now);
void blk_insert_flush(struct request *rq);
int elevator_init_mq(struct request_queue *q);
void elevator_init_mq(struct request_queue *q);
int elevator_switch_mq(struct request_queue *q,
struct elevator_type *new_e);
void __elevator_exit(struct request_queue *, struct elevator_queue *);
int elv_register_queue(struct request_queue *q);
int elv_register_queue(struct request_queue *q, bool uevent);
void elv_unregister_queue(struct request_queue *q);
static inline void elevator_exit(struct request_queue *q,
......
This diff is collapsed.
......@@ -695,6 +695,15 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
dev_t devt;
int retval;
/*
* The disk queue should now be all set with enough information about
* the device for the elevator code to pick an adequate default
* elevator if one is needed, that is, for devices requesting queue
* registration.
*/
if (register_queue)
elevator_init_mq(disk->queue);
/* minors == 0 indicates to use ext devt from part0 and should
* be accompanied with EXT_DEVT flag. Make sure all
* parameters make sense.
......
......@@ -377,13 +377,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
* hardware queue, but we may return a request that is for a
* different hardware queue. This is because mq-deadline has shared
* state for all hardware queues, in terms of sorting, FIFOs, etc.
*
* For a zoned block device, __dd_dispatch_request() may return NULL
* if all the queued write requests are directed at zones that are already
* locked due to on-going write requests. In this case, make sure to mark
* the queue as needing a restart to ensure that the queue is run again
* and the pending writes dispatched once the target zones for the ongoing
* write requests are unlocked in dd_finish_request().
*/
static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
......@@ -392,9 +385,6 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
spin_lock(&dd->lock);
rq = __dd_dispatch_request(dd);
if (!rq && blk_queue_is_zoned(hctx->queue) &&
!list_empty(&dd->fifo_list[WRITE]))
blk_mq_sched_mark_restart_hctx(hctx);
spin_unlock(&dd->lock);
return rq;
......@@ -561,6 +551,13 @@ static void dd_prepare_request(struct request *rq, struct bio *bio)
* spinlock so that the zone is never unlocked while deadline_fifo_request()
* or deadline_next_request() are executing. This function is called for
* all requests, whether or not these requests complete successfully.
*
* For a zoned block device, __dd_dispatch_request() may have stopped
* dispatching requests if all the queued requests are write requests directed
* at zones that are already locked due to on-going write requests. To ensure
* write request dispatch progress in this case, mark the queue as needing a
* restart to ensure that the queue is run again after completion of the
* request and zones being unlocked.
*/
static void dd_finish_request(struct request *rq)
{
......@@ -572,6 +569,8 @@ static void dd_finish_request(struct request *rq)
spin_lock_irqsave(&dd->zone_lock, flags);
blk_req_zone_write_unlock(rq);
if (!list_empty(&dd->fifo_list[WRITE]))
blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
spin_unlock_irqrestore(&dd->zone_lock, flags);
}
}
......@@ -795,6 +794,7 @@ static struct elevator_type mq_deadline = {
.elevator_attrs = deadline_attrs,
.elevator_name = "mq-deadline",
.elevator_alias = "deadline",
.elevator_features = ELEVATOR_F_ZBD_SEQ_WRITE,
.elevator_owner = THIS_MODULE,
};
MODULE_ALIAS("mq-deadline-iosched");
......
......@@ -119,8 +119,6 @@ enum opal_uid {
OPAL_UID_HEXFF,
};
#define OPAL_METHOD_LENGTH 8
/* Enum for indexing the OPALMETHOD array */
enum opal_method {
OPAL_PROPERTIES,
......@@ -167,7 +165,6 @@ enum opal_token {
OPAL_TABLE_LASTID = 0x0A,
OPAL_TABLE_MIN = 0x0B,
OPAL_TABLE_MAX = 0x0C,
/* authority table */
OPAL_PIN = 0x03,
/* locking tokens */
......@@ -182,7 +179,7 @@ enum opal_token {
OPAL_LIFECYCLE = 0x06,
/* locking info table */
OPAL_MAXRANGES = 0x04,
/* mbr control */
/* mbr control */
OPAL_MBRENABLE = 0x01,
OPAL_MBRDONE = 0x02,
/* properties */
......
This diff is collapsed.
......@@ -3780,7 +3780,7 @@ static int compat_getdrvprm(int drive,
v.native_format = UDP->native_format;
mutex_unlock(&floppy_mutex);
if (copy_from_user(arg, &v, sizeof(struct compat_floppy_drive_params)))
if (copy_to_user(arg, &v, sizeof(struct compat_floppy_drive_params)))
return -EFAULT;
return 0;
}
......@@ -3816,7 +3816,7 @@ static int compat_getdrvstat(int drive, bool poll,
v.bufblocks = UDRS->bufblocks;
mutex_unlock(&floppy_mutex);
if (copy_from_user(arg, &v, sizeof(struct compat_floppy_drive_struct)))
if (copy_to_user(arg, &v, sizeof(struct compat_floppy_drive_struct)))
return -EFAULT;
return 0;
Eintr:
......
......@@ -1755,6 +1755,7 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
case LOOP_SET_FD:
case LOOP_CHANGE_FD:
case LOOP_SET_BLOCK_SIZE:
case LOOP_SET_DIRECT_IO:
err = lo_ioctl(bdev, mode, cmd, arg);
break;
default:
......
This diff is collapsed.
......@@ -2,6 +2,9 @@
#ifndef __BLK_NULL_BLK_H
#define __BLK_NULL_BLK_H
#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/blk-mq.h>
......@@ -90,13 +93,13 @@ int null_zone_init(struct nullb_device *dev);
void null_zone_exit(struct nullb_device *dev);
int null_zone_report(struct gendisk *disk, sector_t sector,
struct blk_zone *zones, unsigned int *nr_zones);
void null_zone_write(struct nullb_cmd *cmd, sector_t sector,
unsigned int nr_sectors);
void null_zone_reset(struct nullb_cmd *cmd, sector_t sector);
blk_status_t null_handle_zoned(struct nullb_cmd *cmd,
enum req_opf op, sector_t sector,
sector_t nr_sectors);
#else
static inline int null_zone_init(struct nullb_device *dev)
{
pr_err("null_blk: CONFIG_BLK_DEV_ZONED not enabled\n");
pr_err("CONFIG_BLK_DEV_ZONED not enabled\n");
return -EINVAL;
}
static inline void null_zone_exit(struct nullb_device *dev) {}
......@@ -106,10 +109,11 @@ static inline int null_zone_report(struct gendisk *disk, sector_t sector,
{
return -EOPNOTSUPP;
}
static inline void null_zone_write(struct nullb_cmd *cmd, sector_t sector,
unsigned int nr_sectors)
static inline blk_status_t null_handle_zoned(struct nullb_cmd *cmd,
enum req_opf op, sector_t sector,
sector_t nr_sectors)
{
return BLK_STS_NOTSUPP;
}
static inline void null_zone_reset(struct nullb_cmd *cmd, sector_t sector) {}
#endif /* CONFIG_BLK_DEV_ZONED */
#endif /* __NULL_BLK_H */
This diff is collapsed.
......@@ -17,7 +17,7 @@ int null_zone_init(struct nullb_device *dev)
unsigned int i;
if (!is_power_of_2(dev->zone_size)) {
pr_err("null_blk: zone_size must be power-of-two\n");
pr_err("zone_size must be power-of-two\n");
return -EINVAL;
}
......@@ -31,7 +31,7 @@ int null_zone_init(struct nullb_device *dev)
if (dev->zone_nr_conv >= dev->nr_zones) {
dev->zone_nr_conv = dev->nr_zones - 1;
pr_info("null_blk: changed the number of conventional zones to %u",
pr_info("changed the number of conventional zones to %u",
dev->zone_nr_conv);
}
......@@ -84,7 +84,7 @@ int null_zone_report(struct gendisk *disk, sector_t sector,
return 0;
}
void null_zone_write(struct nullb_cmd *cmd, sector_t sector,
static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
unsigned int nr_sectors)
{
struct nullb_device *dev = cmd->nq->dev;
......@@ -95,14 +95,12 @@ void null_zone_write(struct nullb_cmd *cmd, sector_t sector,
case BLK_ZONE_COND_FULL:
/* Cannot write to a full zone */
cmd->error = BLK_STS_IOERR;
break;
return BLK_STS_IOERR;
case BLK_ZONE_COND_EMPTY:
case BLK_ZONE_COND_IMP_OPEN:
/* Writes must be at the write pointer position */
if (sector != zone->wp) {
cmd->error = BLK_STS_IOERR;
break;
}
if (sector != zone->wp)
return BLK_STS_IOERR;
if (zone->cond == BLK_ZONE_COND_EMPTY)
zone->cond = BLK_ZONE_COND_IMP_OPEN;
......@@ -115,22 +113,51 @@ void null_zone_write(struct nullb_cmd *cmd, sector_t sector,
break;
default:
/* Invalid zone condition */
cmd->error = BLK_STS_IOERR;
break;
return BLK_STS_IOERR;
}
return BLK_STS_OK;
}
void null_zone_reset(struct nullb_cmd *cmd, sector_t sector)
static blk_status_t null_zone_reset(struct nullb_cmd *cmd, sector_t sector)
{
struct nullb_device *dev = cmd->nq->dev;
unsigned int zno = null_zone_no(dev, sector);
struct blk_zone *zone = &dev->zones[zno];
size_t i;
switch (req_op(cmd->rq)) {
case REQ_OP_ZONE_RESET_ALL:
for (i = 0; i < dev->nr_zones; i++) {
if (zone[i].type == BLK_ZONE_TYPE_CONVENTIONAL)
continue;
zone[i].cond = BLK_ZONE_COND_EMPTY;
zone[i].wp = zone[i].start;
}
break;
case REQ_OP_ZONE_RESET:
if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
return BLK_STS_IOERR;
if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) {
cmd->error = BLK_STS_IOERR;
return;
zone->cond = BLK_ZONE_COND_EMPTY;
zone->wp = zone->start;
break;
default:
cmd->error = BLK_STS_NOTSUPP;
break;
}
return BLK_STS_OK;
}
zone->cond = BLK_ZONE_COND_EMPTY;
zone->wp = zone->start;
blk_status_t null_handle_zoned(struct nullb_cmd *cmd, enum req_opf op,
sector_t sector, sector_t nr_sectors)
{
switch (op) {
case REQ_OP_WRITE:
return null_zone_write(cmd, sector, nr_sectors);
case REQ_OP_ZONE_RESET:
case REQ_OP_ZONE_RESET_ALL:
return null_zone_reset(cmd, sector);
default:
return BLK_STS_OK;
}
}
......@@ -314,8 +314,8 @@ static void pcd_init_units(void)
disk->queue = blk_mq_init_sq_queue(&cd->tag_set, &pcd_mq_ops,
1, BLK_MQ_F_SHOULD_MERGE);
if (IS_ERR(disk->queue)) {
put_disk(disk);
disk->queue = NULL;
put_disk(disk);
continue;
}
......@@ -723,9 +723,9 @@ static int pcd_detect(void)
k = 0;
if (pcd_drive_count == 0) { /* nothing spec'd - so autoprobe for 1 */
cd = pcd;
if (pi_init(cd->pi, 1, -1, -1, -1, -1, -1, pcd_buffer,
PI_PCD, verbose, cd->name)) {
if (!pcd_probe(cd, -1, id) && cd->disk) {
if (cd->disk && pi_init(cd->pi, 1, -1, -1, -1, -1, -1,
pcd_buffer, PI_PCD, verbose, cd->name)) {
if (!pcd_probe(cd, -1, id)) {
cd->present = 1;
k++;
} else
......@@ -736,11 +736,13 @@ static int pcd_detect(void)
int *conf = *drives[unit];
if (!conf[D_PRT])
continue;
if (!cd->disk)
continue;
if (!pi_init(cd->pi, 0, conf[D_PRT], conf[D_MOD],
conf[D_UNI], conf[D_PRO], conf[D_DLY],
pcd_buffer, PI_PCD, verbose, cd->name))
continue;
if (!pcd_probe(cd, conf[D_SLV], id) && cd->disk) {
if (!pcd_probe(cd, conf[D_SLV], id)) {
cd->present = 1;
k++;
} else
......
......@@ -300,8 +300,8 @@ static void __init pf_init_units(void)
disk->queue = blk_mq_init_sq_queue(&pf->tag_set, &pf_mq_ops,
1, BLK_MQ_F_SHOULD_MERGE);
if (IS_ERR(disk->queue)) {
put_disk(disk);
disk->queue = NULL;
put_disk(disk);
continue;
}
......
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
......@@ -105,8 +105,14 @@ struct closure_syncer {
static void closure_sync_fn(struct closure *cl)
{
cl->s->done = 1;
wake_up_process(cl->s->task);
struct closure_syncer *s = cl->s;
struct task_struct *p;
rcu_read_lock();
p = READ_ONCE(s->task);
s->done = 1;
wake_up_process(p);
rcu_read_unlock();
}
void __sched __closure_sync(struct closure *cl)
......
......@@ -178,10 +178,9 @@ static ssize_t bch_dump_read(struct file *file, char __user *buf,
while (size) {
struct keybuf_key *w;
unsigned int bytes = min(i->bytes, size);
int err = copy_to_user(buf, i->buf, bytes);
if (err)
return err;
if (copy_to_user(buf, i->buf, bytes))
return -EFAULT;
ret += bytes;
buf += bytes;
......
......@@ -964,6 +964,7 @@ KTYPE(bch_cache_set_internal);
static int __bch_cache_cmp(const void *l, const void *r)
{
cond_resched();
return *((uint16_t *)r) - *((uint16_t *)l);
}
......
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment