Commit 69475292 authored by Linus Torvalds

Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:

 - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
   was initially a fork of CFQ, but subsequently changed to implement
   fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
   to be used on desktop type single drives, providing good fairness.
   From Paolo.

 - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
   using a scalable token based algorithm that throttles IO based on
   live completion IO stats, similarly to blk-wbt. From Omar. (A short
   sketch of switching a device to one of the new schedulers follows
   this list.)

 - A series from Jan, moving users to separately allocated backing
   devices. This continues the work of separating backing device
   lifetimes, solving various problems with hot removal.

 - A series of updates for lightnvm, mostly from Javier. Includes a
   'pblk' target that exposes an open channel SSD as a physical block
   device.

 - A series of fixes and improvements for nbd from Josef.

 - A series from Omar, removing queue sharing between devices on mostly
   legacy drivers. This helps us clean up other bits, if we know that a
   queue only has a single device backing. This has been overdue for
   more than a decade.

 - Fixes for the blk-stats, and improvements to unify the stats and user
   windows. This both improves blk-wbt, and enables other users to
   register a need to receive IO stats for a device. From Omar.

 - blk-throttle improvements from Shaohua. This provides a scalable
   framework for implementing prioritization - particularly for blk-mq,
   but applicable to any type of block device. The interface is marked
   experimental for now.

 - Bucketized IO stats for IO polling from Stephen Bates. This improves
   efficiency of polled workloads in the presence of mixed block size
   IO.

 - A few fixes for opal, from Scott.

 - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
   From a variety of folks, mostly Sagi and James Smart.

 - A series from Bart, improving our exposed info and capabilities from
   the blk-mq debugfs support.

 - A series from Christoph, cleaning up how we handle WRITE_ZEROES.

 - A series from Christoph, cleaning up the block layer handling of how
   we track errors in a request. On top of being a nice cleanup, it also
   shrinks the size of struct request a bit.

 - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
   never used by platforms, and the latter has outlived its usefulness.

 - Various little bug fixes and cleanups from a wide variety of folks.
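
Both new schedulers are selected per device through the long-standing
/sys/block/<disk>/queue/scheduler attribute. A minimal userspace sketch
(the disk name "sda" and the scheduler name "kyber" are placeholders, not
part of this pull; needs root, and the scheduler must be built in or
loaded as a module):

	/* Write a scheduler name into the device's queue/scheduler file. */
	#include <errno.h>
	#include <stdio.h>
	#include <string.h>

	static int set_scheduler(const char *disk, const char *sched)
	{
		char path[256];
		FILE *f;
		int ret = 0;

		snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", disk);
		f = fopen(path, "w");
		if (!f) {
			fprintf(stderr, "open %s: %s\n", path, strerror(errno));
			return -1;
		}
		if (fprintf(f, "%s\n", sched) < 0)
			ret = -1;
		if (fclose(f) != 0)
			ret = -1;
		return ret;
	}

	int main(void)
	{
		return set_scheduler("sda", "kyber") ? 1 : 0;
	}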

* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
  block: hide badblocks attribute by default
  blk-mq: unify hctx delay_work and run_work
  block: add kblock_mod_delayed_work_on()
  blk-mq: unify hctx delayed_run_work and run_work
  nbd: fix use after free on module unload
  MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
  blk-mq-sched: alloate reserved tags out of normal pool
  mtip32xx: use runtime tag to initialize command header
  scsi: Implement blk_mq_ops.show_rq()
  blk-mq: Add blk_mq_ops.show_rq()
  blk-mq: Show operation, cmd_flags and rq_flags names
  blk-mq: Make blk_flags_show() callers append a newline character
  blk-mq: Move the "state" debugfs attribute one level down
  blk-mq: Unregister debugfs attributes earlier
  blk-mq: Only unregister hctxs for which registration succeeded
  blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
  blk-mq: Let blk_mq_debugfs_register() look up the queue name
  blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
  ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
  ide-pm: always pass 0 error to __blk_end_request_all
  ..
parents a351e9b9 9438b3e0
@@ -213,14 +213,8 @@ What:		/sys/block/<disk>/queue/discard_zeroes_data
 Date:		May 2011
 Contact:	Martin K. Petersen <martin.petersen@oracle.com>
 Description:
-		Devices that support discard functionality may return
-		stale or random data when a previously discarded block
-		is read back. This can cause problems if the filesystem
-		expects discarded blocks to be explicitly cleared. If a
-		device reports that it deterministically returns zeroes
-		when a discarded area is read the discard_zeroes_data
-		parameter will be set to one. Otherwise it will be 0 and
-		the result of reading a discarded area is undefined.
+		Will always return 0. Don't rely on any specific behavior
+		for discards, and don't read this file.

 What:		/sys/block/<disk>/queue/write_same_max_bytes
 Date:		January 2012
 00-INDEX
 	- This file
+bfq-iosched.txt
+	- BFQ IO scheduler and its tunables
 biodoc.txt
 	- Notes on the Generic Block Layer Rewrite in Linux 2.5
 biovecs.txt
Kyber I/O scheduler tunables
============================

The only two tunables for the Kyber scheduler are the target latencies for
reads and synchronous writes. Kyber will throttle requests in order to meet
these target latencies.

read_lat_nsec
-------------
Target latency for reads (in nanoseconds).

write_lat_nsec
--------------
Target latency for synchronous writes (in nanoseconds).
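
These tunables appear per device under the active scheduler's sysfs
directory (typically /sys/block/<disk>/queue/iosched/) once Kyber is
selected. The following is a toy illustration only, not Kyber's actual
code: it sketches the general idea of scaling an allowed queue depth from
observed completion latency versus a target.

	/* Toy latency-feedback depth scaling; all numbers are made up. */
	#include <stdio.h>

	struct depth_ctl {
		unsigned int depth;		/* currently allowed in-flight requests */
		unsigned int min_depth;
		unsigned int max_depth;
		unsigned long long target_lat_ns;
	};

	static void adjust_depth(struct depth_ctl *ctl, unsigned long long lat_ns)
	{
		if (lat_ns > ctl->target_lat_ns && ctl->depth > ctl->min_depth)
			ctl->depth /= 2;	/* over target: back off quickly */
		else if (lat_ns < ctl->target_lat_ns / 2 && ctl->depth < ctl->max_depth)
			ctl->depth++;		/* well under target: probe upward */
	}

	int main(void)
	{
		struct depth_ctl ctl = {
			.depth = 64, .min_depth = 1, .max_depth = 256,
			.target_lat_ns = 2000000,	/* 2 ms target, as a read_lat_nsec might be */
		};
		unsigned long long samples[] = { 1500000, 3000000, 900000, 5000000 };

		for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
			adjust_depth(&ctl, samples[i]);
			printf("latency %llu ns -> depth %u\n", samples[i], ctl.depth);
		}
		return 0;
	}
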
@@ -43,11 +43,6 @@ large discards are issued, setting this value lower will make Linux issue
 smaller discards and potentially help reduce latencies induced by large
 discard operations.

-discard_zeroes_data (RO)
-------------------------
-When read, this file will show if the discarded block are zeroed by the
-device or not. If its value is '1' the blocks are zeroed otherwise not.

 hw_sector_size (RO)
 -------------------
 This is the hardware sector size of the device, in bytes.

@@ -192,5 +187,11 @@ scaling back writes. Writing a value of '0' to this file disables the
 feature. Writing a value of '-1' to this file resets the value to the
 default setting.

+throttle_sample_time (RW)
+-------------------------
+This is the time window that blk-throttle samples data, in milliseconds.
+blk-throttle makes decisions based on these samples. A lower value means
+cgroups see smoother throughput, at the cost of higher CPU overhead. This
+file exists only when CONFIG_BLK_DEV_THROTTLING_LOW is enabled.

 Jens Axboe <jens.axboe@oracle.com>, February 2009
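
The throttle_sample_time attribute documented above can be read and tuned
from userspace. A minimal sketch (the disk name "sda" and the 50 ms value
are placeholders, and the file only exists when
CONFIG_BLK_DEV_THROTTLING_LOW is enabled):

	/* Read the current sample window, then shrink it to 50 ms. */
	#include <stdio.h>

	int main(void)
	{
		const char *path = "/sys/block/sda/queue/throttle_sample_time";
		unsigned int ms = 0;
		FILE *f;

		f = fopen(path, "r");
		if (f) {
			if (fscanf(f, "%u", &ms) == 1)
				printf("current sample time: %u ms\n", ms);
			fclose(f);
		}

		f = fopen(path, "w");
		if (!f)
			return 1;
		fprintf(f, "%u\n", 50);	/* smaller window: smoother, more CPU */
		fclose(f);
		return 0;
	}
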
This document describes m[g]flash support in linux.

Contents

  1. Overview
  2. Reserved area configuration
  3. Example of mflash platform driver registration

1. Overview

Mflash and gflash are embedded flash drives. The only difference is that
mflash is an MCP (Multi Chip Package) device; the two devices operate in
exactly the same way, so "mflash" below refers to both mflash and gflash.

Internally, mflash contains NAND flash plus additional hardware logic and
supports two operation modes (ATA and IO). ATA mode doesn't need any new
driver and currently works well under the standard IDE subsystem; in effect
it is a one-chip SSD. IO mode is an ATA-like custom mode for hosts that
don't have an IDE interface.

Brief characteristics of IO mode:

A. IO mode is based on the ATA protocol and uses some custom commands
   (read confirm, write confirm).
B. IO mode uses an SRAM bus interface.
C. IO mode supports a 4kB boot area, so the host can boot from mflash.

2. Reserved area configuration

If the host boots from mflash, it usually needs a raw area for the boot
loader image. All of mflash's block device operations take this value as
their start offset. Note that the reserved-area size assumed by the boot
loader and the kernel configuration value must match.

3. Example of mflash platform driver registration

Getting mflash working is straightforward: all that is needed is to add the
platform device to the board configuration file. Here is a pseudo example.
static struct mg_drv_data mflash_drv_data = {
	/* set to 1 if you want the driver to poll instead of using the IRQ */
	.use_polling = 0,
	/* device attribute */
	.dev_attr = MG_BOOT_DEV
};

static struct resource mg_mflash_rsc[] = {
	/* Base address of mflash */
	[0] = {
		.start = 0x08000000,
		.end = 0x08000000 + SZ_64K - 1,
		.flags = IORESOURCE_MEM
	},
	/* mflash interrupt pin */
	[1] = {
		.start = IRQ_GPIO(84),
		.end = IRQ_GPIO(84),
		.flags = IORESOURCE_IRQ
	},
	/* mflash reset pin */
	[2] = {
		.start = 43,
		.end = 43,
		.name = MG_RST_PIN,
		.flags = IORESOURCE_IO
	},
	/* mflash reset-out pin
	 * If you use mflash as a storage device (i.e. other than MG_BOOT_DEV),
	 * this must be assigned. */
	[3] = {
		.start = 51,
		.end = 51,
		.name = MG_RSTOUT_PIN,
		.flags = IORESOURCE_IO
	}
};

static struct platform_device mflash_dev = {
	.name = MG_DEV_NAME,
	.id = -1,
	.dev = {
		.platform_data = &mflash_drv_data,
	},
	.num_resources = ARRAY_SIZE(mg_mflash_rsc),
	.resource = mg_mflash_rsc
};

platform_device_register(&mflash_dev);
pblk: Physical Block Device Target
==================================

pblk implements a fully associative, host-based FTL that exposes a traditional
block I/O interface. Its primary responsibilities are:

  - Map logical addresses onto physical addresses (4KB granularity) in a
    logical-to-physical (L2P) table (see the sketch after this document).
  - Maintain the integrity and consistency of the L2P table as well as its
    recovery from normal tear down and power outage.
  - Deal with controller- and media-specific constraints.
  - Handle I/O errors.
  - Implement garbage collection.
  - Maintain consistency across the I/O stack during synchronization points.

For more information please refer to:

  http://lightnvm.io

which maintains updated FAQs, manual pages, technical documentation, tools,
contacts, etc.
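
The L2P table named in the first responsibility above can be pictured as a
flat map from 4KB logical blocks to physical addresses. A minimal sketch,
not pblk's actual data structures (the names and the flat-array layout are
illustrative assumptions):

	/* Toy logical-to-physical map: one 64-bit physical address per LBA. */
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define PPA_EMPTY UINT64_MAX	/* logical block not mapped yet */

	struct l2p_table {
		uint64_t *ppa;		/* indexed by logical block address */
		uint64_t nr_lbas;
	};

	static struct l2p_table *l2p_create(uint64_t nr_lbas)
	{
		struct l2p_table *t = malloc(sizeof(*t));

		if (!t)
			return NULL;
		t->ppa = malloc(nr_lbas * sizeof(uint64_t));
		if (!t->ppa) {
			free(t);
			return NULL;
		}
		for (uint64_t i = 0; i < nr_lbas; i++)
			t->ppa[i] = PPA_EMPTY;
		t->nr_lbas = nr_lbas;
		return t;
	}

	/* On a write, the FTL picks a new physical address and updates the map. */
	static void l2p_update(struct l2p_table *t, uint64_t lba, uint64_t ppa)
	{
		if (lba < t->nr_lbas)
			t->ppa[lba] = ppa;
	}

	/* On a read, the FTL resolves the logical address to its current location. */
	static uint64_t l2p_lookup(const struct l2p_table *t, uint64_t lba)
	{
		return lba < t->nr_lbas ? t->ppa[lba] : PPA_EMPTY;
	}

	int main(void)
	{
		struct l2p_table *t = l2p_create(1024);

		if (!t)
			return 1;
		l2p_update(t, 42, 7);	/* LBA 42 now lives at physical address 7 */
		printf("lba 42 -> ppa %llu\n", (unsigned long long)l2p_lookup(t, 42));
		free(t->ppa);
		free(t);
		return 0;
	}
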
@@ -2544,6 +2544,14 @@ F:	block/
 F:	kernel/trace/blktrace.c
 F:	lib/sbitmap.c

+BFQ I/O SCHEDULER
+M:	Paolo Valente <paolo.valente@linaro.org>
+M:	Jens Axboe <axboe@kernel.dk>
+L:	linux-block@vger.kernel.org
+S:	Maintained
+F:	block/bfq-*
+F:	Documentation/block/bfq-iosched.txt
+
 BLOCK2MTD DRIVER
 M:	Joern Engel <joern@lazybastard.org>
 L:	linux-mtd@lists.infradead.org
@@ -115,6 +115,18 @@ config BLK_DEV_THROTTLING
 	  See Documentation/cgroups/blkio-controller.txt for more information.

+config BLK_DEV_THROTTLING_LOW
+	bool "Block throttling .low limit interface support (EXPERIMENTAL)"
+	depends on BLK_DEV_THROTTLING
+	default n
+	---help---
+	Add .low limit interface for block throttling. The low limit is a best
+	effort limit to prioritize cgroups. Depending on the setting, the limit
+	can be used to protect cgroups in terms of bandwidth/iops and better
+	utilize disk resource.
+
+	Note, this is an experimental interface and could be changed someday.
+
 config BLK_CMDLINE_PARSER
 	bool "Block device command line partition parser"
 	default n
@@ -40,6 +40,7 @@ config CFQ_GROUP_IOSCHED
 	  Enable group IO scheduling in CFQ.

 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
 	help

@@ -69,6 +70,35 @@ config MQ_IOSCHED_DEADLINE
 	---help---
 	  MQ version of the deadline IO scheduler.

+config MQ_IOSCHED_KYBER
+	tristate "Kyber I/O scheduler"
+	default y
+	---help---
+	  The Kyber I/O scheduler is a low-overhead scheduler suitable for
+	  multiqueue and other fast devices. Given target latencies for reads and
+	  synchronous writes, it will self-tune queue depths to achieve that
+	  goal.
+
+config IOSCHED_BFQ
+	tristate "BFQ I/O scheduler"
+	default n
+	---help---
+	  BFQ I/O scheduler for BLK-MQ. BFQ distributes the bandwidth of
+	  the device among all processes according to their weights,
+	  regardless of the device parameters and with any workload. It
+	  also guarantees a low latency to interactive and soft
+	  real-time applications. Details in
+	  Documentation/block/bfq-iosched.txt
+
+config BFQ_GROUP_IOSCHED
+	bool "BFQ hierarchical scheduling support"
+	depends on IOSCHED_BFQ && BLK_CGROUP
+	default n
+	---help---
+	  Enable hierarchical scheduling in BFQ, using the blkio
+	  (cgroups-v1) or io (cgroups-v2) controller.
+
 endmenu
 endif
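
The weight-based sharing described in the IOSCHED_BFQ help above boils
down to proportional bandwidth allocation among competing processes (or
groups, with BFQ_GROUP_IOSCHED). A toy calculation with made-up weights
and a made-up 400 MB/s device, purely to show the proportions:

	/* Each competitor gets bandwidth proportional to its weight. */
	#include <stdio.h>

	int main(void)
	{
		const char *name[] = { "backup", "database", "interactive" };
		unsigned int weight[] = { 100, 300, 600 };
		double device_mbps = 400.0;
		unsigned int total = 0;

		for (int i = 0; i < 3; i++)
			total += weight[i];

		for (int i = 0; i < 3; i++)
			printf("%-12s weight %3u -> ~%.0f MB/s\n",
			       name[i], weight[i], device_mbps * weight[i] / total);
		return 0;
	}
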
@@ -20,6 +20,9 @@ obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
+obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
+bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
+obj-$(CONFIG_IOSCHED_BFQ)	+= bfq.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
@@ -30,6 +30,7 @@
 #include <linux/cgroup.h>

 #include <trace/events/block.h>
+#include "blk.h"

 /*
  * Test patch to inline a certain number of bi_io_vec's inside the bio

@@ -427,7 +428,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 *   RETURNS:
 *   Pointer to new bio on success, NULL on failure.
 */
-struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
+struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
+			     struct bio_set *bs)
 {
 	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;

@@ -1824,6 +1826,11 @@ static inline bool bio_remaining_done(struct bio *bio)
 *   bio_endio() will end I/O on the whole bio. bio_endio() is the preferred
 *   way to end I/O on a bio. No one should call bi_end_io() directly on a
 *   bio unless they own it and thus know that it has an end_io function.
+ *
+ *   bio_endio() can be called several times on a bio that has been chained
+ *   using bio_chain(). The ->bi_end_io() function will only be called the
+ *   last time. At this point the BLK_TA_COMPLETE tracing event will be
+ *   generated if BIO_TRACE_COMPLETION is set.
 **/
 void bio_endio(struct bio *bio)
 {

@@ -1844,6 +1851,13 @@ void bio_endio(struct bio *bio)
 		goto again;
 	}

+	if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
+		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+					 bio, bio->bi_error);
+		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
+	}
+
+	blk_throtl_bio_endio(bio);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
@@ -1882,6 +1896,9 @@ struct bio *bio_split(struct bio *bio, int sectors,
 	bio_advance(bio, split->bi_iter.bi_size);

+	if (bio_flagged(bio, BIO_TRACE_COMPLETION))
+		bio_set_flag(split, BIO_TRACE_COMPLETION);
+
 	return split;
 }
 EXPORT_SYMBOL(bio_split);
@@ -772,6 +772,27 @@ struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkcg_gq *blkg,
 }
 EXPORT_SYMBOL_GPL(blkg_rwstat_recursive_sum);

+/* Performs queue bypass and policy enabled checks then looks up blkg. */
+static struct blkcg_gq *blkg_lookup_check(struct blkcg *blkcg,
+					  const struct blkcg_policy *pol,
+					  struct request_queue *q)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	lockdep_assert_held(q->queue_lock);
+
+	if (!blkcg_policy_enabled(q, pol))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	/*
+	 * This could be the first entry point of blkcg implementation and
+	 * we shouldn't allow anything to go through for a bypassing queue.
+	 */
+	if (unlikely(blk_queue_bypass(q)))
+		return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
+
+	return __blkg_lookup(blkcg, q, true /* update_hint */);
+}
+
 /**
  * blkg_conf_prep - parse and prepare for per-blkg config update
  * @blkcg: target block cgroup

@@ -789,6 +810,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 	__acquires(rcu) __acquires(disk->queue->queue_lock)
 {
 	struct gendisk *disk;
+	struct request_queue *q;
 	struct blkcg_gq *blkg;
 	struct module *owner;
 	unsigned int major, minor;

@@ -807,44 +829,95 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 	if (!disk)
 		return -ENODEV;
 	if (part) {
-		owner = disk->fops->owner;
-		put_disk(disk);
-		module_put(owner);
-		return -ENODEV;
+		ret = -ENODEV;
+		goto fail;
 	}

-	rcu_read_lock();
-	spin_lock_irq(disk->queue->queue_lock);
+	q = disk->queue;

-	if (blkcg_policy_enabled(disk->queue, pol))
-		blkg = blkg_lookup_create(blkcg, disk->queue);
-	else
-		blkg = ERR_PTR(-EOPNOTSUPP);
+	rcu_read_lock();
+	spin_lock_irq(q->queue_lock);

+	blkg = blkg_lookup_check(blkcg, pol, q);
 	if (IS_ERR(blkg)) {
 		ret = PTR_ERR(blkg);
-		rcu_read_unlock();
-		spin_unlock_irq(disk->queue->queue_lock);
-		owner = disk->fops->owner;
-		put_disk(disk);
-		module_put(owner);
-		/*
-		 * If queue was bypassing, we should retry. Do so after a
-		 * short msleep(). It isn't strictly necessary but queue
-		 * can be bypassing for some time and it's always nice to
-		 * avoid busy looping.
-		 */
-		if (ret == -EBUSY) {
-			msleep(10);
-			ret = restart_syscall();
-		}
-		return ret;
-	}
+		goto fail_unlock;
+	}
+
+	if (blkg)
+		goto success;
+
+	/*
+	 * Create blkgs walking down from blkcg_root to @blkcg, so that all
+	 * non-root blkgs have access to their parents.
+	 */
+	while (true) {
+		struct blkcg *pos = blkcg;
+		struct blkcg *parent;
+		struct blkcg_gq *new_blkg;
+
+		parent = blkcg_parent(blkcg);
+		while (parent && !__blkg_lookup(parent, q, false)) {
+			pos = parent;
+			parent = blkcg_parent(parent);
+		}
+
+		/* Drop locks to do new blkg allocation with GFP_KERNEL. */
+		spin_unlock_irq(q->queue_lock);
+		rcu_read_unlock();
+
+		new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
+		if (unlikely(!new_blkg)) {
+			ret = -ENOMEM;
+			goto fail;
+		}
+
+		rcu_read_lock();
+		spin_lock_irq(q->queue_lock);
+
+		blkg = blkg_lookup_check(pos, pol, q);
+		if (IS_ERR(blkg)) {
+			ret = PTR_ERR(blkg);
+			goto fail_unlock;
+		}
+
+		if (blkg) {
+			blkg_free(new_blkg);
+		} else {
+			blkg = blkg_create(pos, q, new_blkg);
+			if (unlikely(IS_ERR(blkg))) {
+				ret = PTR_ERR(blkg);
+				goto fail_unlock;
+			}
+		}
+
+		if (pos == blkcg)
+			goto success;
+	}
+
+success:
 	ctx->disk = disk;
 	ctx->blkg = blkg;
 	ctx->body = body;
 	return 0;
+
+fail_unlock:
+	spin_unlock_irq(q->queue_lock);
+	rcu_read_unlock();
+fail:
+	owner = disk->fops->owner;
+	put_disk(disk);
+	module_put(owner);
+	/*
+	 * If queue was bypassing, we should retry. Do so after a
+	 * short msleep(). It isn't strictly necessary but queue
+	 * can be bypassing for some time and it's always nice to
+	 * avoid busy looping.
+	 */
+	if (ret == -EBUSY) {
+		msleep(10);
+		ret = restart_syscall();
+	}
+	return ret;
 }
 EXPORT_SYMBOL_GPL(blkg_conf_prep);
@@ -268,10 +268,8 @@ void blk_sync_queue(struct request_queue *q)
 		struct blk_mq_hw_ctx *hctx;
 		int i;

-		queue_for_each_hw_ctx(q, hctx, i) {
-			cancel_work_sync(&hctx->run_work);
-			cancel_delayed_work_sync(&hctx->delay_work);
-		}
+		queue_for_each_hw_ctx(q, hctx, i)
+			cancel_delayed_work_sync(&hctx->run_work);
 	} else {
 		cancel_delayed_work_sync(&q->delay_work);
 	}

@@ -500,6 +498,13 @@ void blk_set_queue_dying(struct request_queue *q)
 	queue_flag_set(QUEUE_FLAG_DYING, q);
 	spin_unlock_irq(q->queue_lock);

+	/*
+	 * When queue DYING flag is set, we need to block new req
+	 * entering queue, so we call blk_freeze_queue_start() to
+	 * prevent I/O from crossing blk_queue_enter().
+	 */
+	blk_freeze_queue_start(q);
+
 	if (q->mq_ops)
 		blk_mq_wake_waiters(q);
 	else {

@@ -556,9 +561,13 @@ void blk_cleanup_queue(struct request_queue *q)
 	 * prevent that q->request_fn() gets invoked after draining finished.
 	 */
 	blk_freeze_queue(q);
-	spin_lock_irq(lock);
-	if (!q->mq_ops)
+	if (!q->mq_ops) {
+		spin_lock_irq(lock);
 		__blk_drain_queue(q, true);
+	} else {
+		blk_mq_debugfs_unregister_mq(q);
+		spin_lock_irq(lock);
+	}
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);

@@ -669,6 +678,15 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
 		if (nowait)
 			return -EBUSY;

+		/*
+		 * read pair of barrier in blk_freeze_queue_start(),
+		 * we need to order reading __PERCPU_REF_DEAD flag of
+		 * .q_usage_counter and reading .mq_freeze_depth or
+		 * queue dying flag, otherwise the following wait may
+		 * never return if the two reads are reordered.
+		 */
+		smp_rmb();
+
 		ret = wait_event_interruptible(q->mq_freeze_wq,
 				!atomic_read(&q->mq_freeze_depth) ||
 				blk_queue_dying(q));

@@ -720,6 +738,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (!q->backing_dev_info)
 		goto fail_split;

+	q->stats = blk_alloc_queue_stats();
+	if (!q->stats)
+		goto fail_stats;
+
 	q->backing_dev_info->ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
 	q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;

@@ -776,6 +798,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 fail_ref:
 	percpu_ref_exit(&q->q_usage_counter);
 fail_bdi:
+	blk_free_queue_stats(q->stats);
+fail_stats:
 	bdi_put(q->backing_dev_info);
 fail_split:
 	bioset_free(q->bio_split);

@@ -889,7 +913,6 @@ int blk_init_allocated_queue(struct request_queue *q)
 	q->exit_rq_fn(q, q->fq->flush_rq);
 out_free_flush_queue:
 	blk_free_flush_queue(q->fq);
-	wbt_exit(q);
 	return -ENOMEM;
 }
 EXPORT_SYMBOL(blk_init_allocated_queue);

@@ -1128,7 +1151,6 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 	blk_rq_init(q, rq);
 	blk_rq_set_rl(rq, rl);
-	blk_rq_set_prio(rq, ioc);
 	rq->cmd_flags = op;
 	rq->rq_flags = rq_flags;

@@ -1608,17 +1630,23 @@ unsigned int blk_plug_queued_count(struct request_queue *q)
 	return ret;
 }

-void init_request_from_bio(struct request *req, struct bio *bio)
+void blk_init_request_from_bio(struct request *req, struct bio *bio)
 {
+	struct io_context *ioc = rq_ioc(bio);
+
 	if (bio->bi_opf & REQ_RAHEAD)
 		req->cmd_flags |= REQ_FAILFAST_MASK;

-	req->errors = 0;
 	req->__sector = bio->bi_iter.bi_sector;
 	if (ioprio_valid(bio_prio(bio)))
 		req->ioprio = bio_prio(bio);
+	else if (ioc)
+		req->ioprio = ioc->ioprio;
+	else
+		req->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
 	blk_rq_bio_prep(req->q, req, bio);
 }
+EXPORT_SYMBOL_GPL(blk_init_request_from_bio);

 static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 {

@@ -1709,7 +1737,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 * We don't worry about that case for efficiency. It won't happen
 	 * often, and the elevators are able to handle it.
 	 */
-	init_request_from_bio(req, bio);
+	blk_init_request_from_bio(req, bio);

 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
 		req->cpu = raw_smp_processor_id();

@@ -1936,7 +1964,13 @@ generic_make_request_checks(struct bio *bio)
 	if (!blkcg_bio_issue_check(q, bio))
 		return false;

-	trace_block_bio_queue(q, bio);
+	if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) {
+		trace_block_bio_queue(q, bio);
+		/* Now that enqueuing has been traced, we need to trace
+		 * completion as well.
+		 */
+		bio_set_flag(bio, BIO_TRACE_COMPLETION);
+	}

 	return true;

 not_supported:

@@ -2478,7 +2512,7 @@ void blk_start_request(struct request *req)
 	blk_dequeue_request(req);

 	if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) {
-		blk_stat_set_issue_time(&req->issue_stat);
+		blk_stat_set_issue(&req->issue_stat, blk_rq_sectors(req));
 		req->rq_flags |= RQF_STATS;
 		wbt_issue(req->q->rq_wb, &req->issue_stat);
 	}

@@ -2540,22 +2574,11 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 {
 	int total_bytes;

-	trace_block_rq_complete(req->q, req, nr_bytes);
+	trace_block_rq_complete(req, error, nr_bytes);

 	if (!req->bio)
 		return false;

-	/*
-	 * For fs requests, rq is just carrier of independent bio's
-	 * and each partial completion should be handled separately.
-	 * Reset per-request error on each partial completion.
-	 *
-	 * TODO: tj: This is too subtle. It would be better to let
-	 * low level drivers do what they see fit.
-	 */
-	if (!blk_rq_is_passthrough(req))
-		req->errors = 0;
-
 	if (error && !blk_rq_is_passthrough(req) &&
 	    !(req->rq_flags & RQF_QUIET)) {
 		char *error_type;

@@ -2601,6 +2624,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 		if (bio_bytes == bio->bi_iter.bi_size)
 			req->bio = bio->bi_next;

+		/* Completion has already been traced */
+		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
 		req_bio_endio(req, bio, bio_bytes, error);

 		total_bytes += bio_bytes;

@@ -2699,7 +2724,7 @@ void blk_finish_request(struct request *req, int error)
 	struct request_queue *q = req->q;

 	if (req->rq_flags & RQF_STATS)
-		blk_stat_add(&q->rq_stats[rq_data_dir(req)], req);
+		blk_stat_add(req);

 	if (req->rq_flags & RQF_QUEUED)
 		blk_queue_end_tag(q, req);

@@ -2776,7 +2801,7 @@ static bool blk_end_bidi_request(struct request *rq, int error,
 *     %false - we are done with this request
 *     %true  - still buffers pending for this request
 **/
-bool __blk_end_bidi_request(struct request *rq, int error,
+static bool __blk_end_bidi_request(struct request *rq, int error,
 		     unsigned int nr_bytes, unsigned int bidi_bytes)
 {
 	if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes))

@@ -2828,43 +2853,6 @@ void blk_end_request_all(struct request *rq, int error)
 }
 EXPORT_SYMBOL(blk_end_request_all);

-/**
- * blk_end_request_cur - Helper function to finish the current request chunk.
- * @rq: the request to finish the current chunk for
- * @error: %0 for success, < %0 for error
- *
- * Description:
- *     Complete the current consecutively mapped chunk from @rq.
- *
- * Return:
- *     %false - we are done with this request
- *     %true  - still buffers pending for this request
- */
-bool blk_end_request_cur(struct request *rq, int error)
-{
-	return blk_end_request(rq, error, blk_rq_cur_bytes(rq));
-}
-EXPORT_SYMBOL(blk_end_request_cur);
-
-/**
- * blk_end_request_err - Finish a request till the next failure boundary.
- * @rq: the request to finish till the next failure boundary for
- * @error: must be negative errno
- *
- * Description:
- *     Complete @rq till the next failure boundary.
- *
- * Return:
- *     %false - we are done with this request
- *     %true  - still buffers pending for this request
- */
-bool blk_end_request_err(struct request *rq, int error)
-{
-	WARN_ON(error >= 0);
-	return blk_end_request(rq, error, blk_rq_err_bytes(rq));
-}
-EXPORT_SYMBOL_GPL(blk_end_request_err);
-
 /**
  * __blk_end_request - Helper function for drivers to complete the request.
  * @rq:       the request being processed

@@ -2924,26 +2912,6 @@ bool __blk_end_request_cur(struct request *rq, int error)
 }
 EXPORT_SYMBOL(__blk_end_request_cur);

-/**
- * __blk_end_request_err - Finish a request till the next failure boundary.
- * @rq: the request to finish till the next failure boundary for
- * @error: must be negative errno
- *
- * Description:
- *     Complete @rq till the next failure boundary. Must be called
- *     with queue lock held.
- *
- * Return:
- *     %false - we are done with this request
- *     %true  - still buffers pending for this request
- */
-bool __blk_end_request_err(struct request *rq, int error)
-{
-	WARN_ON(error >= 0);
-	return __blk_end_request(rq, error, blk_rq_err_bytes(rq));
-}
-EXPORT_SYMBOL_GPL(__blk_end_request_err);
-
 void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 		     struct bio *bio)
 {

@@ -3106,6 +3074,13 @@ int kblockd_schedule_work_on(int cpu, struct work_struct *work)
 }
 EXPORT_SYMBOL(kblockd_schedule_work_on);

+int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
+				unsigned long delay)
+{
+	return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
+}
+EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
+
 int kblockd_schedule_delayed_work(struct delayed_work *dwork,
 				  unsigned long delay)
 {
@@ -69,8 +69,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 	if (unlikely(blk_queue_dying(q))) {
 		rq->rq_flags |= RQF_QUIET;
-		rq->errors = -ENXIO;
-		__blk_end_request_all(rq, rq->errors);
+		__blk_end_request_all(rq, -ENXIO);
 		spin_unlock_irq(q->queue_lock);
 		return;
 	}

@@ -92,11 +91,10 @@ EXPORT_SYMBOL_GPL(blk_execute_rq_nowait);
 *    Insert a fully prepared request at the back of the I/O scheduler queue
 *    for execution and wait for completion.
 */
-int blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk,
+void blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk,
 		   struct request *rq, int at_head)
 {
 	DECLARE_COMPLETION_ONSTACK(wait);
-	int err = 0;
 	unsigned long hang_check;

 	rq->end_io_data = &wait;

@@ -108,10 +106,5 @@ int blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk,
 		while (!wait_for_completion_io_timeout(&wait, hang_check * (HZ/2)));
 	else
 		wait_for_completion_io(&wait);
-
-	if (rq->errors)
-		err = -EIO;
-
-	return err;
 }
 EXPORT_SYMBOL(blk_execute_rq);

@@ -447,7 +447,7 @@ void blk_insert_flush(struct request *rq)
 		if (q->mq_ops)
 			blk_mq_end_request(rq, 0);
 		else
-			__blk_end_bidi_request(rq, 0, 0, 0);
+			__blk_end_request(rq, 0, 0);
 		return;
 	}

@@ -497,8 +497,7 @@ void blk_insert_flush(struct request *rq)
 * Description:
 *    Issue a flush for the block device in question. Caller can supply
 *    room for storing the error offset in case of a flush error, if they
- *    wish to. If WAIT flag is not passed then caller may check only what
- *    request was pushed in some internal queue for later handling.
+ *    wish to.
 */
 int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 		sector_t *error_sector)