- 08 Apr, 2017 27 commits
-
-
Martin K. Petersen authored
Separating discards and zeroout operations allows us to remove the LBPRZ block zeroing constraints from discards and honor the device preferences for UNMAP commands. If supported by the device, we'll also choose UNMAP over one of the WRITE SAME variants for discards. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
Now that zeroout and discards are distinct operations we need to separate the policy of choosing the appropriate command. Create a zeroing_mode which can be one of: write: Zeroout assist not present, use regular WRITE writesame: Allow WRITE SAME(10/16) with a zeroed payload writesame_16_unmap: Allow WRITE SAME(16) with UNMAP writesame_10_unmap: Allow WRITE SAME(10) with UNMAP The last two are conditional on the device being thin provisioned with LBPRZ=1 and LBPWS=1 or LBPWS10=1 respectively. Whether to set the UNMAP bit or not depends on the REQ_NOUNMAP flag. And if none of the _unmap variants are supported, regular WRITE SAME will be used if the device supports it. The zeroout_mode is exported in sysfs and the detected mode for a given device can be overridden using the string constants above. With this change in place we can now issue WRITE SAME(16) with UNMAP set for block zeroing applications that require hard guarantees and logical_block_size granularity. And at the same time use the UNMAP command with the device's preferred granulary and alignment for discard operations. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can kill this hack. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
It seems like DRBD assumes its on the wire TRIM request always zeroes data. Use that fact to implement REQ_OP_WRITE_ZEROES. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
drbd always wants its discard wire operations to zero the blocks, so use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of reinventing it poorly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Now that we have REQ_OP_WRITE_ZEROES implemented for all devices that support efficient zeroing, we can remove the call to blkdev_issue_discard. This means we only have two ways of zeroing left and can simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
mmc only supports discarding on large alignments, so the zeroing code would always fall back to explicit writings of zeroes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
rsxx only supports discarding on large alignments, so the zeroing code would always fall back to explicit writings of zeroes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
rbd only supports discarding on large alignments, so the zeroing code would always fall back to explicit writings of zeroes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
It's just a in-driver reimplementation of writing zeroes to the pages, which fails if the discards aren't page aligned. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
It's identical to discard as hole punches will always leave us with zeroes on reads. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Just the same as discard if the block size equals the system page size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
But now for the real NVMe Write Zeroes yet, just to get rid of the discard abuse for zeroing. Also rename the quirk flag to be a bit more self-explanatory. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Try to use a write same with unmap bit variant if the device supports it and the caller allows for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
This gets us support for non-discard efficient write of zeroes (e.g. NVMe) and prepares for removing the discard_zeroes_data flag. Also remove a pointless discard support check, which is done in blkdev_issue_discard already. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
This avoids fallbacks to explicit zeroing in (__)blkdev_issue_zeroout if the caller doesn't want them. Also clean up the convoluted check for the return condition that this new flag is added to. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
If this flag is set logical provisioning capable device should release space for the zeroed blocks if possible, if it is not set devices should keep the blocks anchored. Also remove an out of sync kerneldoc comment for a static function that would have become even more out of data with this change. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with similar semantics, but without referring to diѕcard. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
We'll always use the WRITE ZEROES code for zeroing now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
It seems like the code currently passes whatever it was using for writes to WRITE SAME. Just switch it to WRITE ZEROES, although that doesn't need any payload. Untested, and confused by the code, maybe someone who understands it better than me can help.. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Fix up do_region to not allocate a bio_vec for discards. We've got rid of the discard payload allocated by the caller years ago. Obviously this wasn't actually harmful given how long it's been there, but it's still good to avoid the pointless allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations that limit the write zeroes size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Make life easy for implementations that needs to send a data buffer to the device (e.g. SCSI) by numbering it as a data out command. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Split sd_setup_discard_cmnd into one function per provisioning type. While this creates some very slight duplication of boilerplate code it keeps the code modular for additions of new provisioning types, and for reusing the write same functions for the upcoming scsi implementation of the Write Zeroes operation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 07 Apr, 2017 13 commits
-
-
Scott Bauer authored
Lets not flood the kernel log with messages unless the user requests so. Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Bart Van Assche authored
The blk_mq_dispatch_rq_list() implementation got modified several times but the comments in that function were not updated every time. Since it is nontrivial what is going on, update the comments in blk_mq_dispatch_rq_list(). Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Bart Van Assche authored
Since the next patch in this series will use RCU to iterate over tag_list, make this safe. Add lockdep_assert_held() statements in functions that iterate over tag_list to make clear that using list_for_each_entry() instead of list_for_each_entry_rcu() is fine in these functions. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Omar Sandoval authored
Trivial cleanup. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Omar Sandoval authored
Minor cleanup that makes it easier to figure out what's going on in the driver tag allocation failure path of blk_mq_dispatch_rq_list(). Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Omar Sandoval authored
Schedulers need to be informed when a hardware queue is added or removed at runtime so they can allocate/free per-hardware queue data. So, replace the blk_mq_sched_init_hctx_data() helper, which only makes sense at init time, with .init_hctx() and .exit_hctx() hooks. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jens Axboe authored
We've added a considerable amount of fixes for stalls and issues with the blk-mq scheduling in the 4.11 series since forking off the for-4.12/block branch. We need to do improvements on top of that for 4.12, so pull in the previous fixes to make our lives easier going forward. Signed-off-by: Jens Axboe <axboe@fb.com>
-
Bart Van Assche authored
To improve scalability, if hardware queues are shared, restart a single hardware queue in round-robin fashion. Rename blk_mq_sched_restart_queues() to reflect the new semantics. Remove blk_mq_sched_mark_restart_queue() because this function has no callers. Remove flag QUEUE_FLAG_RESTART because this patch removes the code that uses this flag. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Bart Van Assche authored
While running the srp-test software I noticed that request processing stalls sporadically at the beginning of a test, namely when mkfs is run against a dm-mpath device. Every time when that happened the following command was sufficient to resume request processing: echo run >/sys/kernel/debug/block/dm-0/state This patch avoids that such request processing stalls occur. The test I ran is as follows: while srp-test/run_tests -d -r 30 -t 02-mq; do :; done Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Jens Axboe <axboe@fb.com>
-
Bart Van Assche authored
If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block driver that implements that function is responsible for rerunning the hardware queue once requests can be queued again successfully. commit 52d7f1b5 ("blk-mq: Avoid that requeueing starts stopped queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq() for the BLK_MQ_RQ_QUEUE_BUSY case. Hence change all calls to functions that are intended to rerun a busy queue such that these examine all hardware queues instead of only stopped queues. Since no other functions than scsi_internal_device_block() and scsi_internal_device_unblock() should ever stop or restart a SCSI queue, change the blk_mq_delay_queue() call into a blk_mq_delay_run_hw_queue() call. Fixes: commit 52d7f1b5 ("blk-mq: Avoid that requeueing starts stopped queues") Fixes: commit 7e79dadc ("blk-mq: stop hardware queue in blk_mq_delay_queue()") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Long Li <longli@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Bart Van Assche authored
Introduce a function that runs a hardware queue unconditionally after a delay. Note: there is already a function that stops and restarts a hardware queue after a delay, namely blk_mq_delay_queue(). This function will be used in the next patch in this series. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Long Li <longli@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
NeilBrown authored
Currently only dm and md/raid5 bios trigger trace_block_bio_complete(). Now that we have bio_chain() and bio_inc_remaining(), it is not possible, in general, for a driver to know when the bio is really complete. Only bio_endio() knows that. So move the trace_block_bio_complete() call to bio_endio(). Now trace_block_bio_complete() pairs with trace_block_bio_queue(). Any bio for which a 'queue' event is traced, will subsequently generate a 'complete' event. There are a few cases where completion tracing is not wanted. 1/ If blk_update_request() has already generated a completion trace event at the 'request' level, there is no point generating one at the bio level too. In this case the bi_sector and bi_size will have changed, so the bio level event would be wrong 2/ If the bio hasn't actually been queued yet, but is being aborted early, then a trace event could be confusing. Some filesystems call bio_endio() but do not want tracing. 3/ The bio_integrity code interposes itself by replacing bi_end_io, then restoring it and calling bio_endio() again. This would produce two identical trace events if left like that. To handle these, we introduce a flag BIO_TRACE_COMPLETION and only produce the trace event when this is set. We address point 1 above by clearing the flag in blk_update_request(). We address point 2 above by only setting the flag when generic_make_request() is called. We address point 3 above by clearing the flag after generating a completion event. When bio_split() is used on a bio, particularly in blk_queue_split(), there is an extra complication. A new bio is split off the front, and may be handle directly without going through generic_make_request(). The old bio, which has been advanced, is passed to generic_make_request(), so it will trigger a trace event a second time. Probably the best result when a split happens is to see a single 'queue' event for the whole bio, then multiple 'complete' events - one for each component. To achieve this was can: - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split() - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set. This way, the split-off bio won't create a queue event, the original won't either even if it re-submitted to generic_make_request(), but both will produce completion events, each for their own range. So if generic_make_request() is called (which generates a QUEUED event), then bi_endio() will create a single COMPLETE event for each range that the bio is split into, unless the driver has explicitly requested it not to. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
NeilBrown authored
The comment for the 'flags' field of 'bio' mentions "command" which is no longer stored there, and doesn't mention the bvec pool number, which is. BIO_RESET_BITS is set in such a way that it would need to be updated if new bits were added, which is easy to miss. BVEC_POOL_BITS is larger than needed. The BVEC_POOL_IDX() ranges from 0 to 6, so 3 bits are sufficient. This patch make improvements in each of these areas. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-