block, bfq: tune service injection basing on request service times

The processes associated with a bfq_queue, say Q, may happen to generate their cumulative I/O at a lower rate than the rate at which the device could serve the same I/O. This is rather probable, e.g., if only one process is associated with Q and the device is an SSD. It results in Q becoming often empty while in service. If BFQ is not allowed to switch to another queue when Q becomes empty, then, during the service of Q, there will be frequent "service holes", i.e., time intervals during which Q gets empty and the device can only consume the I/O already queued in its hardware queues. This easily causes considerable losses of throughput. To counter this problem, BFQ implements a request injection mechanism, which tries to fill the above service holes with I/O requests taken from other bfq_queues. The hard part in this mechanism is finding the right amount of I/O to inject, so as to both boost throughput and not break Q's bandwidth and latency guarantees. To this goal, the current version of this mechanism measures the bandwidth enjoyed by Q while it is being served, and tries to inject the maximum possible amount of extra service that does not cause Q's bandwidth to decrease too much. This solution has an important shortcoming. For bandwidth measurements to be stable and reliable, Q must remain in service for a much longer time than that needed to serve a single I/O request. Unfortunately, this does not hold with many workloads. This commit addresses this issue by changing the way the amount of injection allowed is dynamically computed. It tunes injection as a function of the service times of single I/O requests of Q, instead of Q's bandwidth. Single-request service times are evidently meaningful even if Q gets very few I/O requests completed while it is in service. As a testbed for this new solution, we measured the throughput reached by BFQ for one of the nastiest workloads and configurations for this scheduler: the workload generated by the dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on a PLEXTOR PX-256M5. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>

block, bfq: tune service injection basing on request service times
The processes associated with a bfq_queue, say Q, may happen to generate their cumulative I/O at a lower rate than the rate at which the device could serve the same I/O. This is rather probable, e.g., if only one process is associated with Q and the device is an SSD. It results in Q becoming often empty while in service. If BFQ is not allowed to switch to another queue when Q becomes empty, then, during the service of Q, there will be frequent "service holes", i.e., time intervals during which Q gets empty and the device can only consume the I/O already queued in its hardware queues. This easily causes considerable losses of throughput. To counter this problem, BFQ implements a request injection mechanism, which tries to fill the above service holes with I/O requests taken from other bfq_queues. The hard part in this mechanism is finding the right amount of I/O to inject, so as to both boost throughput and not break Q's bandwidth and latency guarantees. To this goal, the current version of this mechanism measures the bandwidth enjoyed by Q while it is being served, and tries to inject the maximum possible amount of extra service that does not cause Q's bandwidth to decrease too much. This solution has an important shortcoming. For bandwidth measurements to be stable and reliable, Q must remain in service for a much longer time than that needed to serve a single I/O request. Unfortunately, this does not hold with many workloads. This commit addresses this issue by changing the way the amount of injection allowed is dynamically computed. It tunes injection as a function of the service times of single I/O requests of Q, instead of Q's bandwidth. Single-request service times are evidently meaningful even if Q gets very few I/O requests completed while it is in service. As a testbed for this new solution, we measured the throughput reached by BFQ for one of the nastiest workloads and configurations for this scheduler: the workload generated by the dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on a PLEXTOR PX-256M5. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2341d662 · Paolo Valente · Jens Axboe · fb53ac6c · 2341d662 · 2341d662
Commit 2341d662 authored Mar 12, 2019 by Paolo Valente Committed by Jens Axboe Apr 01, 2019
Expand all Hide whitespace changes
Inline Side-by-side

Showing with 409 additions and 59 deletions

block/bfq-iosched.c block/bfq-iosched.c +381 -36

block/bfq-iosched.h block/bfq-iosched.h +28 -23

No files found.
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -240,6 +240,13 @@ struct bfq_queue {
 	/* next ioprio and ioprio class if a change is in progress */
 	unsigned short new_ioprio, new_ioprio_class;

+	/* last total-service-time sample, see bfq_update_inject_limit() */
+	u64 last_serv_time_ns;
+	/* limit for request injection */
+	unsigned int inject_limit;
+	/* last time the inject limit has been decreased, in jiffies */
+	unsigned long decrease_time_jif;
+
 	/*
 	 * Shared bfq_queue if queue is cooperating with one or more
 	 * other queues.
@@ -357,29 +364,6 @@ struct bfq_queue {

 	/* max service rate measured so far */
 	u32 max_service_rate;
-	/*
-	 * Ratio between the service received by bfqq while it is in
-	 * service, and the cumulative service (of requests of other
-	 * queues) that may be injected while bfqq is empty but still
-	 * in service. To increase precision, the coefficient is
-	 * measured in tenths of unit. Here are some example of (1)
-	 * ratios, (2) resulting percentages of service injected
-	 * w.r.t. to the total service dispatched while bfqq is in
-	 * service, and (3) corresponding values of the coefficient:
-	 * 1 (50%) -> 10
-	 * 2 (33%) -> 20
-	 * 10 (9%) -> 100
-	 * 9.9 (9%) -> 99
-	 * 1.5 (40%) -> 15
-	 * 0.5 (66%) -> 5
-	 * 0.1 (90%) -> 1
-	 *
-	 * So, if the coefficient is lower than 10, then
-	 * injected service is more than bfqq service.
-	 */
-	unsigned int inject_coeff;
-	/* amount of service injected in current service slot */
-	unsigned int injected_service;
 };

 /**
@@ -544,6 +528,26 @@ struct bfq_data {
 	/* time of last request completion (ns) */
 	u64 last_completion;

+	/* time of last transition from empty to non-empty (ns) */
+	u64 last_empty_occupied_ns;
+
+	/*
+	 * Flag set to activate the sampling of the total service time
+	 * of a just-arrived first I/O request (see
+	 * bfq_update_inject_limit()). This will cause the setting of
+	 * waited_rq when the request is finally dispatched.
+	 */
+	bool wait_dispatch;
+	/*
+	 *  If set, then bfq_update_inject_limit() is invoked when
+	 *  waited_rq is eventually completed.
+	 */
+	struct request *waited_rq;
+	/*
+	 * True if some request has been injected during the last service hole.
+	 */
+	bool rqs_injected;
+
 	/* time of first rq dispatch in current observation interval (ns) */
 	u64 first_dispatch;
 	/* time of last rq dispatch in current observation interval (ns) */
@@ -553,6 +557,7 @@ struct bfq_data {
 	ktime_t last_budget_start;
 	/* beginning of the last idle slice */
 	ktime_t last_idling_start;
+	unsigned long last_idling_start_jiffies;

 	/* number of samples in current observation interval */
 	int peak_rate_samples;