Commit 96a291c3 authored by Paolo Valente, committed by Jens Axboe

block, bfq: preempt lower-weight or lower-priority queues

BFQ enqueues the I/O coming from each process into a separate
bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
served for at most timeout_sync milliseconds (default: 125 ms). This
service scheme is prone to the following inaccuracy.
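
For reference, this per-queue service limit is exposed through sysfs
when BFQ is the active scheduler for a device. A minimal sketch that
reads it (the device name sdX is a placeholder):

/* Print BFQ's timeout_sync tunable, i.e., the maximum time a
 * bfq_queue may remain in service, reported in milliseconds. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/block/sdX/queue/iosched/timeout_sync", "r");
	int timeout_ms;

	if (!f || fscanf(f, "%d", &timeout_ms) != 1) {
		perror("timeout_sync");
		return 1;
	}
	printf("max service slice per bfq_queue: %d ms\n", timeout_ms);
	fclose(f);
	return 0;
}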

While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
receive I/O, and, according to BFQ's scheduling policy, may become the
right bfq_queue to serve, in place of the currently in-service
bfq_queue. In this respect, postponing the service of Q2 to after the
service of Q1 finishes may delay the completion of Q2's I/O, compared
with an ideal service in which all non-empty bfq_queues are served in
parallel, and every non-empty bfq_queue is served at a rate
proportional to the bfq_queue's weight. This additional delay is at
most equal to the time Q1 may unjustly remain in service before
switching to Q2.

If Q1 and Q2 have the same weight, then this time is most likely
negligible compared with the completion time to be guaranteed to Q2's
I/O. In addition, one of the reasons why BFQ may want to keep serving
Q1 for a while is that doing so boosts throughput; serving Q1 longer
also reduces BFQ's overhead. In conclusion, it is usually better not
to preempt Q1 when Q1 and Q2 have the same weight.

In contrast, as Q2's weight or priority grows with respect to that of
Q1, the above delay grows with respect to the I/O completion times
that have to be guaranteed to Q2 according to Q2's weight. So reducing
this delay may matter more than avoiding the costs of preempting Q1.
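
To make the proportion concrete, here is a toy calculation (all
numbers made up; nothing below is BFQ code). Assume Q2 competes
against queues whose weights sum to 1000, and that Q2's pending reads
need 10 ms of pure service time; under ideal proportional sharing they
complete in about 10 * (1000 + w2) / w2 ms, while the worst-case stall
from not preempting stays fixed at timeout_sync:

#include <stdio.h>

int main(void)
{
	const double timeout_sync = 125.0; /* ms, BFQ's default time slice */
	const double service = 10.0;   /* ms of service Q2's I/O needs (made up) */
	const double w_rest = 1000.0;  /* total weight of competitors (made up) */
	const double weights[] = { 100.0, 1000.0, 10000.0 }; /* candidate w2 */

	for (int i = 0; i < 3; i++) {
		/* ideal completion time under proportional sharing */
		double ideal = service * (w_rest + weights[i]) / weights[i];

		printf("w2 = %6.0f  ideal = %6.1f ms  worst stall / ideal = %5.1f\n",
		       weights[i], ideal, timeout_sync / ideal);
	}
	return 0;
}

With w2 = 100 (the same weight as each competitor) the 125 ms
worst-case stall is roughly on the scale of the 110 ms ideal
completion time, whereas at ten or a hundred times that weight the
stall exceeds the ideal time by factors of about 6 and 11: exactly the
regime in which preempting pays off.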

Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
triggers a new choice of the next bfq_queue to serve. If Q2 really is
the next bfq_queue to serve, then Q2 will be set in service
immediately.
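
The priority-class half of this check (bfq_bfqq_higher_class_or_weight()
in the diff below) compares ioprio classes, so a process in the
real-time I/O class can now preempt a best-effort in-service queue. As
an illustrative sketch, a latency-sensitive process could place itself
in that class with the ioprio_set() syscall; the constants below
mirror <linux/ioprio.h>, and raising the class normally requires
CAP_SYS_ADMIN:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from the stable ioprio ABI. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_CLASS_RT		1	/* real-time I/O class */
#define IOPRIO_WHO_PROCESS	1
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
	/* Move the calling process (who = 0 means self) into the RT
	 * class, priority level 0. */
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
		    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0)) < 0) {
		perror("ioprio_set");
		return 1;
	}
	/* ... issue latency-sensitive reads here ... */
	return 0;
}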

This change reduces the component of the I/O latency caused by the
above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
a process doing sporadic random reads while another process is doing
continuous sequential reads.
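
The exact fio invocation is not part of the commit message; a job file
along these lines (all parameters guessed) recreates the described
scenario of a sporadic random reader competing with a continuous
sequential reader:

[global]
; device under test (placeholder)
filename=/dev/sdX
direct=1
ioengine=psync
runtime=60
time_based

; continuous sequential reads
[seq-reader]
rw=read
bs=128k

; sporadic random reads
[rand-reader]
rw=randread
bs=4k
; pause between individual reads, to keep them sporadic
thinktime=100ms

The maximum completion latency of the rand-reader job then shows up in
the lat statistics of fio's output.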
Signed-off-by: Nicola Bottura <bottura.nicola95@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
parent 13a857a4
@@ -1428,17 +1428,19 @@ static int bfq_min_budget(struct bfq_data *bfqd)
  * mechanism may be re-designed in such a way to make it possible to
  * know whether preemption is needed without needing to update service
  * trees). In addition, queue preemptions almost always cause random
- * I/O, and thus loss of throughput. Because of these facts, the next
- * function adopts the following simple scheme to avoid both costly
- * operations and too frequent preemptions: it requests the expiration
- * of the in-service queue (unconditionally) only for queues that need
- * to recover a hole, or that either are weight-raised or deserve to
- * be weight-raised.
+ * I/O, which may in turn cause loss of throughput. Finally, there may
+ * even be no in-service queue when the next function is invoked (so,
+ * no queue to compare timestamps with). Because of these facts, the
+ * next function adopts the following simple scheme to avoid costly
+ * operations, too frequent preemptions and too many dependencies on
+ * the state of the scheduler: it requests the expiration of the
+ * in-service queue (unconditionally) only for queues that need to
+ * recover a hole. Then it delegates to other parts of the code the
+ * responsibility of handling the above case 2.
  */
 static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
 						struct bfq_queue *bfqq,
-						bool arrived_in_time,
-						bool wr_or_deserves_wr)
+						bool arrived_in_time)
 {
 	struct bfq_entity *entity = &bfqq->entity;
 
@@ -1493,7 +1495,7 @@ static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
 		entity->budget = max_t(unsigned long, bfqq->max_budget,
 				       bfq_serv_to_charge(bfqq->next_rq, bfqq));
 		bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
-		return wr_or_deserves_wr;
+		return false;
 	}
 
 	/*
@@ -1611,6 +1613,36 @@ static bool bfq_bfqq_idle_for_long_time(struct bfq_data *bfqd,
 				       bfqd->bfq_wr_min_idle_time);
 }
 
+/*
+ * Return true if bfqq is in a higher priority class, or has a higher
+ * weight than the in-service queue.
+ */
+static bool bfq_bfqq_higher_class_or_weight(struct bfq_queue *bfqq,
+					    struct bfq_queue *in_serv_bfqq)
+{
+	int bfqq_weight, in_serv_weight;
+
+	if (bfqq->ioprio_class < in_serv_bfqq->ioprio_class)
+		return true;
+
+	if (in_serv_bfqq->entity.parent == bfqq->entity.parent) {
+		bfqq_weight = bfqq->entity.weight;
+		in_serv_weight = in_serv_bfqq->entity.weight;
+	} else {
+		if (bfqq->entity.parent)
+			bfqq_weight = bfqq->entity.parent->weight;
+		else
+			bfqq_weight = bfqq->entity.weight;
+
+		if (in_serv_bfqq->entity.parent)
+			in_serv_weight = in_serv_bfqq->entity.parent->weight;
+		else
+			in_serv_weight = in_serv_bfqq->entity.weight;
+	}
+
+	return bfqq_weight > in_serv_weight;
+}
+
 static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 					     struct bfq_queue *bfqq,
 					     int old_wr_coeff,
@@ -1655,8 +1687,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 	 */
 	bfqq_wants_to_preempt =
 		bfq_bfqq_update_budg_for_activation(bfqd, bfqq,
-						    arrived_in_time,
-						    wr_or_deserves_wr);
+						    arrived_in_time);
 
 	/*
 	 * If bfqq happened to be activated in a burst, but has been
@@ -1721,16 +1752,40 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 
 	/*
 	 * Expire in-service queue only if preemption may be needed
-	 * for guarantees. In this respect, the function
-	 * next_queue_may_preempt just checks a simple, necessary
-	 * condition, and not a sufficient condition based on
-	 * timestamps. In fact, for the latter condition to be
-	 * evaluated, timestamps would need first to be updated, and
-	 * this operation is quite costly (see the comments on the
-	 * function bfq_bfqq_update_budg_for_activation).
+	 * for guarantees. In particular, we care only about two
+	 * cases. The first is that bfqq has to recover a service
+	 * hole, as explained in the comments on
+	 * bfq_bfqq_update_budg_for_activation(), i.e., that
+	 * bfqq_wants_to_preempt is true. However, if bfqq does not
+	 * carry time-critical I/O, then bfqq's bandwidth is less
+	 * important than that of queues that carry time-critical I/O.
+	 * So, as a further constraint, we consider this case only if
+	 * bfqq is at least as weight-raised, i.e., at least as time
+	 * critical, as the in-service queue.
+	 *
+	 * The second case is that bfqq is in a higher priority class,
+	 * or has a higher weight than the in-service queue. If this
+	 * condition does not hold, we don't care because, even if
+	 * bfqq does not start to be served immediately, the resulting
+	 * delay for bfqq's I/O is however lower or much lower than
+	 * the ideal completion time to be guaranteed to bfqq's I/O.
+	 *
+	 * In both cases, preemption is needed only if, according to
+	 * the timestamps of both bfqq and of the in-service queue,
+	 * bfqq actually is the next queue to serve. So, to reduce
+	 * useless preemptions, the return value of
+	 * next_queue_may_preempt() is considered in the next compound
+	 * condition too. Yet next_queue_may_preempt() just checks a
+	 * simple, necessary condition for bfqq to be the next queue
+	 * to serve. In fact, to evaluate a sufficient condition, the
+	 * timestamps of the in-service queue would need to be
+	 * updated, and this operation is quite costly (see the
+	 * comments on bfq_bfqq_update_budg_for_activation()).
 	 */
-	if (bfqd->in_service_queue && bfqq_wants_to_preempt &&
-	    bfqd->in_service_queue->wr_coeff < bfqq->wr_coeff &&
+	if (bfqd->in_service_queue &&
+	    ((bfqq_wants_to_preempt &&
+	      bfqq->wr_coeff >= bfqd->in_service_queue->wr_coeff) ||
+	     bfq_bfqq_higher_class_or_weight(bfqq, bfqd->in_service_queue)) &&
 	    next_queue_may_preempt(bfqd))
 		bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
 				false, BFQQE_PREEMPTED);