fs/buffer.c · 97ff29c22ec3df25621561194692e7e945fcf489 · nexedi / linux

[PATCH] anticipatory I/O scheduler · 97ff29c2
Andrew Morton authored Jul 04, 2003
From: Nick Piggin <piggin@cyberone.com.au>

This is the core anticipatory IO scheduler.  There are nearly 100 changesets
in this and five months work.  I really cannot describe it fully here.

Major points:

- It works by recognising that reads are dependent: we don't know where the
  next read will occur, but it's probably close-by the previous one.  So once
  a read has completed we leave the disk idle, anticipating that a request
  for a nearby read will come in.

- There is read batching and write batching logic.

  - when we're servicing a batch of writes we will refuse to seek away
    for a read for some tens of milliseconds.  Then the write stream is
    preempted.

  - when we're servicing a batch of reads (via anticipation) we'll do
    that for some tens of milliseconds, then preempt.

- There are request deadlines, for latency and fairness.
  The oldest outstanding request is examined at regular intervals. If
  this request is older than a specific deadline, it will be the next
  one dispatched. This gives a good fairness heuristic while being simple
  because processes tend to have localised IO.


Just about all of the rest of the complexity involves an array of fixups
which prevent most of teh obvious failure modes with anticipation: trying to
not leave the disk head pointlessly idle.  Some of these algorithms are:

- Process tracking.  If the process whose read we are anticipating submits
  a write, abandon anticipation.

- Process exit tracking.  If the process whose read we are anticipating
  exits, abandon anticipation.

- Process IO history.  We accumulate statistical info on the process's
  recent IO patterns to aid in making decisions about how long to anticipate
  new reads.

  Currently thinktime and seek distance are tracked. Thinktime is the
  time between when a process's last request has completed and when it
  submits another one. Seek distance is simply the number of sectors
  between each read request. If either statistic becomes too high, the
  it isn't anticipated that the process will submit another read.

The above all means that we need a per-process "io context".  This is a fully
refcounted structure.  In this patch it is AS-only.  later we generalise it a
little so other IO schedulers could use the same framework.

- Requests are grouped as synchronous and asynchronous whereas deadline
  scheduler groups requests as reads and writes. This can provide better
  sync write performance, and may give better responsiveness with journalling
  filesystems (although we haven't done that yet).

  We currently detect synchronous writes by nastily setting PF_SYNCWRITE in
  current->flags.  The plan is to remove this later, and to propagate the
  sync hint from writeback_contol.sync_mode into bio->bi_flags thence into
  request->flags.  Once that is done, direct-io needs to set the BIO sync
  hint as well.

- There is also quite a bit of complexity gone into bashing TCQ into
  submission. Timing for a read batch is not started until the first read
  request actually completes. A read batch also does not start until all
  outstanding writes have completed.

AS is the default IO scheduler.  deadline may be chosen by booting with
"elevator=deadline".

There are a few reasons for retaining deadline:

- AS is often slower than deadline in random IO loads with large TCQ
  windows. The usual real world task here is OLTP database loads.

- deadline is presumably more stable.

- deadline is much simpler.



The tunable per-queue entries under /sys/block/*/iosched/ are all in
milliseconds:

* read_expire

  Controls how long until a request becomes "expired".

  It also controls the interval between which expired requests are served,
  so set to 50, a request might take anywhere < 100ms to be serviced _if_ it
  is the next on the expired list.

  Obviously it can't make the disk go faster.  Result is basically the
  timeslice a reader gets in the presence of other IO.  100*((seek time /
  read_expire) + 1) is very roughly the % streaming read efficiency your disk
  should get in the presence of multiple readers.

* read_batch_expire

  Controls how much time a batch of reads is given before pending writes
  are served.  Higher value is more efficient.  Shouldn't really be below
  read_expire.

* write_ versions of the above

* antic_expire

  Controls the maximum amount of time we can anticipate a good read before
  giving up.  Many other factors may cause anticipation to be stopped early,
  or some processes will not be "anticipated" at all.  Should be a bit higher
  for big seek time devices though not a linear correspondance - most
  processes have only a few ms thinktime.
97ff29c2
buffer.c 77.7 KB
Replace buffer.c