    [PATCH] anticipatory I/O scheduler · 97ff29c2
    Andrew Morton authored
    From: Nick Piggin <piggin@cyberone.com.au>
    
    This is the core anticipatory IO scheduler.  There are nearly 100 changesets
    in this and five months' work.  I really cannot describe it fully here.
    
    Major points:
    
    - It works by recognising that reads are dependent: we don't know where the
      next read will occur, but it's probably close to the previous one.  So once
      a read has completed we leave the disk idle, anticipating that a request
      for a nearby read will come in.
    
    - There is read batching and write batching logic.
    
      - when we're servicing a batch of writes we will refuse to seek away
        for a read for some tens of milliseconds.  Then the write stream is
        preempted.
    
      - when we're servicing a batch of reads (via anticipation) we'll do
        that for some tens of milliseconds, then preempt.
    
    - There are request deadlines, for latency and fairness.
      The oldest outstanding request is examined at regular intervals. If
      this request is older than a specific deadline, it will be the next
      one dispatched. This gives a good fairness heuristic while being simple
      because processes tend to have localised IO.
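
The deadline rule above can be sketched in a few lines of user-space Python. This is only an illustration under stated assumptions, not the kernel code: `Request`, `pick_next` and the nearest-sector fallback are hypothetical, and the deadline is checked on every dispatch rather than at regular intervals.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    sector: int        # starting sector of the request
    submitted_ms: int  # time the request was queued


def pick_next(fifo, head_sector, now_ms, expire_ms):
    """Remove and return the next request to dispatch, or None if idle.

    `fifo` is a deque ordered by submission time, oldest first.
    """
    if not fifo:
        return None
    oldest = fifo[0]
    if now_ms - oldest.submitted_ms >= expire_ms:
        # The oldest request has passed its deadline: it goes next,
        # regardless of how far the head has to seek.
        chosen = oldest
    else:
        # Otherwise favour locality (assumed policy: nearest sector).
        chosen = min(fifo, key=lambda r: abs(r.sector - head_sector))
    fifo.remove(chosen)
    return chosen
```

Because only the head of the FIFO is ever examined, the check stays cheap however many requests are queued.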
    
    
    Just about all of the rest of the complexity involves an array of fixups
    which prevent most of the obvious failure modes with anticipation: trying to
    not leave the disk head pointlessly idle.  Some of these algorithms are:
    
    - Process tracking.  If the process whose read we are anticipating submits
      a write, abandon anticipation.
    
    - Process exit tracking.  If the process whose read we are anticipating
      exits, abandon anticipation.
    
    - Process IO history.  We accumulate statistical info on the process's
      recent IO patterns to aid in making decisions about how long to anticipate
      new reads.
    
      Currently thinktime and seek distance are tracked. Thinktime is the
      time between when a process's last request has completed and when it
      submits another one. Seek distance is simply the number of sectors
      between successive read requests. If either statistic becomes too high,
      we do not anticipate that the process will submit another read.
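
As a rough illustration of the per-process IO history described above, the sketch below tracks thinktime and seek distance and uses them to decide whether anticipation is worthwhile. The class name, the exponentially weighted averaging, the weight, and both thresholds are assumptions made for the example; the patch itself may compute these statistics differently.

```python
class IoHistory:
    """Per-process running averages of thinktime and seek distance."""

    def __init__(self, weight=0.25):
        self.weight = weight              # assumed smoothing factor
        self.mean_thinktime_ms = 0.0
        self.mean_seek_sectors = 0.0
        self.last_completion_ms = None    # when the last request finished
        self.last_sector = None           # sector of the previous read

    def _blend(self, mean, sample):
        # Exponentially weighted moving average (an assumption).
        return mean + self.weight * (sample - mean)

    def request_completed(self, now_ms):
        self.last_completion_ms = now_ms

    def read_submitted(self, now_ms, sector):
        if self.last_completion_ms is not None:
            # Thinktime: completion of the last request -> next submission.
            thinktime = now_ms - self.last_completion_ms
            self.mean_thinktime_ms = self._blend(self.mean_thinktime_ms, thinktime)
            self.last_completion_ms = None
        if self.last_sector is not None:
            # Seek distance: sectors between successive reads.
            seek = abs(sector - self.last_sector)
            self.mean_seek_sectors = self._blend(self.mean_seek_sectors, seek)
        self.last_sector = sector

    def worth_anticipating(self, max_thinktime_ms, max_seek_sectors):
        # Give up on anticipation if either statistic is too high.
        return (self.mean_thinktime_ms <= max_thinktime_ms
                and self.mean_seek_sectors <= max_seek_sectors)
```

A process that reads nearby sectors with short pauses keeps both averages low and stays eligible; one slow or distant read pushes the averages up and anticipation is skipped until they decay again.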
    
    The above all means that we need a per-process "io context".  This is a fully
    refcounted structure.  In this patch it is AS-only.  Later we generalise it a
    little so other IO schedulers could use the same framework.
    
    - Requests are grouped as synchronous and asynchronous, whereas the deadline
      scheduler groups requests as reads and writes. This can provide better
      sync write performance, and may give better responsiveness with journalling
      filesystems (although we haven't done that yet).
    
      We currently detect synchronous writes by nastily setting PF_SYNCWRITE in
      current->flags.  The plan is to remove this later, and to propagate the
      sync hint from writeback_control.sync_mode into bio->bi_flags thence into
      request->flags.  Once that is done, direct-io needs to set the BIO sync
      hint as well.
    
    - Quite a bit of complexity has also gone into bashing TCQ into
      submission. Timing for a read batch is not started until the first read
      request actually completes. A read batch also does not start until all
      outstanding writes have completed.
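
The two TCQ rules in the last point can be pictured as a small timer object. This is a hypothetical user-space sketch (the class and method names are invented for illustration): the batch may only begin once no writes are in flight, and its clock starts at the first read completion rather than at dispatch.

```python
class ReadBatchTimer:
    """Tracks when a read batch may start and when it has run out."""

    def __init__(self, batch_expire_ms):
        self.batch_expire_ms = batch_expire_ms
        self.started_ms = None  # set by the first completed read

    def can_start(self, writes_in_flight):
        # A read batch does not start until all outstanding writes
        # have completed.
        return writes_in_flight == 0

    def read_completed(self, now_ms):
        # Only the first completion starts the clock; with TCQ the
        # device may have held the request for a while before serving it.
        if self.started_ms is None:
            self.started_ms = now_ms

    def expired(self, now_ms):
        # Before any read has completed, the batch cannot expire.
        if self.started_ms is None:
            return False
        return now_ms - self.started_ms >= self.batch_expire_ms
```

Starting the clock at completion keeps time that requests spend queued inside the device from being charged against the read batch.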
    
    AS is the default IO scheduler.  deadline may be chosen by booting with
    "elevator=deadline".
    
    There are a few reasons for retaining deadline:
    
    - AS is often slower than deadline in random IO loads with large TCQ
      windows. The usual real world task here is OLTP database loads.
    
    - deadline is presumably more stable.
    
    - deadline is much simpler.
    
    
    
    The tunable per-queue entries under /sys/block/*/iosched/ are all in
    milliseconds:
    
    * read_expire
    
      Controls how long until a request becomes "expired".
    
      It also controls the interval at which expired requests are served, so
      with a setting of 50, a request might take anywhere up to 100ms to be
      serviced _if_ it is the next on the expired list.
    
      Obviously it can't make the disk go faster.  The result is basically the
      timeslice a reader gets in the presence of other IO.  100 / ((seek time /
      read_expire) + 1) is very roughly the % streaming read efficiency your disk
      should get in the presence of multiple readers.
    
    * read_batch_expire
    
      Controls how much time a batch of reads is given before pending writes
      are served.  Higher value is more efficient.  Shouldn't really be below
      read_expire.
    
    * write_ versions of the above
    
    * antic_expire
    
      Controls the maximum amount of time we can anticipate a good read before
      giving up.  Many other factors may cause anticipation to be stopped early,
      and some processes will not be "anticipated" at all.  Should be a bit higher
      for devices with big seek times, although the correspondence is not linear:
      most processes have only a few ms thinktime.
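
For experimenting with these tunables, a small user-space helper along the following lines may be handy. It is a sketch under stated assumptions: both function names are hypothetical, and the efficiency estimate reads the rule of thumb given for read_expire as the share of each timeslice spent transferring rather than seeking, i.e. read_expire / (read_expire + seek time), expressed as a percentage.

```python
from pathlib import Path


def read_iosched_tunables(iosched_dir):
    """Return {name: milliseconds} for each integer-valued file in the directory."""
    tunables = {}
    for entry in sorted(Path(iosched_dir).iterdir()):
        if not entry.is_file():
            continue
        try:
            tunables[entry.name] = int(entry.read_text().strip())
        except (ValueError, OSError):
            # Skip entries that are unreadable or not plain integers.
            continue
    return tunables


def streaming_read_efficiency(seek_time_ms, read_expire_ms):
    """Very rough % streaming read efficiency with multiple readers."""
    if read_expire_ms <= 0:
        raise ValueError("read_expire_ms must be positive")
    # Equivalent to 100 * read_expire / (read_expire + seek_time).
    return 100.0 / (seek_time_ms / read_expire_ms + 1.0)
```

For example, `read_iosched_tunables("/sys/block/hda/iosched")` would list the settings for one queue (the device name `hda` is only a placeholder), and a 10ms seek time against a 40ms read_expire gives an estimate of about 80%.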