1. 22 Nov, 2016 3 commits
    • md/raid1: add failfast handling for reads. · 2e52d449
      NeilBrown authored
      If a device is marked FailFast and it is not the only device
      we can read from, we mark the bio with REQ_FAILFAST_* flags.
      
      If this does fail, we don't try read repair but just allow
      failure.  If it was the last device it doesn't fail of
      course, so the retry happens on the same device - this time
      without FAILFAST.  A subsequent failure will not retry but
      will just pass up the error.
      
      During resync we may use FAILFAST requests and on a failure
      we will simply use the other device(s).
      
      During recovery we will only use FAILFAST in the unusual
      case where there are multiple places to read from - i.e. if
      there are > 2 devices.  If we get a failure we will fail the
      device and complete the resync/recovery with remaining
      devices.
      
      The new R1BIO_FailFast flag is set on a read request to suggest
      that a FAILFAST request might be acceptable.  The rdev needs
      to have FailFast set as well for the read to actually use
      REQ_FAILFAST_*.
      
      We need to know there are at least two working devices
      before we can set R1BIO_FailFast, so we mustn't stop looking
      at the first device we find.  So the "min_pending == 0"
      handling is changed to not exit early, but to always choose
      the best_pending_disk if min_pending == 0.
      
      The spinlocked region in raid1_error() is enlarged to ensure
      that if two bios, reading from two different devices, fail
      at the same time, then there is no risk that both devices
      will be marked faulty, leaving zero "In_sync" devices.
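
      The read-path rule above can be sketched as a small user-space
      model.  This is not the kernel code: the names model_rdev,
      count_readable(), r1bio_may_failfast() and read_uses_failfast()
      are hypothetical, and only the rule itself (both the request-level
      and the device-level flag must agree) comes from this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; not the kernel's struct md_rdev. */
struct model_rdev {
    bool in_sync;   /* device holds valid data and can serve reads */
    bool failfast;  /* models the per-device FailFast flag */
};

/* Count the devices a read could be served from. */
static size_t count_readable(const struct model_rdev *devs, size_t n)
{
    size_t readable = 0;

    for (size_t i = 0; i < n; i++)
        if (devs[i].in_sync)
            readable++;
    return readable;
}

/*
 * Models R1BIO_FailFast: failfast is only acceptable when at least two
 * devices could serve the read, so every device must be examined.
 */
static bool r1bio_may_failfast(const struct model_rdev *devs, size_t n)
{
    return count_readable(devs, n) >= 2;
}

/* The read carries REQ_FAILFAST_* only if both flags agree. */
static bool read_uses_failfast(const struct model_rdev *devs, size_t n,
                               size_t chosen)
{
    assert(chosen < n);
    return r1bio_may_failfast(devs, n) && devs[chosen].failfast;
}
```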
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: Use REQ_FAILFAST_* on metadata writes where appropriate · 46533ff7
      NeilBrown authored
      This can only be supported on personalities which ensure
      that md_error() never causes an array to enter the 'failed'
      state.  i.e. if marking a device Faulty would cause some
      data to be inaccessible, the device's status is left as
      non-Faulty.  This is true for RAID1 and RAID10.
      
      If we get a failure writing metadata but the device doesn't
      fail, it must be the last device so we re-write without
      FAILFAST to improve chance of success.  We also flag the
      device as LastDev so that future metadata updates don't
      waste time on failfast writes.
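
      A minimal sketch of the retry rule described above, written as a
      user-space model under stated assumptions.  model_dev, write_fn,
      flaky_link_write(), model_md_error() and write_metadata() are
      hypothetical names for illustration; the real logic lives in md's
      superblock write path and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one array member; not the kernel's struct md_rdev. */
struct model_dev {
    bool failfast;  /* models FailFast */
    bool last_dev;  /* models LastDev */
    bool faulty;
};

typedef bool (*write_fn)(bool use_failfast);

/* Example transport for illustration: only patient (non-failfast) writes succeed. */
static bool flaky_link_write(bool use_failfast)
{
    return !use_failfast;
}

/* Models md_error() on RAID1/RAID10: never fail the last working device. */
static void model_md_error(struct model_dev *dev, unsigned other_working)
{
    if (other_working > 0)
        dev->faulty = true;
}

/* Returns true if the metadata reached the device. */
static bool write_metadata(struct model_dev *dev, unsigned other_working,
                           write_fn do_write)
{
    bool use_failfast;

    assert(dev && do_write);
    use_failfast = dev->failfast && !dev->last_dev;
    if (do_write(use_failfast))
        return true;

    model_md_error(dev, other_working);
    if (dev->faulty || !use_failfast)
        return false;

    /* Not failed despite the error: this is the last device, so retry patiently. */
    dev->last_dev = true;
    return do_write(false);
}
```

      Once last_dev is set in this model, later calls skip the failfast
      attempt entirely, which mirrors the stated purpose of LastDev.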
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/failfast: add failfast flag for md to be used by some personalities. · 688834e6
      NeilBrown authored
      This patch just adds a 'failfast' per-device flag which can be stored
      in v0.90 or v1.x metadata.
      The flag is not used yet but the intent is that it can be used for
      mirrored (raid1/raid10) arrays where low latency is more important
      than keeping all devices on-line.
      
      Setting the flag for a device effectively gives permission for that
      device to be marked as Faulty and excluded from the array on the first
      error.  The underlying driver will be directed not to retry requests
      that result in failures.  There is a proviso that the device must not
      be marked Faulty if that would cause the array as a whole to fail; it
      may only be marked Faulty if the array remains functional, but is
      degraded.
      
      Failures on read requests will cause the device to be marked
      as Faulty immediately so that further reads will avoid that
      device.  No attempt will be made to correct read errors by
      over-writing with the correct data.
      
      It is expected that if transient errors, such as cable unplug, are
      possible, then something in user-space will revalidate failed
      devices and re-add them when they appear to be working again.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  2. 19 Nov, 2016 1 commit
    • md/r5cache: handle FLUSH and FUA · 3bddb7f8
      Song Liu authored
      With raid5 cache, we commit data from the journal device. When
      there is a flush request, we need to flush the journal device's cache.
      This was not needed in raid5 journal, because we flush the
      journal before committing data to raid disks.
      
      This is similar to FUA, except that we also need to flush the journal
      for FUA. Otherwise, corruption in earlier metadata will stop recovery
      from reaching the FUA data.
      
      The code was slightly changed by Shaohua.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  3. 18 Nov, 2016 13 commits
    • md/r5cache: r5cache recovery: part 2 · 5aabf7c4
      Song Liu authored
      1. In the previous patch, we:
            - added new data to r5l_recovery_ctx
            - added new functions to recover the write-back cache
         The new functions were not used in that patch, so it did not
         change the behavior of recovery.
      
      2. In this patch, we:
            - modify main recovery procedure r5l_recovery_log() to call the new
              functions
            - remove the old functions
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: r5cache recovery: part 1 · b4c625c6
      Song Liu authored
      Recovery of write-back cache has different logic from write-through only
      cache. Specifically, for write-back cache, the recovery needs to scan
      through all active journal entries before flushing data out. Therefore,
      a large portion of the recovery logic is rewritten here.
      
      To make the diffs cleaner, we split the rewrite as follows:
      
      1. In this patch, we:
            - add new data to r5l_recovery_ctx
            - add new functions to recover write-back cache
         The new functions are not used in this patch, so this patch does not
         change the behavior of recovery.
      
      2. In next patch, we:
            - modify main recovery procedure r5l_recovery_log() to call new
              functions
            - remove old functions
      
      With cache feature, there are 2 different scenarios of recovery:
      1. Data-Parity stripe: a stripe with complete parity in journal.
      2. Data-Only stripe: a stripe with only data in journal (or partial
         parity).
      
      The code differentiates Data-Parity stripes from Data-Only stripes with
      the flag STRIPE_R5C_CACHING.
      
      For Data-Parity stripes, we use the same procedure as raid5 journal,
      where all the data and parity are replayed to the RAID devices.
      
      For Data-Only stripes, we need to finish calculating parity and
      finish the full reconstruct write or RMW write. For simplicity, in
      the recovery, we load the stripe to stripe cache. Once the array is
      started, the stripe cache state machine will handle these stripes
      through the normal write path.
      
      r5c_recovery_flush_log contains the main procedure of recovery. The
      recovery code first scans through the journal and loads data to
      stripe cache. The code keeps track of all these stripes in a list
      (using sh->lru and ctx->cached_list); stripes in the list are
      ordered by their first appearance on the journal.
      During the scan, the recovery code assesses each stripe as
      Data-Parity or Data-Only.
      
      During scan, the array may run out of stripe cache. In these cases,
      the recovery code will also call raid5_set_cache_size to increase
      stripe cache size. If the array still runs out of stripe cache
      because there isn't enough memory, the array will not assemble.
      
      At the end of scan, the recovery code replays all Data-Parity
      stripes, and sets proper states for Data-Only stripes. The recovery
      code also increases seq number by 10 and rewrites all Data-Only
      stripes to journal. This is to avoid confusion after repeated
      crashes. More details are explained in raid5-cache.c before
      r5c_recovery_rewrite_data_only_stripes().
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: refactoring journal recovery code · 9ed988f5
      Song Liu authored
      1. rename r5l_read_meta_block() as r5l_recovery_read_meta_block();
      2. pull the code that initializes r5l_meta_block from
         r5l_log_write_empty_meta_block() into a separate function
         r5l_recovery_create_empty_meta_block(), so that we can reuse this
         piece of code.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: sysfs entry journal_mode · 2c7da14b
      Song Liu authored
      With write cache, journal_mode is the knob to switch between
      write-back and write-through.
      
      Below is an example:
      
      root@virt-test:~/# cat /sys/block/md0/md/journal_mode
      [write-through] write-back
      root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode
      root@virt-test:~/# cat /sys/block/md0/md/journal_mode
      write-through [write-back]
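
      In the transcript above the active mode is the one shown in square
      brackets.  A hedged sketch of how a tool might extract it from one
      line of that output; active_journal_mode() is a hypothetical helper,
      not part of the kernel or mdadm.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) mode from a journal_mode line into out.
 * Returns 0 on success, -1 if no well-formed "[mode]" fits in out.
 */
static int active_journal_mode(const char *line, char *out, size_t outlen)
{
    const char *open;
    const char *close;
    size_t len;

    assert(line && out);
    open = strchr(line, '[');
    close = open ? strchr(open, ']') : NULL;
    if (!open || !close)
        return -1;

    len = (size_t)(close - open - 1);
    if (len == 0 || len >= outlen)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}
```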
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: write-out phase and reclaim support · a39f7afd
      Song Liu authored
      There are two limited resources, stripe cache and journal disk space.
      For better performance, we prioritize reclaim of full stripe writes.
      To free up more journal space, we free earliest data on the journal.
      
      In the current implementation, reclaim happens:
      1. periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds), if
         there was no reclaim in the past 5 seconds
      2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
         or the cached stripes are enough for a full stripe (chunk size / 4k)
         (r5c_check_cached_full_stripe)
      3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
      4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
      
      r5c_do_reclaim() contains new logic of reclaim.
      
      For stripe cache:
      
      When stripe cache pressure is high (more than 3/4 of stripes are cached,
      or there are empty inactive lists), flush all full stripes. If fewer
      than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
      are flushed, flush some partial stripes. When stripe cache pressure
      is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
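
      The stripe cache thresholds above can be sketched as follows.  This
      is a simplified model, not r5c_do_reclaim(): the function names are
      hypothetical, treating exactly 1/2 as moderate is an assumption, and
      so is topping the batch up to the group size (the text only says
      "some" partial stripes).

```c
#include <assert.h>
#include <stdbool.h>

enum cache_pressure { PRESSURE_LOW, PRESSURE_MODERATE, PRESSURE_HIGH };

/* Classify stripe cache pressure from the cached/total stripe counts. */
static enum cache_pressure classify_pressure(unsigned cached, unsigned total,
                                             bool has_empty_inactive_list)
{
    assert(total > 0 && cached <= total);
    if (has_empty_inactive_list || 4u * cached > 3u * total)
        return PRESSURE_HIGH;
    if (2u * cached >= total)   /* assumption: exactly 1/2 counts as moderate */
        return PRESSURE_MODERATE;
    return PRESSURE_LOW;
}

/*
 * How many partial stripes to flush after all full stripes were flushed.
 * Assumption: top the batch up to `group` stripes.
 */
static unsigned partial_stripes_to_flush(enum cache_pressure p,
                                         unsigned full_flushed, unsigned group)
{
    if (p != PRESSURE_HIGH || full_flushed >= group)
        return 0;
    return group - full_flushed;
}
```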
      
      For log space:
      
      To avoid deadlock due to log space, we need to reserve enough space
      to flush cached data. The size of required log space depends on total
      number of cached stripes (stripe_in_journal_count). In current
      implementation, the writing-out phase automatically includes pending
      data writes with parity writes (similar to write through case).
      Therefore, we need up to (conf->raid_disks + 1) pages for each cached
      stripe (1 page for meta data, raid_disks pages for all data and
      parity). r5c_log_required_to_flush_cache() calculates log space
      required to flush cache. In the following, we refer to the space
      calculated by r5c_log_required_to_flush_cache() as
      reclaim_required_space.
      
      Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
      R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
      device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
      is set when free space on the log device is less than 2x of
      reclaim_required_space.
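
      The arithmetic above can be sketched as follows.  This is a model
      counted in pages for readability, not the body of
      r5c_log_required_to_flush_cache(); log_pages_required(),
      classify_log() and struct log_state are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pages of log space needed to flush every cached stripe:
 * 1 meta data page plus raid_disks data/parity pages per stripe. */
static uint64_t log_pages_required(uint64_t cached_stripes, unsigned raid_disks)
{
    assert(raid_disks > 0);
    return cached_stripes * ((uint64_t)raid_disks + 1);
}

struct log_state {
    bool tight;     /* models R5C_LOG_TIGHT */
    bool critical;  /* models R5C_LOG_CRITICAL */
};

/* Compare free log space against 3x and 2x of the required space. */
static struct log_state classify_log(uint64_t free_pages, uint64_t required)
{
    struct log_state s;

    s.tight = free_pages < 3 * required;
    s.critical = free_pages < 2 * required;
    return s;
}
```

      With 10 cached stripes on a 4-disk array the model needs 50 pages,
      so the log becomes tight below 150 free pages and critical below 100.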
      
      r5c_cache keeps all data in cache (not fully committed to RAID) in
      a list (stripe_in_journal_list). These stripes are in the order of their
      first appearance on the journal. So the log tail (last_checkpoint)
      should point to the journal_start of the first item in the list.
      
      When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
      stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
      set, the state machine only writes data that are already in the
      log device (in stripe_in_journal_list).
      
      This patch includes a fix to improve performance by
      Shaohua Li <shli@fb.com>.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: caching phase of r5cache · 1e6d690b
      Song Liu authored
      As described in previous patch, write back cache operates in two
      phases: caching and writing-out. The caching phase works as:
      1. write data to journal
         (r5c_handle_stripe_dirtying, r5c_cache_data)
      2. call bio_endio
         (r5c_handle_data_cached, r5c_return_dev_pending_writes).
      
      Then the writing-out phase is as:
      1. Mark the stripe as write-out (r5c_make_stripe_write_out)
      2. Calculate parity (reconstruct or RMW)
      3. Write parity (and maybe some other data) to journal device
      4. Write data and parity to RAID disks
      
      This patch implements caching phase. The cache is integrated with
      stripe cache of raid456. It leverages code of r5l_log to write
      data to journal device.
      
      Writing-out phase of the cache is implemented in the next patch.
      
      With r5cache, write operation does not wait for parity calculation
      and write out, so the write latency is lower (1 write to journal
      device vs. read and then write to raid disks). Also, r5cache will
      reduce RAID overhead (multiple IO due to read-modify-write of
      parity) and provide more opportunities for full stripe writes.
      
      This patch adds 2 flags to stripe_head.state:
       - STRIPE_R5C_PARTIAL_STRIPE,
       - STRIPE_R5C_FULL_STRIPE,
      
      Instead of inactive_list, stripes with cached data are tracked in
      r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
      STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
      stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
      are not considered as "active".
      
      For RMW, the code allocates an extra page for each data block
      being updated.  This is stored in r5dev->orig_page and the old data
      is read into it.  Then the prexor calculation subtracts ->orig_page
      from the parity block, and the reconstruct calculation adds the
      ->page data back into the parity block.
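
      For XOR parity, "subtract" and "add" are both XOR.  A minimal sketch
      of the RMW update described above, using plain byte buffers in place
      of pages; xor_into() and rmw_update_parity() are hypothetical names,
      and the kernel performs these steps through its async_tx helpers
      rather than a loop like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* dst ^= src, byte by byte. */
static void xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] ^= src[i];
}

/*
 * Read-modify-write parity update for one data block:
 *   prexor:      parity ^= orig_page  (remove the old data)
 *   reconstruct: parity ^= page       (add the new data)
 */
static void rmw_update_parity(uint8_t *parity, const uint8_t *orig_page,
                              const uint8_t *page, size_t len)
{
    assert(parity && orig_page && page);
    xor_into(parity, orig_page, len);
    xor_into(parity, page, len);
}
```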
      
      r5cache naturally excludes SkipCopy. When the array has write back
      cache, async_copy_data() will not skip copy.
      
      There are some known limitations of the cache implementation:
      
      1. Write cache only covers full page writes (R5_OVERWRITE). Writes
         of smaller granularity are write through.
      2. Only one log io (sh->log_io) for each stripe at any time. Later
         writes for the same stripe have to wait. This can be improved by
         moving log_io to r5dev.
      3. With writeback cache, read path must enter state machine, which
         is a significant bottleneck for some workloads.
      4. There is no per stripe checkpoint (with r5l_payload_flush) in
         the log, so recovery code has to replay more data than necessary
         (sometimes all the log from last_checkpoint). This reduces
         availability of the array.
      
      This patch includes a fix proposed by ZhengYuan Liu
      <liuzhengyuan@kylinos.cn>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: State machine for raid5-cache write back mode · 2ded3703
      Song Liu authored
      This patch adds a state machine for raid5-cache. With a log device, the
      raid456 array could operate in two different modes (r5c_journal_mode):
        - write-back (R5C_MODE_WRITE_BACK)
        - write-through (R5C_MODE_WRITE_THROUGH)
      
      Existing code of raid5-cache only has write-through mode. For write-back
      cache, it is necessary to extend the state machine.
      
      With write-back cache, every stripe could operate in two different
      phases:
        - caching
        - writing-out
      
      In caching phase, the stripe handles writes as:
        - write to journal
        - return IO
      
      In writing-out phase, the stripe behaves as a stripe in write-through
      mode (R5C_MODE_WRITE_THROUGH).
      
      STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
      writing-out phase.
      
      Please note: this is a "no-op" patch for raid5-cache write-through
      mode.
      
      The following detailed explanation is copied from the raid5-cache.c:
      
      /*
       * raid5 cache state machine
       *
       * With the RAID cache, each stripe works in two phases:
       *      - caching phase
       *      - writing-out phase
       *
       * These two phases are controlled by bit STRIPE_R5C_CACHING:
       *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
       *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
       *
       * When there is no journal, or the journal is in write-through mode,
       * the stripe is always in writing-out phase.
       *
       * For write-back journal, the stripe is sent to caching phase on write
       * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
       * the write-out phase by clearing STRIPE_R5C_CACHING.
       *
       * Stripes in caching phase do not write the raid disks. Instead, all
       * writes are committed from the log device. Therefore, a stripe in
       * caching phase handles writes as:
       *      - write to log device
       *      - return IO
       *
       * Stripes in writing-out phase handle writes as:
       *      - calculate parity
       *      - write pending data and parity to journal
       *      - write data and parity to raid disks
       *      - return IO for pending writes
       */
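
      The phase bit described in the comment above can be modelled in a
      few lines.  This is an illustrative sketch, not raid5-cache.c: the
      bit number, struct model_stripe and the model_* functions are
      hypothetical stand-ins for STRIPE_R5C_CACHING, struct stripe_head,
      r5c_handle_stripe_dirtying() and r5c_make_stripe_write_out().

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_CACHING_BIT 0   /* hypothetical bit number for STRIPE_R5C_CACHING */

enum model_journal { NO_JOURNAL, JOURNAL_WRITE_THROUGH, JOURNAL_WRITE_BACK };

struct model_stripe {
    unsigned long state;
};

/* Bit set: caching phase.  Bit clear: writing-out phase. */
static bool in_caching_phase(const struct model_stripe *sh)
{
    return (sh->state >> MODEL_CACHING_BIT) & 1UL;
}

/* On a write, only a write-back journal sends the stripe to the caching phase. */
static void model_handle_dirtying(struct model_stripe *sh, enum model_journal j)
{
    assert(sh);
    if (j == JOURNAL_WRITE_BACK)
        sh->state |= 1UL << MODEL_CACHING_BIT;
}

/* Kick off the writing-out phase by clearing the caching bit. */
static void model_make_stripe_write_out(struct model_stripe *sh)
{
    assert(sh);
    sh->state &= ~(1UL << MODEL_CACHING_BIT);
}
```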
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: move some code to raid5.h · 937621c3
      Song Liu authored
      Move some defines and inline functions to raid5.h, so they can be
      used in raid5-cache.c
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: Check array size in r5l_init_log · c757ec95
      Song Liu authored
      Currently, r5l_write_stripe checks meta size for each stripe write,
      which is not necessary.
      
      With this patch, r5l_init_log checks maximal meta size of the array,
      which is (r5l_meta_block + raid_disks x r5l_payload_data_parity).
      If this is too big to fit in one page, r5l_init_log aborts.
      
      With current meta data, r5l_log supports raid_disks up to 203.
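
      A sketch of where the 203 limit could come from.  The sizes below
      are assumptions, not taken from this message: a 4096-byte page, a
      32-byte meta block header, and 16 bytes of payload header plus one
      4-byte checksum per disk.  Under those assumptions the arithmetic
      reproduces the stated limit.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE        4096u  /* assumed page size */
#define MODEL_META_BLOCK_SIZE    32u  /* assumed sizeof(struct r5l_meta_block) */
#define MODEL_PAYLOAD_SIZE       16u  /* assumed sizeof(struct r5l_payload_data_parity) */
#define MODEL_CHECKSUM_SIZE       4u  /* assumed: one 32-bit checksum per payload */

static_assert(MODEL_PAGE_SIZE > MODEL_META_BLOCK_SIZE,
              "the meta block header must fit in a page");

/* Largest meta data a single stripe write can need for this array. */
static size_t max_meta_size(unsigned raid_disks)
{
    return MODEL_META_BLOCK_SIZE +
           (size_t)raid_disks * (MODEL_PAYLOAD_SIZE + MODEL_CHECKSUM_SIZE);
}

/* Non-zero if the maximal meta data fits in one page. */
static int meta_fits_in_page(unsigned raid_disks)
{
    return max_meta_size(raid_disks) <= MODEL_PAGE_SIZE;
}

/* Largest raid_disks for which the check above passes. */
static unsigned max_raid_disks(void)
{
    return (MODEL_PAGE_SIZE - MODEL_META_BLOCK_SIZE) /
           (MODEL_PAYLOAD_SIZE + MODEL_CHECKSUM_SIZE);
}
```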
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: add blktrace event for writes to superblock · 504634f6
      Shaohua Li authored
      Superblock write is an expensive operation. With raid5-cache, it can be called
      regularly. Add tracing to help performance debugging.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: NeilBrown <neilb@suse.com>
    • md/raid1, raid10: add blktrace records when IO is delayed · 578b54ad
      NeilBrown authored
      Both raid1 and raid10 will sometimes delay handling an IO request,
      such as when resync is happening or there are too many requests queued.
      
      Add some blktrace messages so we can see when that is happening when
      looking for performance artefacts.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/bitmap: add blktrace event for writes to the bitmap · 581dbd94
      NeilBrown authored
      We trace whenever bitmap_unplug() finds that it needs to write
      to the bitmap, or when bitmap_daemon_work() finds there is work
      to do.
      
      This makes it easier to correlate bitmap updates with data writes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: add block tracing for bio_remapping · 109e3765
      NeilBrown authored
      The block tracing infrastructure (accessed with blktrace/blkparse)
      supports the tracing of mapping bios from one device to another.
      This is currently used when a bio in a partition is mapped to the
      whole device, when bios are mapped by dm, and for mapping in md/raid5.
      Other md personalities do not include this tracing yet, so add it.
      
      When a read-error is detected we redirect the request to a different device.
      This could justifiably be seen as a new mapping for the original bio,
      or a secondary mapping for the bio that errors.  This patch uses
      the second option.
      
      When md is used under dm-raid, the mappings are not traced as we do
      not have access to the block device number of the parent.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  4. 17 Nov, 2016 1 commit
  5. 10 Nov, 2016 1 commit
  6. 09 Nov, 2016 2 commits
    • md: define mddev flags, recovery flags and r1bio state bits using enums · be306c29
      NeilBrown authored
      This is less error prone than using individual #defines.
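
      A small illustration of why an enum helps here: the compiler assigns
      consecutive bit numbers, so two flags cannot accidentally share a
      value as hand-numbered #defines can.  The enumerator and function
      names below are hypothetical, not the kernel's actual bit names.

```c
#include <assert.h>

/* Hypothetical state bits; the enum assigns 0, 1, 2, ... automatically. */
enum model_r1bio_state {
    MODEL_R1BIO_UPTODATE,
    MODEL_R1BIO_IS_SYNC,
    MODEL_R1BIO_FAILFAST,
    MODEL_R1BIO_STATE_COUNT   /* one past the last bit */
};

/* Return 1 if the given bit is set in state, else 0. */
static int test_state_bit(unsigned long state, enum model_r1bio_state bit)
{
    assert(bit < MODEL_R1BIO_STATE_COUNT);
    return (int)((state >> bit) & 1UL);
}

/* Return state with the given bit set. */
static unsigned long set_state_bit(unsigned long state, enum model_r1bio_state bit)
{
    assert(bit < MODEL_R1BIO_STATE_COUNT);
    return state | (1UL << bit);
}
```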
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid1: fix: IO can block resync indefinitely · f2c771a6
      NeilBrown authored
      While performing a resync/recovery, raid1 divides the
      array space into three regions:
       - before the resync
       - at or shortly after the resync point
       - much further ahead of the resync point.
      
      Write requests to the first or third do not need to wait.  Write
      requests to the middle region do need to wait if resync requests are
      pending.
      
      If there are any active write requests in the middle region, resync
      will wait for them.
      
      Due to an accounting error, there is a small range of addresses,
      between conf->next_resync and conf->start_next_window, where write
      requests will *not* be blocked, but *will* be counted in the middle
      region.  This can effectively block resync indefinitely if filesystem
      writes happen repeatedly to this region.
      
      As ->next_window_requests is incremented when the sector is after
        conf->start_next_window + NEXT_NORMALIO_DISTANCE
      the same boundary should be used for determining when write requests
      should wait.
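
      The point of the fix is that "counted in the middle region" and
      "must wait" are decided by the same boundary.  A simplified model
      under stated assumptions: struct model_conf, classify_write() and
      write_must_wait() are hypothetical, resync_done stands in for the
      completed-resync position, and whether the upper boundary is
      inclusive is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_conf {
    uint64_t resync_done;        /* sectors below this are already resynced */
    uint64_t start_next_window;  /* models conf->start_next_window */
    uint64_t normalio_distance;  /* stands in for NEXT_NORMALIO_DISTANCE */
};

enum region { REGION_BEHIND, REGION_MIDDLE, REGION_AHEAD };

/* Classify a write covering sectors [start, end). */
static enum region classify_write(const struct model_conf *c,
                                  uint64_t start, uint64_t end)
{
    assert(c && start < end);
    if (end <= c->resync_done)
        return REGION_BEHIND;
    if (start >= c->start_next_window + c->normalio_distance)
        return REGION_AHEAD;
    return REGION_MIDDLE;
}

/* A write waits exactly when it would be counted in the middle region. */
static bool write_must_wait(const struct model_conf *c, uint64_t start,
                            uint64_t end, bool resync_pending)
{
    return resync_pending && classify_write(c, start, end) == REGION_MIDDLE;
}
```

      Because both the accounting and the wait decision go through
      classify_write() in this model, no write can be counted in the
      middle region without also being made to wait.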
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  7. 07 Nov, 2016 19 commits