1. 01 Nov, 2015 9 commits
  2. 24 Oct, 2015 13 commits
    • raid5: log recovery · 355810d1
      Shaohua Li authored
      This is the log recovery support. The process is quite straightforward.
      We scan the log and read all valid meta/data/parity into memory. If a
      stripe's data/parity checksum is correct, the stripe will be recovered.
      Otherwise, it's discarded and we don't scan the log further. The reclaim
      process guarantees that a stripe which has started to be flushed to the
      raid disks has complete data/parity and a correct checksum. To recover a
      stripe, we just copy its data/parity to the corresponding raid disks.
      
      The tricky thing is the superblock update after recovery. We can't let
      the superblock point to the last valid meta block. The log might look like:
      | meta 1| meta 2| meta 3|
      meta 1 is valid, meta 2 is invalid, and meta 3 could be valid. If the
      superblock points to meta 1 and we write a new valid meta 2n, then if a
      crash happens again, the new recovery will start from meta 1. Since meta 2n
      is valid, recovery will think meta 3 is valid, which is wrong. The solution
      is to create a new meta in meta 2 with its seq == meta 1's seq + 10 and let
      the superblock point to meta 2. Recovery will then not think meta 3 is a
      valid meta, because its seq is wrong.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
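The sequence-number argument in the log recovery message above can be sketched as a toy model. This is only an illustration under stated assumptions: the log is treated as a plain (non-circular) list of slots, `Meta`, `scan` and the `valid` flag (standing in for "the checksum verifies") are hypothetical names, and recovery is assumed to accept a meta block only if its seq is exactly one more than the previous one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Meta:
    seq: int
    valid: bool  # hypothetical stand-in for "meta checksum verifies"


def scan(log: List[Meta], start: int) -> List[int]:
    """Return the slot indices a recovery scan would accept, from `start`."""
    accepted = []
    expected = None
    for pos in range(start, len(log)):
        meta = log[pos]
        if not meta.valid:
            break  # broken meta: stop scanning
        if expected is not None and meta.seq != expected:
            break  # seq does not continue the chain: stale block
        accepted.append(pos)
        expected = meta.seq + 1
    return accepted


# After the first crash: meta 1 valid, meta 2 torn, meta 3 stale but intact.
crashed = [Meta(1, True), Meta(2, False), Meta(3, True)]
first_scan = scan(crashed, 0)

# Naive approach: superblock stays at meta 1, a new valid "meta 2n"
# (seq 2) is written, and a second crash happens.
naive = [Meta(1, True), Meta(2, True), Meta(3, True)]
naive_scan = scan(naive, 0)  # stale meta 3 is wrongly accepted

# Fix from the commit message: write a new meta in slot 2 with
# seq == meta 1's seq + 10 and point the superblock at it.
fixed = [Meta(1, True), Meta(1 + 10, True), Meta(3, True)]
fixed_scan = scan(fixed, 1)  # meta 3 is rejected, its seq is wrong
```

In this model the jump in seq is what breaks the chain: any stale block written before the crash carries a seq that can no longer follow the new meta.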
    • raid5: log reclaim support · 0576b1c6
      Shaohua Li authored
      This is the reclaim support for the raid5 log. A stripe write has the
      following steps:
      
      1. reconstruct the stripe, read data/calculate parity. ops_run_io
      prepares to write data/parity to raid disks
      2. hijack ops_run_io. stripe data/parity is appended to the log disk
      3. flush log disk cache
      4. ops_run_io runs again and does the normal operation. stripe data/parity
      is written to the raid array disks. raid core can return io to upper layer.
      5. flush cache of all raid array disks
      6. update super block
      7. log disk space used by the stripe can be reused
      
      In practice, several stripes make up an io_unit and we batch several
      io_units in different steps, but the whole process doesn't change.
      
      It would be possible to return the IO just after data/parity hits the log
      disk, but then read IO would need to read from the log disk. For
      simplicity, IO return happens at step 4, where read IO can read directly
      from the raid disks.
      
      Currently reclaim runs if there is a specific amount of reclaimable space
      (1/4 disk size or 10G) or we are out of space. Reclaim only frees log disk
      space; it doesn't impact data consistency. The size-based forced reclaim
      makes sure the log isn't too big, so recovery doesn't have to scan too
      much of the log.
      
      Recovery makes sure the raid disks and the log disk have the same data for
      a stripe. If a crash happens before step 4, recovery may or may not recover
      the stripe's data/parity, depending on whether the data/parity matches its
      checksum. In either case, this doesn't change the semantics of an IO write.
      After step 3, the stripe is guaranteed to be recoverable, because the
      stripe's data/parity is persistent on the log disk. In some cases, the log
      disk content and the raid disk content of a stripe are the same, but
      recovery will still copy the log disk content to the raid disks; this
      doesn't impact data consistency. Space reuse happens after the superblock
      update and cache flush.
      
      There is one situation we want to avoid. A broken meta in the middle of
      a log means recovery can't find the meta at the head of the log. If
      operations require the meta at the head to be persistent in the log, we
      must make sure the metas before it are persistent in the log too. The case
      in question is when stripe data/parity is in the log and we start writing
      the stripe to the raid disks (before step 4). Stripe data/parity must be
      persistent in the log before we do the write to the raid disks. The
      solution is to strictly maintain the io_unit list order. In this case, we
      only write the stripes of an io_unit to the raid disks once that io_unit is
      the first one in the list whose data/parity is in the log.
      
      The io_unit list order is important for other cases too. For example,
      some io_units are reclaimable and others are not. They can be mixed in the
      list, and we shouldn't reuse the space of an unreclaimable io_unit.
      
      Includes fixes to problems which were...
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
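Two points from the reclaim message above lend themselves to a small sketch: the trigger condition and the rule that log space is only reused in io_unit list order. The names `should_reclaim`, `IoUnit` and `reclaim_prefix` are hypothetical, and reading "1/4 disk size or 10G" as "whichever is smaller" is an assumption, not something the message states.

```python
from dataclasses import dataclass
from typing import List

GiB = 1 << 30


def should_reclaim(reclaimable: int, log_size: int, out_of_space: bool) -> bool:
    """Decide whether to start reclaim (all sizes in bytes)."""
    # Assumption: the threshold is the smaller of 1/4 of the log and 10 GiB.
    target = min(log_size // 4, 10 * GiB)
    return out_of_space or reclaimable >= target


@dataclass
class IoUnit:
    name: str
    size: int
    reclaimable: bool


def reclaim_prefix(io_units: List[IoUnit]) -> int:
    """Free io_units from the head of the list, stopping at the first one
    that is not reclaimable, and return the number of bytes freed."""
    freed = 0
    while io_units and io_units[0].reclaimable:
        freed += io_units.pop(0).size
    return freed


units = [IoUnit("a", 4096, True), IoUnit("b", 4096, False), IoUnit("c", 4096, True)]
freed = reclaim_prefix(units)
remaining = [u.name for u in units]
```

Here io_unit "c" is reclaimable but sits behind the unreclaimable "b", so its space is not reused yet; only the head-of-list prefix is freed.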
    • raid5: add basic stripe log · f6bed0ef
      Shaohua Li authored
      This introduces a simple log for raid5. Data/parity written to the raid
      array is first written to the log, then written to the raid array disks.
      If a crash happens, we can recover data from the log. This can speed up
      raid resync and fix the write hole issue.
      
      The log structure is pretty simple. Data/meta data is stored in block
      units, which are generally 4k. There is only one type of meta data block.
      The meta data block can track 3 types of data: stripe data, stripe
      parity and flush block. The MD superblock will point to the last valid
      meta data block. Each meta data block has a checksum/seq number, so
      recovery can scan the log correctly. We store a checksum of stripe
      data/parity in the meta data block, so meta data and stripe data/parity
      can be written to the log disk together. Otherwise, the meta data write
      would have to wait until the stripe data/parity write is finished.
      
      For stripe data, the meta data block records the stripe data sector and
      size. Currently the size is always 4k. This meta data record could be made
      simpler if we only fixed the write hole (e.g., we could record data of a
      stripe's different disks together), but this format can be extended to
      support caching in the future, which must record data address/size.
      
      For stripe parity, the meta data block records the stripe sector. Its
      size should be 4k (for raid5) or 8k (for raid6). We always store p
      parity first. This format should work for caching too.
      
      A flush block indicates that a stripe is on the raid array disks. Fixing
      the write hole doesn't need this type of meta data; it's for the caching
      extension.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
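The idea of storing payload checksums inside the meta data block, so that meta and payload can be written together and verified later, can be sketched as follows. This is a simplified model, not the on-disk format: `Payload`, `MetaBlock`, `add_data`, `add_parity` and `payload_intact` are hypothetical names, and `zlib.crc32` is used only as a readily available stand-in for whatever checksum the log actually uses.

```python
import zlib
from dataclasses import dataclass, field
from typing import List, Optional

BLOCK = 4096  # block unit, "4k generally" per the commit message


@dataclass
class Payload:
    kind: str             # "data", "parity" or "flush"
    sector: int           # data sector, or stripe sector for parity
    checksums: List[int]  # one checksum per 4k block of payload


@dataclass
class MetaBlock:
    seq: int
    payloads: List[Payload] = field(default_factory=list)


def add_data(meta: MetaBlock, sector: int, block: bytes) -> None:
    """Record one 4k data block together with its checksum."""
    assert len(block) == BLOCK
    meta.payloads.append(Payload("data", sector, [zlib.crc32(block)]))


def add_parity(meta: MetaBlock, stripe_sector: int, p: bytes,
               q: Optional[bytes] = None) -> None:
    """Record parity: 4k (P only, raid5) or 8k (P then Q, raid6)."""
    blocks = [p] if q is None else [p, q]  # P parity is always stored first
    meta.payloads.append(
        Payload("parity", stripe_sector, [zlib.crc32(b) for b in blocks]))


def payload_intact(payload: Payload, blocks: List[bytes]) -> bool:
    """Recovery-side check: do the blocks read from the log match the meta?"""
    return [zlib.crc32(b) for b in blocks] == payload.checksums


meta = MetaBlock(seq=1)
data = b"\x01" * BLOCK
add_data(meta, sector=8, block=data)
add_parity(meta, stripe_sector=0, p=b"\x02" * BLOCK, q=b"\x03" * BLOCK)

torn = bytearray(data)
torn[0] ^= 0xFF  # simulate a payload block that never fully reached the log
```

Because the checksum travels in the meta block, a scan can tell a fully written payload from a torn one without the meta write having waited for the payload write to finish.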
    • raid5: add a new state for stripe log handling · b70abcb2
      Shaohua Li authored
      When a stripe finishes construction, we normally write the stripe to raid
      in ops_run_io. With the log, we do a bunch of other operations before
      the stripe is written to raid, mainly writing the stripe to the log disk,
      flushing the disk cache and so on. The operations are still driven by
      raid5d and run in the stripe state machine. We introduce a new state for
      such a stripe (trapped into log). The stripe is in this state from the
      time it first enters ops_run_io (finishes construction) to the time it is
      written to raid. Since we know the state is only for the log, we bypass
      other checks/operations in handle_stripe.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • raid5: export some functions · 6d036f7d
      Shaohua Li authored
      The next several patches use some raid5 functions, so rename them with a
      raid5 prefix and export them.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md: override md superblock recovery_offset for journal device · 3069aa8d
      Shaohua Li authored
      The journal device stores data in a log structure. We need to record the
      log start. Here we override the md superblock recovery_offset for this
      purpose. This field is otherwise meaningless for a journal device.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • MD: add a new disk role to present write journal device · bac624f3
      Song Liu authored
      The next patches will use a disk as a raid5/6 journal. We need a new disk
      role to represent the journal device, and we add MD_FEATURE_JOURNAL to
      feature_map for backward compatibility.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • MD: replace special disk roles with macros · c4d4c91b
      Song Liu authored
      Add the following two macros for special roles: spare and faulty
      
      MD_DISK_ROLE_SPARE	0xffff
      MD_DISK_ROLE_FAULTY	0xfffe
      
      Add MD_DISK_ROLE_MAX	0xff00 as the maximal possible regular role,
      and minimal value of special role.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
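As an illustration of how the role values above partition the 16-bit role space, here is a small classifier. The three constants are taken from the commit message; `describe_role` is a hypothetical helper, and treating values up to and including MD_DISK_ROLE_MAX as regular roles is an assumption about the boundary, which the message leaves ambiguous.

```python
MD_DISK_ROLE_SPARE = 0xFFFF
MD_DISK_ROLE_FAULTY = 0xFFFE
MD_DISK_ROLE_MAX = 0xFF00


def describe_role(role: int) -> str:
    """Classify a 16-bit disk role value (hypothetical helper)."""
    if role == MD_DISK_ROLE_SPARE:
        return "spare"
    if role == MD_DISK_ROLE_FAULTY:
        return "faulty"
    # Assumption: roles up to and including MD_DISK_ROLE_MAX are regular.
    if role <= MD_DISK_ROLE_MAX:
        return f"regular role {role}"
    return "other special role"
```

Naming the values leaves the range between MD_DISK_ROLE_MAX and the two existing special roles available for further special roles.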
    • md-cluster: Call update_raid_disks() if another node --grow's raid_disks · 28c1b9fd
      Goldwyn Rodrigues authored
      To incorporate a --grow executed on one node, the other nodes need to
      acknowledge the change in the number of disks. Call update_raid_disks()
      to update the internal data structures.
      
      This leads to the call chain check_reshape() -> md_allow_write() ->
      md_update_sb(), which results in a deadlock. The md_allow_write() call is
      there so that memory can be allocated safely (allocation might trigger
      writeback, which might write to raid1). This is not required for md with
      a bitmap.
      
      In the clustered case, we don't perform md_update_sb() in md_allow_write(),
      but in do_md_run(). Also we disable safemode for clustered mode.
      
      mddev->recovery_cp need not be set in check_sb_changes() because this
      is required only when a node reads another node's bitmap. mddev->recovery_cp
      (which is read from sb->resync_offset), is set only if mddev is in_sync.
      Since we disabled safemode, in_sync is set to zero.
      In a clustered environment, the MD may not be in sync because another
      node could be writing to it. So make sure that in_sync is not set in
      case of clustered node in __md_stop_writes().
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md-cluster: remove mddev arg from add_resync_info() · 30661b49
      NeilBrown authored
      The arg isn't used, so its presence is only confusing.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md-cluster: don't cast void pointers when assigning them. · 2e2a7cd9
      NeilBrown authored
      It is common practice in the kernel to leave out this cast.
      It isn't needed and adds little if any value.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md-cluster: discard unused sb_mutex. · 82381523
      NeilBrown authored
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md-cluster: Fix warnings when build with CF=-D__CHECK_ENDIAN__ · cf97a348
      Guoqing Jiang authored
      This patch fixes sparse warnings such as "incorrect type in assignment
      (different base types)" and "cast to restricted __le64".
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
  3. 16 Oct, 2015 1 commit
  4. 13 Oct, 2015 1 commit
    • Merge branch 'md-next' of git://github.com/goldwynr/linux into for-next · c2a06c38
      NeilBrown authored
      md-cluster: A better way for METADATA_UPDATED processing
      
      The processing of METADATA_UPDATED message is too simple and prone to
      errors. Besides, it would not update the internal data structures as
      required.
      
      This set of patches reads the superblock from one of the devices of the MD
      and checks for changes against the in-memory data structures. If there is a
      change, it performs the necessary actions to keep the internal data
      structures as they would be on the primary node.
      
      An example is when a device turns faulty. The algorithm is:
      
      1. The initiator node marks the device as faulty and updates the superblock
      2. The initiator node sends METADATA_UPDATED with an advisory device number to the rest of the nodes.
      3. The receiving node on receiving the METADATA_UPDATED message
        3.1 Reads the superblock
        3.2 Detects a device has failed by comparing with memory structure
        3.3 Calls the necessary functions to record the failure and get the device out of the active array.
        3.4 Acknowledges the message.
      
      The patch series also fixes adding a disk, which was affected by these
      changes.
      
      Patches can also be found at
      https://github.com/goldwynr/linux branch md-next
      
      Changes since V2:
       - Fix status synchronization after --add and --re-add operations
       - Included Guoqing's patches on endian correctness, zeroing cmsg etc
       - Restructure add_new_disk() and cancel()
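The receiving-node side of the faulty-device example above (steps 3.1 to 3.4) can be sketched as a comparison between the on-disk superblock and the in-memory state. Everything here is a simplified model: device state is reduced to a string per device number, and `newly_failed`, `handle_metadata_updated` and `read_superblock` are hypothetical names, not md-cluster functions.

```python
from typing import Callable, Dict, List


def newly_failed(in_memory: Dict[int, str], on_disk: Dict[int, str]) -> List[int]:
    """Devices marked faulty on disk that memory does not yet know about."""
    return sorted(dev for dev, state in on_disk.items()
                  if state == "faulty" and in_memory.get(dev) != "faulty")


def handle_metadata_updated(in_memory: Dict[int, str],
                            read_superblock: Callable[[], Dict[int, str]]) -> dict:
    on_disk = read_superblock()                # 3.1 read the superblock
    failed = newly_failed(in_memory, on_disk)  # 3.2 compare with memory
    for dev in failed:                         # 3.3 record the failure
        in_memory[dev] = "faulty"
    return {"ack": True, "failed": failed}     # 3.4 acknowledge the message


# The initiator has marked device 1 faulty and updated the superblock.
memory_state = {0: "active", 1: "active"}
reply = handle_metadata_updated(memory_state, lambda: {0: "active", 1: "faulty"})
```

Deriving the change from the superblock rather than from the message payload is why the device number in the message only needs to be advisory.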
  5. 12 Oct, 2015 16 commits