1. 22 Dec, 2011 26 commits
    • md/raid10: preferentially read from replacement device if possible. · abbf098e
      NeilBrown authored
      When reading (for array reads, not for recovery etc) we read from the
      replacement device if it has recovered far enough.
      This requires storing the chosen rdev in the 'r10_bio' so we can make
      sure to drop the ref on the right device when the read finishes.
      Signed-off-by: NeilBrown <neilb@suse.de>
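
A minimal sketch of the selection rule described in this commit, assuming simplified, hypothetical types (`struct dev`, `struct slot`, `choose_read_dev` are illustrative names, not the kernel's). It prefers the replacement only when recovery has already passed the end of the requested range:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's per-device state. */
struct dev {
    int faulty;                /* non-zero once the device has failed */
    uint64_t recovery_offset;  /* sectors below this have been recovered */
};

/* One array slot: the original device plus an optional replacement. */
struct slot {
    struct dev *rdev;
    struct dev *replacement;
};

/*
 * Pick the device to read [sector, sector + nr_sectors) from.
 * Prefer the replacement, but only if recovery has already passed the end
 * of the requested range; otherwise fall back to the original.
 * Returns NULL if neither device is usable.
 */
struct dev *choose_read_dev(const struct slot *s,
                            uint64_t sector, uint64_t nr_sectors)
{
    struct dev *r = s->replacement;

    if (r && !r->faulty && r->recovery_offset >= sector + nr_sectors)
        return r;
    if (s->rdev && !s->rdev->faulty)
        return s->rdev;
    return NULL;
}
```

Returning the chosen device pointer, rather than a slot index, mirrors the point made in the commit message: the caller needs to know exactly which device holds the reference it must drop when the read finishes.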
    • md/raid10: change read_balance to return an rdev · 96c3fd1f
      NeilBrown authored
      It makes more sense to return an rdev than just an index as
      read_balance() gets a reference to the rdev and so returning
      the pointer makes this more idiomatic.
      
      This will be needed in a future patch when we might return
      a 'replacement' rdev instead of the main rdev.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: prepare data structures for handling replacement. · 69335ef3
      NeilBrown authored
      Allow each slot in the RAID10 to have 2 devices, the want_replacement
      and the replacement.
      
      Also allow an r10bio to have 2 bios, and for resync/recovery allocate the
      second bio if there are any replacement devices.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: Mark device want_replacement when we see a write error. · 3a6de292
      NeilBrown authored
      Now that WantReplacement drives are replaced cleanly, mark a drive
      as WantReplacement when we see a write error.  It might get failed soon so
      the WantReplacement flag is irrelevant, but if the write error is recorded
      in the bad block log, we still want to activate any spare that might
      be available.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: If there is a spare and a want_replacement device, start replacement. · 7bfec5f3
      NeilBrown authored
      When attempting to add a spare to a RAID[456] array, also consider
      adding it as a replacement for a want_replacement device.
      
      This requires that common md code attempt hot_add even when the array
      is not formally degraded.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: recognise replacements when assembling array. · 17045f52
      NeilBrown authored
      If a Replacement is seen, file it as such.
      
      If we see two replacements (or two normal devices) for the one slot,
      abort.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: handle activation of replacement device when recovery completes. · dd054fce
      NeilBrown authored
      When recovery completes - as reported by a call to ->spare_active,
      we clear In_sync on the original and set it on the replacement.
      
      Then when the original gets removed we move the replacement from
      'replacement' to 'rdev'.
      
      This could race with other code that is looking at these pointers,
      so we use memory barriers and careful ordering to ensure that
      a reader might see one device twice, but never no devices.
      Then the readers guard against using both devices, which could
      only happen when writing.
      Signed-off-by: NeilBrown <neilb@suse.de>
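
The ordering argument in this commit can be sketched in user space. This is an illustration only: it uses C11 sequentially consistent atomics as a stand-in for the kernel's memory barriers, and `struct dev`, `struct slot`, `promote_replacement` and `pick_devices` are hypothetical names. The writer publishes the replacement as the main device before clearing the replacement pointer; the reader loads the replacement pointer first and discards a duplicate.

```c
#include <stdatomic.h>
#include <stddef.h>

struct dev { int id; };   /* hypothetical stand-in for an rdev */

struct slot {
    _Atomic(struct dev *) rdev;
    _Atomic(struct dev *) replacement;
};

/*
 * Writer: the original has been removed, so promote the replacement.
 * Store it in 'rdev' BEFORE clearing 'replacement'. There is a window
 * where both pointers name the same device, but never a window where a
 * careful reader finds both empty.
 */
void promote_replacement(struct slot *s)
{
    struct dev *r = atomic_load(&s->replacement);

    if (!r)
        return;
    atomic_store(&s->rdev, r);            /* step 1: visible as main device */
    atomic_store(&s->replacement, NULL);  /* step 2: no longer a replacement */
}

/*
 * Reader: load 'replacement' first, then 'rdev'. If the first load saw
 * NULL because promotion finished, the second load must see the promoted
 * device. If both loads name the same device, drop the duplicate so a
 * write is not submitted twice.
 */
void pick_devices(struct slot *s, struct dev **main_dev, struct dev **repl_dev)
{
    struct dev *repl = atomic_load(&s->replacement);
    struct dev *primary = atomic_load(&s->rdev);

    if (!primary) {
        primary = repl;
        repl = NULL;
    }
    if (primary == repl)
        repl = NULL;

    *main_dev = primary;
    *repl_dev = repl;
}
```

The reader's load order matters: loading `rdev` first and `replacement` second could observe the pre-promotion `rdev` (empty) and the post-promotion `replacement` (also empty), which is exactly the "no devices" case the commit rules out.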
    • md/raid5: detect and handle replacements during recovery. · 9a3e1101
      NeilBrown authored
      During recovery we want to write to the replacement but not
      the original.  So we have two new flags
       - R5_NeedReplace if this stripe has a replacement that needs to
         be written at some stage
       - R5_WantReplace if NeedReplace, and the data is available, and
         a 'sync' has been requested on this stripe.
      
      We also distinguish between 'sync and replace' which need to read
      all other devices, and 'replace' which only needs to read the
      devices being replaced.
      
      Note that during resync we always write to any replacement device.
      It might not need to be written to, but as we don't read to compare,
      we have to write to be sure.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: writes should get directed to replacement as well as original. · 977df362
      NeilBrown authored
      When writing, we need to submit two writes, one to the original, and
      one to the replacement - if there is a replacement.
      
      If the write to the replacement results in a write error, we just fail
      the device.  We only try to record write errors to the original.
      
      When writing for recovery, we shouldn't write to the original.  This
      will be addressed in a subsequent patch that generally addresses
      recovery.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: allow removal for failed replacement devices. · 657e3e4d
      NeilBrown authored
      Enhance raid5_remove_disk to be able to remove ->replacement
      as well as ->rdev.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: preferentially read from replacement device if possible. · 14a75d3e
      NeilBrown authored
      If a replacement device is present and has been recovered far enough,
      then use it for reading into the stripe cache.
      
      If we get an error we don't try to repair it, we just fail the device.
      A replacement device that gives errors does not sound sensible.
      
      This requires removing the setting of R5_ReadError when we get
      a read error during a read that bypasses the cache.  It was probably
      a bad idea anyway as we don't know that every block in the read
      caused an error, and it could cause ReadError to be set for the
      replacement device, which is bad.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: remove redundant bio initialisations. · 995c4275
      NeilBrown authored
      We currently initialise some fields of a bio when preparing a
      stripe_head, and again just before submitting the request.
      
      Remove the duplication by only setting the fields that lower level
      devices don't touch in raid5_build_block, and only set the changeable
      fields in ops_run_io.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: raid5.h cleanup · ede7ee8b
      NeilBrown authored
      Remove some #defines that are no longer used, and replace some
      others with an enum.
      And remove an unused field.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: allow each slot to have an extra replacement device · 671488cc
      NeilBrown authored
      Just enhance data structures to record a second device per slot to be
      used as a 'replacement' device, replacing the original.
      We also have a second bio in each slot in each stripe_head.  This will
      only be used when writing to the array - we need to write to both the
      original and the replacement at the same time, so will need two bios.
      
      For now, only try using the replacement drive for aligned-reads.
      In this case, we prefer the replacement if it has been recovered far
      enough, otherwise use the original.
      
      This includes a small enhancement.  Previously we would only do
      aligned reads if the target device was fully recovered.  Now we also
      do them if it has recovered far enough.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: create externally visible flags for supporting hot-replace. · 2d78f8c4
      NeilBrown authored
      hot-replace is a feature being added to md which will allow a
      device to be replaced without removing it from the array first.
      
      With hot-replace a spare can be activated and recovery can start while
      the original device is still in place, thus allowing a transition from
      an unreliable device to a reliable device without leaving the array
      degraded during the transition.  It can also be used when the original
      device is still reliable but is not wanted for some reason.
      
      This will eventually be supported in RAID4/5/6 and RAID10.
      
      This patch adds a super-block flag to distinguish the replacement
      device.  If an old kernel sees this flag it will reject the device.
      
      It also adds two per-device flags which are viewable and settable via
      sysfs.
         "want_replacement" can be set to request that a device be replaced.
         "replacement" is set to show that this device is replacing another
         device.
      
      The "rd%d" links in /sys/block/mdXx/md only apply to the original
      device, not the replacement.  We currently don't make links for the
      replacement - there doesn't seem to be a need.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: change hot_remove_disk to take an rdev rather than a number. · b8321b68
      NeilBrown authored
      Soon an array will be able to have multiple devices with the
      same raid_disk number (an original and a replacement).  So removing
      a device based on the number won't work.  So pass the actual device
      handle instead.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove test for duplicate device when setting slot number. · 476a7abb
      NeilBrown authored
      When setting the slot number on a device in an active array we
      currently check that the number is not already in use.
      We then call into the personality's hot_add_disk function
      which performs the same test and returns the same error.
      
      Thus the common test is not needed.
      
      As we will shortly be changing some personalities to allow duplicates
      in some cases (to support hot-replace), the common test will become
      inconvenient.
      
      So remove the common test.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/bitmap: be more consistent when setting new bits in memory bitmap. · 915c420d
      NeilBrown authored
      For each active region corresponding to a bit in the bitmap we have
      a 14-bit counter (and some flags).
      This counts
         number of active writes + bit in the on-disk bitmap + delay-needed.
      
      The "delay-needed" is because we always want a delay before clearing a
      bit.  So the number here is normally number of active writes plus 2.
      If there have been no writes for a while, we drop to 1.
      If still no writes we clear the bit and drop to 0.
      
      So for consistency, when setting a bit from the on-disk bitmap or by
      request from user-space it is best to set the counter to '2' to start
      with.
      
      In particular we might also set the NEEDED_MASK flag at this time, and
      in all other cases NEEDED_MASK is only set when the counter is 2 or
      more.
      Signed-off-by: NeilBrown <neilb@suse.de>
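
The counter life cycle described in this commit can be modelled roughly as follows. This is a sketch under assumptions: the 16-bit layout, the `NEEDED_FLAG` bit position and all function names are hypothetical, chosen only to illustrate "active writes + 2", the 2 -> 1 -> 0 ageing, and starting at 2 when a bit is set from disk or by request.

```c
#include <stdint.h>

/* Hypothetical layout: low 14 bits are the count, high bits are flags. */
#define COUNTER_MASK 0x3fffu
#define NEEDED_FLAG  0x8000u   /* assumption: stands in for NEEDED_MASK */

typedef uint16_t region_counter_t;

unsigned int counter_value(region_counter_t c)
{
    return c & COUNTER_MASK;
}

/*
 * The bit was found set on disk, or user space asked for it to be set.
 * Start the count at 2, the same value an idle-but-dirty region has, so
 * that a flag set here always accompanies a count of 2 or more.
 */
void set_from_disk(region_counter_t *c, int needs_resync)
{
    *c = (region_counter_t)(2u | (needs_resync ? NEEDED_FLAG : 0u));
}

/* A write begins: a clean region (0) first becomes 2, then counts the write. */
void write_start(region_counter_t *c)
{
    if (counter_value(*c) == 0)
        *c = 2;
    (*c)++;
}

/* A write completes. */
void write_end(region_counter_t *c)
{
    (*c)--;
}

/*
 * Periodic ageing of an idle region: 2 -> 1, then 1 -> 0.
 * Returns 1 when the on-disk bit may be cleared. The comparison is on the
 * whole word, so in this sketch a region with a flag set is not aged.
 */
int age(region_counter_t *c)
{
    if (*c == 2) {
        *c = 1;
        return 0;
    }
    if (*c == 1) {
        *c = 0;
        return 1;
    }
    return 0;
}
```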
    • md: Fix userspace free_pages() macro · 38059ec2
      Steven Rostedt authored
      While using etags to find free_pages(), I stumbled across this debug
      definition of free_pages() that is to be used while debugging some raid
      code in userspace. The __get_free_pages() allocates the correct size,
      but the free_pages() does not match. free_pages(), like
      __get_free_pages(), takes an order and not a size.
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
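
For illustration, the order-to-size relationship this fix depends on is shown below. It is a sketch only: the 4 KiB page size and the helper name `order_to_bytes` are assumptions, not taken from the patch.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/*
 * An allocation of a given order covers PAGE_SIZE * 2^order bytes.
 * A user-space stand-in that releases memory by byte length has to apply
 * the same conversion on the free side as on the allocation side; passing
 * the raw order as a length would release far too little.
 */
size_t order_to_bytes(unsigned int order)
{
    return (size_t)SKETCH_PAGE_SIZE << order;
}
```

With these assumptions, order 0 is one page (4096 bytes) and order 3 is eight pages (32768 bytes), whereas treating the order itself as a length would mean 0 and 3 bytes.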
    • md/raid5: be more thorough in calculating 'degraded' value. · 908f4fbd
      NeilBrown authored
      When an array is being reshaped to change the number of devices,
      the two halves can be differently degraded.  e.g. one could be
      missing a device and the other not.
      
      So we need to be more careful about calculating the 'degraded'
      attribute.
      
      Instead of just inc/dec at appropriate times, perform a full
      re-calculation examining both possible cases.  This doesn't happen
      often so it is not a big cost, and we already have most of the code to
      do it.
      Signed-off-by: NeilBrown <neilb@suse.de>
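
A simplified sketch of "examining both possible cases": count the missing devices under the old and the new device count and report the worse of the two. The names are hypothetical, and the handling of devices that are still recovering is deliberately left out.

```c
#include <stdbool.h>

/* Count the slots in [0, ndisks) that have no working device. */
int count_missing(const bool *working, int ndisks)
{
    int i;
    int missing = 0;

    for (i = 0; i < ndisks; i++)
        if (!working[i])
            missing++;
    return missing;
}

/*
 * During a reshape that changes the number of devices, the part of the
 * array still in the old layout and the part already in the new layout can
 * be degraded differently. Recompute both and report the larger value.
 * 'working' must hold at least max(previous_disks, new_disks) entries.
 */
int calc_degraded_sketch(const bool *working, int previous_disks, int new_disks)
{
    int before = count_missing(working, previous_disks);
    int after = count_missing(working, new_disks);

    return before > after ? before : after;
}
```

For example, growing from 3 to 4 devices with the fourth slot still empty gives 0 for the old layout and 1 for the new one, so the array as a whole is reported as degraded by 1.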
    • md/bitmap: daemon_work cleanup. · 2e61ebbc
      NeilBrown authored
      We have a variable 'mddev' in this function, but repeatedly get the
      same value by dereferencing bitmap->mddev.
      There is room for simplification here...
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: allow non-privileged users to GET_*_INFO about raid arrays. · 506c9e44
      NeilBrown authored
      The info is already available in /proc/mdstat and /sys/block in
      an accessible form so there is no point in putting a road-block in
      the ioctl for information gathering.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/bitmap: It is OK to clear bits during recovery. · 961902c0
      NeilBrown authored
      commit d0a4bb49 introduced a
      regression which is annoying but fairly harmless.
      
      When writing to an array that is undergoing recovery (a spare
      is being integrated into the array), writing to the array will
      set bits in the bitmap, but they will not be cleared when the
      write completes.
      
      For bits covering areas that have not been recovered yet this is not a
      problem as the recovery will clear the bits.  However bits set in
      already-recovered region will stay set and never be cleared.
      This doesn't risk data integrity.  The only negatives are:
       - next time there is a crash, more resyncing than necessary will
         be done.
       - the bitmap doesn't look clean, which is confusing.
      
      While an array is recovering we don't want to update the
      'events_cleared' setting in the bitmap but we do still want to clear
      bits that have very recently been set - providing they were written to
      the recovering device.
      
      So split those two needs - which previously both depended on 'success' -
      and always clear the bit if the write went to all devices.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: don't give up looking for spares on first failure-to-add · 60fc1370
      NeilBrown authored
      Before performing a recovery we try to remove any spares that
      might not be working, then add any that might have become relevant.
      
      Currently we abort on the first spare that cannot be added.
      This is a false optimisation.
      It is conceivable that - depending on rules in the personality - a
      subsequent spare might be accepted.
      Also the loop does other things like count the available spares and
      reset the 'recovery_offset' value.
      
      If we abort early these might not happen properly.
      
      So remove the early abort.
      
      In particular if you have an array that is undergoing recovery and
      which has extra spares, then the recovery may not restart after a
      reboot as the count of 'spares' might end up as zero.
      Reported-by: Anssi Hannula <anssi.hannula@iki.fi>
      Signed-off-by: NeilBrown <neilb@suse.de>
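
A rough model of the loop after this change, assuming a hypothetical `struct member` and a field `offered_slot` that stands in for the personality's answer to a hot-add request (negative means refused). The point is that a refusal skips one device instead of ending the scan, so later devices are still counted.

```c
#include <stddef.h>

struct member {
    int slot;           /* >= 0 if already in the array, -1 for a loose spare */
    int in_sync;        /* non-zero once fully recovered */
    int offered_slot;   /* hypothetical: slot hot-add would assign, or -1 */
    unsigned long long recovery_offset;
};

/*
 * Scan every device before a recovery and return how many spares are
 * available to recover onto.
 */
int scan_spares(struct member *m, size_t n)
{
    size_t i;
    int spares = 0;

    for (i = 0; i < n; i++) {
        if (m[i].slot >= 0) {
            /* Already in a slot: counts as a spare while still recovering. */
            if (!m[i].in_sync)
                spares++;
            continue;
        }
        m[i].recovery_offset = 0;
        if (m[i].offered_slot < 0)
            continue;   /* refused: keep scanning (the old code stopped here) */
        m[i].slot = m[i].offered_slot;
        spares++;
    }
    return spares;
}
```

With an unattachable extra spare listed before a device that is mid-recovery, stopping at the first refusal would return 0 and recovery would not restart; continuing returns 1.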
    • md/raid5: ensure correct assessment of drives during degraded reshape. · 30d7a483
      NeilBrown authored
      While reshaping a degraded array (as when reshaping a RAID0 by first
      converting it to a degraded RAID4) we currently get confused about
      which devices are in_sync.  In most cases we get it right, but in the
      region that is being reshaped we need to treat non-failed devices as
      in-sync when we have the data but haven't actually written it out yet.
      Reported-by: Adam Kwolek <adam.kwolek@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/linear: fix hot-add of devices to linear arrays. · 09cd9270
      NeilBrown authored
      commit d70ed2e4
      broke hot-add to a linear array.
      After that commit, metadata is not written to devices until they
      have been fully integrated into the array as determined by
      saved_raid_disk.  That patch arranged to clear that field after
      a recovery completed.
      
      However for linear arrays, there is no recovery - the integration is
      instantaneous.  So we need to explicitly clear the saved_raid_disk
      field.
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 09 Dec, 2011 1 commit
  3. 08 Dec, 2011 5 commits
    • md/raid5: never wait for bad-block acks on failed device. · 9283d8c5
      NeilBrown authored
      Once a device is failed we really want to completely ignore it.
      It should go away soon anyway.
      
      In particular the presence of bad blocks on it should not cause us to
      block as we won't be trying to write there anyway.
      
      So as soon as we can check if a device is Faulty, do so and pretend
      that it is already gone if it is Faulty.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: ensure new badblocks are handled promptly. · 8bd2f0a0
      NeilBrown authored
      When we mark blocks as bad we need them to be acknowledged by the
      metadata handler promptly.
      
      For an in-kernel metadata handler that was already being done.  But
      for an external metadata handler we need to alert it of the change by
      sending a notification through the sysfs file.  This adds that
      notification.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: bad blocks shouldn't cause a Blocked status on a Faulty device. · 52c64152
      NeilBrown authored
      Once a device is marked Faulty the badblocks - whether acknowledged or
      not - become irrelevant.  So they shouldn't cause the device to be
      marked as Blocked.
      
      Without this patch, a process might write "-blocked" to clear the
      Blocked status, but while that will correctly fail the device, it
      won't remove the apparent 'blocked' status.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: take a reference to mddev during sysfs access. · af8a2434
      NeilBrown authored
      
      When we are accessing an mddev via sysfs we know that the
      mddev cannot disappear because it has an embedded kobj which
      is refcounted by sysfs.
      And we also take the mddev_lock.
      However this is not enough.
      
      The final mddev_put could have been called and the
      mddev_delayed_delete is waiting for sysfs to let go so it can destroy
      the kobj and mddev.
      In this state there are a lot of changes that should not be attempted.
      
      To guard against this we:
       - initialise mddev->all_mddevs on last put so the state can be
         easily detected.
       - in md_attr_show and md_attr_store, check ->all_mddevs under
         all_mddevs_lock and mddev_get the mddev if it still appears to
         be active.
      
      This means that if we get to sysfs as the mddev is being deleted we
      will get -EBUSY.
      
      rdev_attr_store and rdev_attr_show are similar but already have
      sufficient protection.  They check that rdev->mddev still points to
      mddev after taking mddev_lock.  As this is cleared before delayed
      removal which can only be requested under the mddev_lock, this
      ensures the rdev and mddev are still alive.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: refine interpretation of "hold_active == UNTIL_IOCTL". · 1d23f178
      NeilBrown authored
      We like md devices to disappear when they really are not needed.
      However it is not possible to tell from the current state whether it
      is needed or not.  We can only tell from recent history of changes.
      
      In particular immediately after we create an md device it looks very
      similar to immediately after we have finished with it.
      
      So we always preserve a newly created md device until something
      significant happens.  This state is stored in 'hold_active'.
      
      The normal case is to keep it until an ioctl happens, as that will
      normally either activate it, or explicitly de-activate it.  If it
      doesn't then it was probably created by mistake and it is now time to
      get rid of it.
      
      We can also modify an array via sysfs (instead of via ioctl) and we
      currently treat any change via sysfs like an ioctl as a sign that if
      it now isn't more active, it should be destroyed.
      However this is not appropriate as changes made via sysfs are more
      gradual so we should look for a more definitive change.
      
      So this patch only clears 'hold_active' from UNTIL_IOCTL to clear when
      the array_state is changed via sysfs.  Other changes via sysfs
      are ignored.
      Signed-off-by: NeilBrown <neilb@suse.de>
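
The rule this patch introduces can be sketched as a small state transition. The enum values, struct and function names below are hypothetical; only `hold_active`, `UNTIL_IOCTL` and `array_state` come from the commit message.

```c
#include <string.h>

/* Hypothetical, reduced set of hold states. */
enum hold_state { HOLD_CLEAR = 0, HOLD_UNTIL_IOCTL };

struct array_sketch {
    enum hold_state hold_active;
};

/* Any ioctl ends the "keep this new device around" grace period. */
void on_ioctl(struct array_sketch *a)
{
    if (a->hold_active == HOLD_UNTIL_IOCTL)
        a->hold_active = HOLD_CLEAR;
}

/*
 * A sysfs write ends the grace period only when it changes array_state.
 * Writes to any other attribute leave hold_active untouched, because
 * configuration through sysfs happens in several small steps.
 */
void on_sysfs_store(struct array_sketch *a, const char *attr)
{
    if (a->hold_active == HOLD_UNTIL_IOCTL && strcmp(attr, "array_state") == 0)
        a->hold_active = HOLD_CLEAR;
}
```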
  4. 22 Nov, 2011 1 commit
    • md/lock: ensure updates to page_attrs are properly locked. · 7c8f4247
      NeilBrown authored
      Page attributes are set using __set_bit rather than set_bit as
      it is normally called under a spinlock so the extra atomicity is not
      needed.
      
      However there are two places where we might set or clear page
      attributes without holding the spinlock.
      So add the spinlock in those cases.
      
      This might be the cause of occasional reports that bits aren't
      getting cleared properly - theory is that BITMAP_PAGE_PENDING gets lost
      when BITMAP_PAGE_NEEDWRITE is set or cleared.  This is an
      inconvenience, not a threat to data safety.
      Signed-off-by: NeilBrown <neilb@suse.de>
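
To illustrate why a plain (non-atomic) bit update needs the lock: setting or clearing a bit is a read-modify-write of the whole word, so two unlocked updates to different bits can overwrite each other. The sketch below is a user-space analogy with hypothetical names and bit positions, using a C11 `atomic_flag` as a stand-in for the spinlock.

```c
#include <stdatomic.h>

#define PAGE_PENDING   (1u << 0)   /* hypothetical bit positions */
#define PAGE_NEEDWRITE (1u << 1)

struct page_attrs {
    atomic_flag lock;    /* stand-in for the spinlock */
    unsigned int bits;   /* plain word, updated with non-atomic |= and &= */
};

static void attrs_lock(struct page_attrs *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
        ;   /* spin until the holder releases */
}

static void attrs_unlock(struct page_attrs *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* Every update takes the lock, so no read-modify-write can be interleaved. */
void set_attr(struct page_attrs *p, unsigned int bit)
{
    attrs_lock(p);
    p->bits |= bit;
    attrs_unlock(p);
}

void clear_attr(struct page_attrs *p, unsigned int bit)
{
    attrs_lock(p);
    p->bits &= ~bit;
    attrs_unlock(p);
}
```

Without the lock, one thread could read the word, another could set `PAGE_PENDING`, and the first could then write back its stale copy with only `PAGE_NEEDWRITE` changed, silently dropping the pending bit.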
  5. 08 Nov, 2011 4 commits
  6. 07 Nov, 2011 3 commits
    • VFS: we need to set LOOKUP_JUMPED on mountpoint crossing · a3fbbde7
      Al Viro authored
      Mountpoint crossing is similar to following procfs symlinks - we do
      not get ->d_revalidate() called for dentry we have arrived at, with
      unpleasant consequences for NFS4.
      
      Simple way to reproduce the problem in mainline:
      
          cat >/tmp/a.c <<'EOF'
          #include <unistd.h>
          #include <fcntl.h>
          #include <stdio.h>
          int main(void)
          {
                  struct flock fl = {.l_type = F_RDLCK, .l_whence = SEEK_SET, .l_len = 1};
                  if (fcntl(0, F_SETLK, &fl))
                          perror("setlk");
          }
          EOF
          cc /tmp/a.c -o /tmp/test
      
      then on nfs4:
      
          mount --bind file1 file2
          /tmp/test < file1		# ok
          /tmp/test < file2		# spews "setlk: No locks available"...
      
      What happens is the missing call of ->d_revalidate() after mountpoint
      crossing and that's where NFS4 would issue OPEN request to server.
      
      The fix is simple - treat mountpoint crossing the same way we deal with
      following procfs-style symlinks.  I.e.  set LOOKUP_JUMPED...
      
      Cc: stable@kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 54a0f913
      Linus Torvalds authored
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf top: Fix live annotation in the --stdio interface
        perf top tui: Don't recalc column widths considering just the first page
        perf report: Add progress bar when processing time ordered events
        perf hists browser: Warn about lost events
        perf tools: Fix a typo of command name as trace-cmd
        perf hists: Fix recalculation of total_period when sorting entries
        perf header: Fix build on old systems
        perf ui browser: Handle K_RESIZE in dialog windows
        perf ui browser: No need to switch char sets that often
        perf hists browser: Use K_TIMER
        perf ui: Rename ui__warning_paranoid to ui__error_paranoid
        perf ui: Reimplement the popup windows using libslang
        perf ui: Reimplement ui__popup_menu using ui__browser
        perf ui: Reimplement ui_helpline using libslang
        perf ui: Improve handling sigwinch a bit
        perf ui progress: Reimplement using slang
        perf evlist: Fix grouping of multiple events
    • Tony Lindgren's avatar
      d30cc16c