Commits · bd18f6462f3d167a9b3ec27851c98f82694b2adf · nexedi / linux

01 Nov, 2015 9 commits

md: skip resync for raid array with journal · bd18f646

Shaohua Li authored Sep 02, 2015

If a raid array has a journal, the journal guarantees consistency, so we
can skip resync after an unclean shutdown. The exceptions are raid
creation and user-initiated resync, for which we still do a raid resync.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

bd18f646

raid5-cache: optimize FLUSH IO with log enabled · 828cbe98

Shaohua Li authored Sep 02, 2015

With log enabled, a bio is written to the raid disks only after it has
settled in the log disk. Recovery guarantees we can recover the bio data
from the log disk, so we skip FLUSH IO.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

828cbe98

raid5-cache: move functionality out of __r5l_set_io_unit_state · 509ffec7

Christoph Hellwig authored Sep 02, 2015

Just keep __r5l_set_io_unit_state as a small set-the-state wrapper, and
remove r5l_set_io_unit_state entirely after moving the real
functionality to the two callers that need it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

509ffec7

raid5-cache: fix a user-after-free bug · 0fd22b45

Shaohua Li authored Sep 02, 2015

r5l_compress_stripe_end_list() can free an io_unit. This breaks the
assumption that only the reclaimer can free an io_unit. We could add
reference-count-based io_unit freeing, but since only reclaim waits for
an io_unit to reach the STRIPE_END state, we use a simple global wait
queue here.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

0fd22b45

raid5-cache: switching to state machine for log disk cache flush · a8c34f91

Shaohua Li authored Sep 02, 2015

Before we write stripe data to the raid disks, we must guarantee that the
stripe data has settled in the log disk. To do this, we flush the log disk
cache and wait for the flush to finish. That wait introduces sleep time
in the raid5d thread and hurts performance. This patch moves the log disk
cache flush into the stripe handling state machine, which removes the
wait in raid5d.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

a8c34f91

raid5: enable log for raid array with cache disk · 5c7e81c3

Shaohua Li authored Aug 13, 2015

The log is now safe to enable for a raid array with a cache disk.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

5c7e81c3

raid5: don't allow resize/reshape with cache(log) support · 713cf5a6

Shaohua Li authored Aug 13, 2015

If cache(log) support is enabled, don't allow resize/reshape in current
stage. In the future, we can flush all data from cache(log) to raid
before resize/reshape and then allow resize/reshape.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

713cf5a6

raid5: disable batch with log enabled · 9c3e333d

Shaohua Li authored Aug 13, 2015

With log enabled, r5l_write_stripe will add the stripe to the log. With
batch, several stripes are linked together and must be in the same
state. With log, however, the log/reclaim unit is a single stripe, so we
can't guarantee that the linked stripes are in the same state. Disable
batch when log is enabled for now.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

9c3e333d

raid5-cache: use crc32c checksum · 5cb2fbd6

Shaohua Li authored Oct 28, 2015

crc32c has lower overhead with cpu acceleration. It's a shame I didn't
use it in the first post, sorry. This changes the disk format, but that
is still ok at the current stage.

V2: delete unnecessary type conversion as pointed out by Bart
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
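
For readers unfamiliar with the checksum, here is a minimal bitwise sketch of CRC-32C (Castagnoli polynomial). The function name is hypothetical, and the conventional 0xFFFFFFFF initial value and final XOR are used here; the kernel's helpers are seeded by the caller, so this is not claimed to reproduce the on-disk checksum values of the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected form of the Castagnoli polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/*
 * Bitwise CRC-32C with the conventional initial value and final XOR of
 * 0xFFFFFFFF. Slow but easy to check; accelerated implementations
 * compute the same function much faster.
 */
static uint32_t crc32c_sketch(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* If the low bit is set, shift and fold in the polynomial. */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check string "123456789" gives 0xE3069283 with this variant, which is a convenient sanity test.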

5cb2fbd6

24 Oct, 2015 13 commits

raid5: log recovery · 355810d1

Shaohua Li authored Aug 13, 2015

This is the log recovery support. The process is quite straightforward.
We scan the log and read all valid meta/data/parity into memory. If a
stripe's data/parity checksum is correct, the stripe will be recovered.
Otherwise, it's discarded and we don't scan the log further. The reclaim
process guarantees that a stripe which starts to be flushed to raid disks
has complete data/parity and a correct checksum. To recover a stripe, we
just copy its data/parity to the corresponding raid disks.

The tricky thing is the superblock update after recovery. We can't let
the superblock point to the last valid meta block. The log might look like:
| meta 1| meta 2| meta 3|
meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If the
superblock points to meta 1 and we write a new valid meta 2n, then if a
crash happens again, the new recovery will start from meta 1. Since meta 2n
is valid, recovery will think meta 3 is valid, which is wrong. The
solution is to create a new meta in meta2 with its seq == meta 1's seq + 10
and let the superblock point to meta2. Recovery will not think meta 3 is a
valid meta, because its seq is wrong.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
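
The sequence-number argument above can be checked with a small model. This is only a sketch: `struct meta_sketch` and `scan_log()` are hypothetical names, and the model assumes recovery accepts a meta block only when its checksum verifies and its seq is exactly one more than the previous accepted block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one meta block in the log. */
struct meta_sketch {
    uint64_t seq;      /* sequence number stored in the block */
    bool checksum_ok;  /* whether the block's checksum verified */
};

/*
 * Count the meta blocks recovery would accept, starting at index `start`
 * (where the superblock points) with expected sequence number `seq`.
 * Scanning stops at the first bad checksum or unexpected sequence number.
 */
static size_t scan_log(const struct meta_sketch *log, size_t n,
                       size_t start, uint64_t seq)
{
    size_t accepted = 0;

    assert(log != NULL || n == 0);
    for (size_t i = start; i < n; i++) {
        if (!log[i].checksum_ok || log[i].seq != seq)
            break;
        accepted++;
        seq++;
    }
    return accepted;
}
```

With seqs {1, 2 (bad), 3}, the scan from meta 1 accepts one block. If meta 2 is later rewritten with seq 2 while the superblock still points at meta 1, the scan wrongly accepts the stale meta 3. Rewriting meta 2 with seq 11 and starting there makes the stale meta 3 (seq 3, expected 12) fail the check.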

355810d1

raid5: log reclaim support · 0576b1c6

Shaohua Li authored Aug 13, 2015

This is the reclaim support for raid5 log. A stripe write has the
following steps:

1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appended to the log disk
3. flush log disk cache
4. ops_run_io runs again and does its normal operation. stripe
data/parity is written to the raid array disks. raid core can return io
to the upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused

In practice, several stripes make up an io_unit and we will batch
several io_units in different steps, but the whole process doesn't
change.

It would be possible to return IO just after data/parity hits the log
disk, but then read IO would need to read from the log disk. For
simplicity, IO return happens at step 4, where read IO can read directly
from the raid disks.

Currently reclaim runs if there is a certain amount of reclaimable space
(1/4 disk size or 10G) or we are out of space. Reclaim only frees log
disk space; it doesn't impact data consistency. The size-based forced
reclaim makes sure the log doesn't grow too big, so recovery doesn't have
to scan too much log.
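
A minimal sketch of that trigger follows. It assumes "1/4 disk size or 10G" means whichever is smaller, that sizes are counted in 512-byte sectors, and that "10G" means 10 GiB; the names are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTORS_PER_GIB     (2ull * 1024 * 1024)      /* 512-byte sectors */
#define RECLAIM_CAP_SECTORS (10ull * SECTORS_PER_GIB) /* the "10G" cap */

/* Reclaimable space that triggers reclaim: min(log size / 4, 10 GiB). */
static uint64_t reclaim_target(uint64_t log_sectors)
{
    uint64_t quarter = log_sectors / 4;

    return quarter < RECLAIM_CAP_SECTORS ? quarter : RECLAIM_CAP_SECTORS;
}

/* Reclaim runs when enough space is reclaimable or the log is full. */
static bool should_reclaim(uint64_t log_sectors,
                           uint64_t reclaimable_sectors,
                           bool out_of_space)
{
    assert(reclaimable_sectors <= log_sectors);
    return out_of_space ||
           reclaimable_sectors >= reclaim_target(log_sectors);
}
```

Under these assumptions an 8 GiB log triggers reclaim at 2 GiB of reclaimable space, while a 100 GiB log is capped at 10 GiB.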

Recovery makes sure the raid disks and the log disk have the same data
for a stripe. If a crash happens before step 4, recovery might or might
not recover the stripe's data/parity, depending on whether the
data/parity matches its checksum. In either case, this doesn't change the
semantics of an IO write. After step 3, the stripe is guaranteed
recoverable, because its data/parity is persistent in the log disk. In
some cases, the log disk content and raid disk content of a stripe are
the same, but recovery will still copy the log disk content to the raid
disks; this doesn't impact data consistency. Space reuse happens after
the superblock update and cache flush.

There is one situation we want to avoid. A broken meta in the middle of
a log prevents recovery from finding the meta at the head of the log. If
an operation requires the meta at the head to be persistent in the log,
we must make sure the metas before it are persistent in the log too. The
case is when stripe data/parity is in the log and we start writing the
stripe to raid disks (before step 4): stripe data/parity must be
persistent in the log before we do the write to raid disks. The solution
is to strictly maintain io_unit list order. In this case, we only write
the stripes of an io_unit to raid disks once the io_unit is the first one
whose data/parity is in the log.

The io_unit list order is important for other cases too. For example,
some io_unit are reclaimable and others not. They can be mixed in the
list, we shouldn't reuse space of an unreclaimable io_unit.
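
The ordering rule can be pictured as "only ever act on a prefix of the list". The sketch below is a simplified model with hypothetical names, using a plain array in list order instead of the kernel's linked lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified io_unit: two flags instead of a full state. */
struct io_unit_sketch {
    bool persistent_in_log; /* data/parity and meta are safely in the log */
    bool reclaimable;       /* its log space could be reused */
};

/*
 * Number of io_units, counted from the head of the list, whose stripes
 * may be written to the raid disks. An io_unit behind one that is not
 * yet persistent must wait, even if it is persistent itself.
 */
static size_t writable_prefix(const struct io_unit_sketch *list, size_t n)
{
    size_t i = 0;

    assert(list != NULL || n == 0);
    while (i < n && list[i].persistent_in_log)
        i++;
    return i;
}

/* Same idea for space reuse: stop at the first unreclaimable io_unit. */
static size_t reclaimable_prefix(const struct io_unit_sketch *list, size_t n)
{
    size_t i = 0;

    assert(list != NULL || n == 0);
    while (i < n && list[i].reclaimable)
        i++;
    return i;
}
```

For a list whose third io_unit is not yet persistent, only the first two are writable, regardless of the state of the fourth.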

Includes fixes to problems which were...
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

0576b1c6

raid5: add basic stripe log · f6bed0ef

Shaohua Li authored Aug 13, 2015

This introduces a simple log for raid5. Data/parity destined for the raid
array is first written to the log, then written to the raid array disks.
If a crash happens, we can recover data from the log. This can speed up
raid resync and fix the write hole issue.

The log structure is pretty simple. Data/meta data is stored in block
units, which are generally 4k. There is only one type of meta data block.
The meta data block can track 3 types of data: stripe data, stripe
parity and flush block. The MD superblock will point to the last valid
meta data block. Each meta data block has a checksum/seq number, so
recovery can scan the log correctly. We store a checksum of the stripe
data/parity in the meta data block, so meta data and stripe data/parity
can be written to the log disk together. Otherwise, the meta data write
would have to wait until the stripe data/parity write is finished.

For stripe data, the meta data block will record the stripe data sector
and size. Currently the size is always 4k. This meta data record can be
made simpler if we just fix the write hole (eg, we can record data of a
stripe's different disks together), but this format can be extended to
support caching in the future, which must record data address/size.

For stripe parity, the meta data block will record the stripe sector. Its
size should be 4k (for raid5) or 8k (for raid6). We always store p
parity first. This format should work for caching too.

A flush block indicates a stripe is in the raid array disks. Fixing the
write hole doesn't need this type of meta data; it's for the caching
extension.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
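
To make the three payload types concrete, here is an illustrative sketch of what one payload record could carry. The struct, field names and helpers are hypothetical and are not the on-disk format introduced by the patch; only the sizes (4k data, 4k or 8k parity, P stored first) come from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_BLOCK_SIZE 4096u /* the "generally 4k" block unit */

enum payload_type_sketch { PAYLOAD_DATA, PAYLOAD_PARITY, PAYLOAD_FLUSH };

/* Hypothetical record describing one payload tracked by a meta block. */
struct payload_sketch {
    enum payload_type_sketch type;
    uint64_t sector;      /* data sector (data) or stripe sector (parity) */
    uint32_t size;        /* bytes of payload stored in the log */
    uint32_t checksum[2]; /* one per 4k block; [0] is P, [1] is Q */
};

/* Parity occupies 4k for raid5 and 8k (P then Q) for raid6. */
static uint32_t parity_payload_size(int raid_level)
{
    assert(raid_level == 5 || raid_level == 6);
    return raid_level == 6 ? 2 * LOG_BLOCK_SIZE : LOG_BLOCK_SIZE;
}

static struct payload_sketch make_data_payload(uint64_t sector, uint32_t csum)
{
    struct payload_sketch p = { PAYLOAD_DATA, sector, LOG_BLOCK_SIZE,
                                { csum, 0 } };
    return p;
}

static struct payload_sketch make_parity_payload(uint64_t stripe_sector,
                                                 int raid_level,
                                                 uint32_t p_csum,
                                                 uint32_t q_csum)
{
    struct payload_sketch p = { PAYLOAD_PARITY, stripe_sector,
                                parity_payload_size(raid_level),
                                { p_csum, raid_level == 6 ? q_csum : 0 } };
    return p;
}
```

Keeping the checksums in the record is what lets meta data and payload go to the log in one write, since recovery can verify the payload afterwards.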

f6bed0ef

raid5: add a new state for stripe log handling · b70abcb2

Shaohua Li authored Aug 13, 2015

When a stripe finishes construction, we write the stripe to raid in
ops_run_io normally. With log, we do a bunch of other operations before
the stripe is written to raid. Mainly write the stripe to log disk,
flush disk cache and so on. The operations are still driven by raid5d
and run in the stripe state machine. We introduce a new state for such
stripe (trapped into log). The stripe is in this state from the time it
first enters ops_run_io (finish construction) to the time it is written
to raid. Since we know the state is only for log, we bypass other
check/operation in handle_stripe.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

b70abcb2

raid5: export some functions · 6d036f7d

Shaohua Li authored Aug 13, 2015

The next several patches use some raid5 functions. Rename them with a
raid5 prefix and export them.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

6d036f7d

md: override md superblock recovery_offset for journal device · 3069aa8d

Shaohua Li authored Aug 13, 2015

The journal device stores data in a log structure. We need to record the
log start. Here we override the md superblock recovery_offset for this
purpose. This field is otherwise meaningless for a journal device.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

3069aa8d

MD: add a new disk role to present write journal device · bac624f3

Song Liu authored Aug 13, 2015

The next patches will use a disk as a raid5/6 journal. We need a new disk
role to represent the journal device, and we add MD_FEATURE_JOURNAL to
feature_map for backward compatibility.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

bac624f3

MD: replace special disk roles with macros · c4d4c91b

Song Liu authored Aug 13, 2015

Add the following two macros for special roles: spare and faulty

MD_DISK_ROLE_SPARE	0xffff
MD_DISK_ROLE_FAULTY	0xfffe

Add MD_DISK_ROLE_MAX	0xff00 as the maximal possible regular role,
and minimal value of special role.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
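
As an illustration of how the macros partition the 16-bit role space, here is a small classifier. The macro values are the ones listed above; the enum and function are hypothetical, and treating the boundary value 0xff00 itself as a regular role is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define MD_DISK_ROLE_SPARE  0xffff
#define MD_DISK_ROLE_FAULTY 0xfffe
#define MD_DISK_ROLE_MAX    0xff00

/* Hypothetical classification, for illustration only. */
enum role_kind {
    ROLE_REGULAR,       /* an ordinary slot number in the array */
    ROLE_SPARE,
    ROLE_FAULTY,
    ROLE_OTHER_SPECIAL  /* room for future special roles */
};

static enum role_kind classify_role(uint16_t role)
{
    switch (role) {
    case MD_DISK_ROLE_SPARE:
        return ROLE_SPARE;
    case MD_DISK_ROLE_FAULTY:
        return ROLE_FAULTY;
    default:
        /* Assumption: 0xff00 itself still counts as a regular role. */
        return role <= MD_DISK_ROLE_MAX ? ROLE_REGULAR
                                        : ROLE_OTHER_SPECIAL;
    }
}
```

Named constants make comparisons such as "role is 0xfffe" self-describing and leave a reserved range above MD_DISK_ROLE_MAX for new special roles.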

c4d4c91b

md-cluster: Call update_raid_disks() if another node --grow's raid_disks · 28c1b9fd

Goldwyn Rodrigues authored Oct 22, 2015

To incorporate --grow feature executed on one node, other nodes need to
acknowledge the change in number of disks. Call update_raid_disks()
to update internal data structures.

This leads to the call chain check_reshape() -> md_allow_write() ->
md_update_sb(), which results in a deadlock. md_allow_write() is called so
that memory can be allocated safely (allocation might trigger writeback,
which might write to raid1). This is not required for md with a bitmap.

In the clustered case, we don't perform md_update_sb() in md_allow_write(),
but in do_md_run(). Also we disable safemode for clustered mode.

mddev->recovery_cp need not be set in check_sb_changes() because this
is required only when a node reads another node's bitmap. mddev->recovery_cp
(which is read from sb->resync_offset), is set only if mddev is in_sync.
Since we disabled safemode, in_sync is set to zero.
In a clustered environment, the MD may not be in sync because another
node could be writing to it. So make sure that in_sync is not set in
case of clustered node in __md_stop_writes().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

28c1b9fd

md-cluster: remove mddev arg from add_resync_info() · 30661b49

NeilBrown authored Oct 19, 2015

The arg isn't used, so its presence is only confusing.
Signed-off-by: NeilBrown <neilb@suse.com>

30661b49

md-cluster: don't cast void pointers when assigning them. · 2e2a7cd9

NeilBrown authored Oct 19, 2015

It is common practice in the kernel to leave out this cast.
It isn't needed and adds little if any value.
Signed-off-by: NeilBrown <neilb@suse.com>

2e2a7cd9

md-cluster: discard unused sb_mutex. · 82381523

NeilBrown authored Oct 19, 2015

Signed-off-by: NeilBrown <neilb@suse.com>

82381523

md-cluster: Fix warnings when build with CF=-D__CHECK_ENDIAN__ · cf97a348

Guoqing Jiang authored Oct 16, 2015

This patch fixes sparse warnings such as "incorrect type in assignment
(different base types)" and "cast to restricted __le64".
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

cf97a348

16 Oct, 2015 1 commit

md-cluster: metadata_update_finish: consistently use cmsg.raid_slot as le32 · ba2746b0

NeilBrown authored Oct 16, 2015

As cmsg.raid_slot is le32, comparing for >0 is not meaningful.

So introduce cpu-endian 'raid_slot' and only assign to cmsg.raid_slot
when we know value is valid.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
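
The pitfall can be shown portably by modelling the wire field as four bytes. All names below are hypothetical stand-ins for the kernel's endian helpers, and treating non-negative slots as valid is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Store a 32-bit value in little-endian byte order (the wire format). */
static void put_le32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)((v >> 8) & 0xff);
    out[2] = (uint8_t)((v >> 16) & 0xff);
    out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Correct decode: works the same on any host. */
static uint32_t get_le32(const uint8_t in[4])
{
    return (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
           ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}

/* What a big-endian CPU sees if it reads the field without converting. */
static uint32_t get_be32(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/*
 * Validate in CPU byte order first, and only then write the wire field.
 * Assumption: a non-negative slot is the valid case.
 */
static bool set_raid_slot(uint8_t wire[4], int32_t raid_slot)
{
    if (raid_slot < 0)
        return false;
    put_le32(wire, (uint32_t)raid_slot);
    return true;
}
```

Slot 128 is stored as bytes 80 00 00 00. Read raw on a big-endian host that is 0x80000000, which is negative as a signed 32-bit value, so a sign test on the unconverted field gives the wrong answer.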

ba2746b0

13 Oct, 2015 1 commit

Merge branch 'md-next' of git://github.com/goldwynr/linux into for-next · c2a06c38

NeilBrown authored Oct 14, 2015

md-cluster: A better way for METADATA_UPDATED processing

The processing of METADATA_UPDATED message is too simple and prone to
errors. Besides, it would not update the internal data structures as
required.

This set of patches reads the superblock from one of the devices of the MD
and checks for changes against the in-memory data structures. If there is a
change, it performs the necessary actions to keep the internal data
structures as they would be on the primary node.

An example is when a device turns faulty. The algorithm is:

1. The initiator node marks the device as faulty and updates the superblock
2. The initiator node sends METADATA_UPDATED with an advisory device number to the rest of the nodes.
3. The receiving node on receiving the METADATA_UPDATED message
  3.1 Reads the superblock
  3.2 Detects a device has failed by comparing with memory structure
  3.3 Calls the necessary functions to record the failure and get the device out of the active array.
  3.4 Acknowledges the message.

The patch series also fixes adding the disk which was impacted because of
the changes.

Patches can also be found at
https://github.com/goldwynr/linux branch md-next

Changes since V2:
 - Fix status synchronization after --add and --re-add operations
 - Included Guoqing's patches on endian correctness, zeroing cmsg etc
 - Restructure add_new_disk() and cancel()

c2a06c38

12 Oct, 2015 16 commits

md: check the return value for metadata_update_start · 23b63f9f

Guoqing Jiang authored Oct 12, 2015

We shouldn't run the related md_cluster_ops functions if
metadata_update_start returned failure.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>

23b63f9f

md-cluster: only call kick_rdev_from_array after remove disk successfully · a9720903

Guoqing Jiang authored Oct 12, 2015

For cluster raid, we should not kick the disk from the array if it can't
be removed from the array successfully.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

a9720903

md-cluster: Add 'SUSE' as author for md-cluster.c · 86b57277

Guoqing Jiang authored Oct 12, 2015

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>

86b57277

md-cluster: zero cmsg before it was sent · aee177ac

Guoqing Jiang authored Oct 12, 2015

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>

aee177ac

md-cluster: make sure the node do not receive it's own msg · 256f5b24

Guoqing Jiang authored Oct 12, 2015

During past testing, the node occasionally received a msg that it had
sent itself. This should not happen in theory, but it is better to guard
against it in case something goes wrong.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

256f5b24

md-cluster: remove unnecessary setting for slot · 487cf914

Guoqing Jiang authored Oct 12, 2015

Since slot will be set within _sendmsg, we can remove
the redundant code in resync_info_update.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>

487cf914

md-cluster: make other members of cluster_msg is handled by little endian funcs · faeff83f

Guoqing Jiang authored Oct 12, 2015

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>

faeff83f

md-cluster: Do not printk() every received message · d216711b

Goldwyn Rodrigues authored Oct 12, 2015

The receive daemon prints kernel messages for every network message
received. This would fill the kernel message log with unnecessary messages.
Remove the pr_info() messages.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

d216711b

md-cluster: Fix adding of new disk with new reload code · dbb64f86

Goldwyn Rodrigues authored Oct 01, 2015

Adding the disk worked incorrectly with the new reload code. Fix it:

 - No operation should be performed on rdev marked as Candidate
 - After a metadata update operation, kick disk if role is 0xfffe
   else clear Candidate bit and continue with the regular change check.
 - Save the mode of the lock resource to check if the token lock is
   already locked, because it can be called twice while adding a disk.
   However, unlock_comm() must be called only once.
 - add_new_disk() is called by the node initiating the --add operation.
   If it needs to be canceled, call add_new_disk_cancel(). The operation
   is completed by md_update_sb() which will write and unlock the
   communication.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

dbb64f86

md-cluster: Perform resync/recovery under a DLM lock · c186b128

Goldwyn Rodrigues authored Sep 30, 2015

Resync or recovery must be performed by only one node at a time.
A DLM lock resource, resync_lockres, provides the mutual exclusion
so that only one node performs the recovery/resync at a time.

If a node is unable to get the resync_lockres because recovery is
being performed by another node, it sets MD_RECOVER_NEEDED so as
to schedule recovery in the future.

Remove the debug message in resync_info_update()
used during development.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
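
The try-lock-or-defer behaviour can be modelled in a few lines. This is a single-process sketch with hypothetical names; it does not use the DLM API, and the `recovery_needed` flag merely stands in for MD_RECOVER_NEEDED.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Stand-in for the cluster-wide resync lock resource. */
struct resync_lock_sketch {
    int owner; /* node id holding the lock, or NO_OWNER */
};

struct node_sketch {
    int id;
    bool recovery_needed; /* models MD_RECOVER_NEEDED */
};

/*
 * Try to become the resyncing node. On failure, remember that recovery
 * still has to be scheduled later instead of blocking.
 */
static bool try_begin_resync(struct resync_lock_sketch *lock,
                             struct node_sketch *node)
{
    if (lock->owner != NO_OWNER && lock->owner != node->id) {
        node->recovery_needed = true;
        return false;
    }
    lock->owner = node->id;
    node->recovery_needed = false;
    return true;
}

static void end_resync(struct resync_lock_sketch *lock,
                       const struct node_sketch *node)
{
    assert(lock->owner == node->id); /* only the holder may release */
    lock->owner = NO_OWNER;
}
```

A second node that loses the race does not wait; it records that recovery is needed and retries once the holder has released the lock.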

c186b128

md-cluster: Perform a lazy update · 2aa82191

Goldwyn Rodrigues authored Sep 28, 2015

In a clustered environment, a change such as marking a device faulty,
can be recorded by any of the nodes. This is communicated to all the
nodes and re-recording such a change is unnecessary, and quite often
pretty disruptive.

With this patch, just before the update, we check for the changes
and if the changes are already in the superblock, we abort the update
after clearing all the flags.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

2aa82191

md-cluster: Improve md_reload_sb to be less error prone · 70bcecdb

Goldwyn Rodrigues authored Aug 21, 2015

md_reload_sb is too simplistic and it explicitly needs to determine
the changes made by the writing node. However, there are multiple areas
where a simple reload could fail.

Instead, read the superblock of one of the "good" rdevs and update
the necessary information:

- read the superblock into a newly allocated page, by temporarily
  swapping out rdev->sb_page and calling ->load_super.
- if that fails return
- if it succeeds, call check_sb_changes, which:
  1. iterates over the list of active devices and checks the matching
     dev_roles[] value.
     If that is 'faulty', the device must be marked as faulty:
      - call md_error to mark the device as faulty. Make sure
        not to set CHANGE_DEVS and wakeup mddev->thread or else
        it would initiate a resync process, which is the responsibility
        of the "primary" node.
      - clear the Blocked bit
      - call remove_and_add_spares() to hot remove the device.
     If the device is 'spare':
      - call remove_and_add_spares() to get the number of spares
        added in this operation.
      - reduce mddev->degraded to mark the array as not degraded.
  2. resets recovery_cp
- read the rest of the rdevs to update recovery_offset. If recovery_offset
  is equal to MaxSector, call spare_active() to set it In_sync

This required that recovery_offset be initialized to MaxSector, as
opposed to zero so as to communicate the end of sync for a rdev.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
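
The 'faulty' branch of the comparison can be sketched as follows. The struct and function are hypothetical simplifications; only the role value 0xfffe and the idea of comparing in-memory state with the freshly read dev_roles[] come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MD_DISK_ROLE_FAULTY 0xfffe

/* Hypothetical in-memory device state on the receiving node. */
struct rdev_sketch {
    int desc_nr;   /* index into the superblock's dev_roles[] */
    bool faulty;   /* what this node currently believes */
    bool in_array; /* still an active member of the array */
};

/*
 * Compare in-memory state against the roles read from disk and mark
 * devices the writing node recorded as faulty. Returns how many devices
 * changed. No resync is started here; that is left to the writing node.
 */
static int check_sb_changes_sketch(struct rdev_sketch *devs, size_t n,
                                   const uint16_t *dev_roles, size_t nroles)
{
    int changed = 0;

    assert(dev_roles != NULL || nroles == 0);
    for (size_t i = 0; i < n; i++) {
        struct rdev_sketch *d = &devs[i];

        if (d->desc_nr < 0 || (size_t)d->desc_nr >= nroles)
            continue; /* no role recorded for this device */
        if (dev_roles[d->desc_nr] == MD_DISK_ROLE_FAULTY && !d->faulty) {
            d->faulty = true;    /* models md_error() */
            d->in_array = false; /* models the hot remove */
            changed++;
        }
    }
    return changed;
}
```

Because the function only reacts to differences, running it again on the same superblock changes nothing.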

70bcecdb

md: remove_and_add_spares() to activate specific rdev · 2910ff17

Goldwyn Rodrigues authored Sep 28, 2015

remove_and_add_spares() checks for all devices to activate spare.
Change it to activate a specific device if a non-null rdev
argument is passed.

remove_and_add_spares() can be used to activate spares in
slot_store() as well.

For hot_remove_disk(), check if rdev->raid_disk == -1 before
calling remove_and_add_spares().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
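
The "optional rdev argument" pattern looks roughly like this. It is a simplified model with hypothetical names and a caller-supplied slot counter, not the md implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device: raid_disk < 0 means "not in a slot" (a spare). */
struct rdev_sketch {
    int raid_disk;
    bool faulty;
};

/*
 * Activate spares. With `only` == NULL every usable spare is considered;
 * with a non-NULL `only`, just that device is considered.
 * Returns the number of devices activated.
 */
static int activate_spares(struct rdev_sketch *devs, size_t n,
                           const struct rdev_sketch *only, int *next_slot)
{
    int activated = 0;

    assert(next_slot != NULL);
    for (size_t i = 0; i < n; i++) {
        struct rdev_sketch *d = &devs[i];

        if (only != NULL && d != only)
            continue; /* caller asked for one specific device */
        if (d->raid_disk >= 0 || d->faulty)
            continue; /* already active, or unusable */
        d->raid_disk = (*next_slot)++;
        activated++;
    }
    return activated;
}
```

Passing NULL keeps the old "scan everything" behaviour, so existing callers need no change.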

2910ff17

md-cluster: Wake up suspended process · b8ca846e

Goldwyn Rodrigues authored Oct 09, 2015

When the suspended_area is deleted, the suspended processes
must be woken up in order to complete their I/O.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

b8ca846e

md-cluster: send BITMAP_NEEDS_SYNC when node is leaving cluster · 09995411

Guoqing Jiang authored Oct 01, 2015

Previously, the BITMAP_NEEDS_SYNC message was sent when the resync
aborted, but it could abort for different reasons, and not all
of them require another node to take over the resync ownership.

It is better to send the BITMAP_NEEDS_SYNC message only when
the node is leaving the cluster with a dirty bitmap. We also need
to ensure the dlm connection is ok.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

09995411

md-cluster: Use a small window for resync · c40f341f

Goldwyn Rodrigues authored Aug 19, 2015

Suspending the entire device for resync could take too long. Resync
in small chunks.

The cluster's resync window (32M) is maintained in r1conf as
cluster_sync_low and cluster_sync_high and processed in
raid1's sync_request(). If the current resync is outside the cluster
resync window:

1. Set the cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window, if not issue a
   wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it in their suspension
   list.

bitmap_cond_end_sync is modified to allow forcing a sync in order
to bring curr_resync_completed up to date with the sector passed.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
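
The four window steps can be sketched as a pure function. The names are hypothetical, the window is expressed in 512-byte sectors, the barrier wait is reduced to a comment, and the "outside the window" test used here is one reading of the description rather than the exact condition in raid1's sync_request().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 32M window expressed in 512-byte sectors. */
#define RESYNC_WINDOW_SECTORS (32ull * 1024 * 1024 / 512)

struct sync_window_sketch {
    uint64_t low;  /* models cluster_sync_low */
    uint64_t high; /* models cluster_sync_high */
};

/*
 * Move the window if the request [sector_nr, sector_nr + nr_sectors)
 * does not lie inside it. Returns true when the window moved, which is
 * when the other nodes would have to be told about the new range.
 */
static bool advance_window(struct sync_window_sketch *w,
                           uint64_t sector_nr, uint64_t nr_sectors,
                           uint64_t curr_resync_completed)
{
    if (sector_nr >= w->low && sector_nr + nr_sectors <= w->high)
        return false; /* still inside the current window */

    /* Step 1: start the new window at the completed position. */
    w->low = curr_resync_completed;

    /* Step 2: if the request does not fit, restart at the request. */
    if (sector_nr + nr_sectors > w->low + RESYNC_WINDOW_SECTORS) {
        /* The real code waits on the barrier at this point. */
        w->low = sector_nr;
    }

    /* Step 3: fixed-size window. */
    w->high = w->low + RESYNC_WINDOW_SECTORS;

    /* Step 4 (sending the message) is left to the caller. */
    return true;
}
```

Only a 32M range is suspended on the other nodes at any time, so their I/O outside that range keeps flowing while the resync proceeds.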

c40f341f