Commits · e56108d65f8705170d238858616728359542aebb · nexedi / linux

11 Oct, 2012 28 commits

md/raid5: be careful not to resize_stripes too big. · e56108d6

NeilBrown authored Oct 11, 2012

When a RAID5 is reshaping, conf->raid_disks is increased
before mddev->delta_disks becomes zero.
This can result in check_reshape calling resize_stripes with a
number that is too large.  This particularly happens
when md_check_recovery calls ->check_reshape().

If we use ->previous_raid_disks, we don't risk this.
Signed-off-by: NeilBrown <neilb@suse.de>

e56108d6

md: make sure manual changes to recovery checkpoint are saved. · db07d85e

NeilBrown authored Oct 11, 2012

If you make an array bigger but suppress resync of the new region with
  mdadm --grow /dev/mdX --size=max --assume-clean

then stop the array before anything is written to it, the effect of
the "--assume-clean" is lost and the array will resync the new space
when restarted.
So ensure that we update the metadata in the case.
Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>

db07d85e

md/raid10: use correct limit variable · 91502f09

Dan Carpenter authored Oct 11, 2012

Clang complains that we are assigning a variable to itself.  This should
be using bad_sectors like the similar earlier check does.

Bug has been present since 3.1-rc1.  It is minor but could
conceivably cause corruption or other bad behaviour.

Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>

91502f09

md: writing to sync_action should clear the read-auto state. · 48c26ddc

NeilBrown authored Oct 11, 2012

In some cases array are started in 'read-auto' state where in
nothing gets written to any device until the array is written
to.  The purpose of this is to make accidental auto-assembly
of the wrong arrays less of a risk, and to allow arrays to be
started to read suspend-to-disk images without actually changing
anything (as might happen if the array were dirty and a
resync seemed necessary).

Explicitly writing the 'sync_action' for a read-auto array currently
doesn't clear the read-auto state, so the sync action doesn't
happen, which can be confusing.

So allow any successful write to sync_action to clear any read-auto
state.
Reported-by: Alexander Kühn <alexander.kuehn@nagilum.de>
Signed-off-by: NeilBrown <neilb@suse.de>

48c26ddc

Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races · 7f7583d4

Jianpeng Ma authored Oct 11, 2012

Now that multiple threads can handle stripes, it is safer to
use an atomic64_t for resync_mismatches, to avoid update races.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

7f7583d4

md/raid5: make sure to_read and to_write never go negative. · 1ed850f3

NeilBrown authored Oct 11, 2012

to_read and to_write are part of the result of analysing
a stripe before handling it.
Their use is to avoid some loops and tests if the values are
known to be zero.  Thus it is not a problem if they are a
little bit larger than they should be.

So decrementing them in handle_failed_stripe serves little value, and
due to races it could cause some loops to be skipped incorrectly.

So remove those decrements.
Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

1ed850f3

md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write. · a7854487

Alexander Lyakas authored Oct 11, 2012

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Suggested-by: Yair Hershko <yair@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>

a7854487

md/raid5: protect debug message against NULL derefernce. · b97390ae

NeilBrown authored Oct 11, 2012

The pr_debug in add_stripe_bio could race with something
changing *bip, so it is best to hold the lock until
after the pr_debug.
Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

b97390ae

md/raid5: add some missing locking in handle_failed_stripe. · 143c4d05

NeilBrown authored Oct 11, 2012

We really should hold the stripe_lock while accessing
'toread' else we could race with add_stripe_bio and corrupt
a list.
Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

143c4d05

MD: raid5 avoid unnecessary zero page for trim · 9e444768

Shaohua Li authored Oct 11, 2012

We want to avoid zero discarded dev page, because it's useless for discard.
But if we don't zero it, another read/write hit such page in the cache and will
get inconsistent data.

To avoid zero the page, we don't set R5_UPTODATE flag after construction is
done. In this way, discard write request is still issued and finished, but read
will not hit the page. If the stripe gets accessed soon, we need reread the
stripe, but since the chance is low, the reread isn't a big deal.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

9e444768

MD: raid5 trim support · 620125f2

Shaohua Li authored Oct 11, 2012


Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk.  To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.

Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.

The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.

For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard.  Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.

If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.

If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

620125f2

md/bitmap:Don't use IS_ERR to judge alloc_page(). · 582e2e05
Jianpeng Ma authored Oct 11, 2012
```
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
```
582e2e05

md/raid1: Don't release reference to device while handling read error. · 7ad4d4a6

NeilBrown authored Oct 11, 2012

When we get a read error, we arrange for raid1d to handle it.
Currently we release the reference on the device.  This can result
in
   conf->mirrors[read_disk].rdev
being NULL in fix_read_error, if the device happens to get removed
before the read error is handled.

So instead keep the reference until the read error has been fully
handled.
Reported-by: hank <pyu@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

7ad4d4a6

raid: replace list_for_each_continue_rcu with new interface · fd177481

Michael Wang authored Oct 11, 2012

This patch replaces list_for_each_continue_rcu() with
list_for_each_entry_continue_rcu() to save a few lines
of code and allow removing list_for_each_continue_rcu().
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>

fd177481

add further __init annotations to crypto/xor.c · af7cf25d

Jan Beulich authored Oct 11, 2012

Allow particularly do_xor_speed() to be discarded post-init.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

af7cf25d

DM RAID: Fix for "sync" directive ineffectiveness · 761becff

Jonathan Brassow authored Oct 11, 2012

There are two table arguments that can be given to a DM RAID target
that control whether the array is forced to (re)synchronize or skip
initialization: "sync" and "nosync".  When "sync" is given, we set
mddev->recovery_cp to 0 in order to cause the device to resynchronize.
This is insufficient if there is a bitmap in use, because the array
will simply look at the bitmap and see that there is no recovery
necessary.

The fix is to skip over the loading of the superblocks when "sync" is
given, causing new superblocks to be written that will force the array
to go through initialization (i.e. synchronization).
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

761becff

DM RAID: Fix comparison of index and quantity for "rebuild" parameter · 7386199c

Jonathan Brassow authored Oct 11, 2012

DM RAID: Fix comparison of index and quantity for "rebuild" parameter

The "rebuild" parameter takes an index argument that starts counting from
zero.  The conditional used to validate the index was using '>' rather than
'>=', leaving the door open for an index value that would be 1 too large.
Reported-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

7386199c

DM RAID: Add rebuild capability for RAID10 · 4ec1e369

Jonathan Brassow authored Oct 11, 2012

DM RAID: Add code to validate replacement slots for RAID10 arrays

RAID10 can handle 'copies - 1' failures for each mirror group. This code
ensures the user has provided a valid array - one whose devices specified for
rebuild do not exceed the amount of redundancy available.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

4ec1e369

DM RAID: Move 'rebuild' checking code to its own function · eb649123

Jonathan Brassow authored Oct 11, 2012

DM RAID:  Move chunk of code to it's own function

The code that checks whether device replacements/rebuilds are possible given
a specific RAID type is moved to it's own function.  It will further expand
when the code to check RAID10 is added.  A separate function makes it easier
to read.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

eb649123

MD RAID10: Prep for DM RAID10 device replacement capability · 2863b9eb

Jonathan Brassow authored Oct 11, 2012

MD RAID10:  Fix a couple potential kernel panics if RAID10 is used by dm-raid

When device-mapper uses the RAID10 personality through dm-raid.c, there is no
'gendisk' structure in mddev and some sysfs information is also not populated.

This patch avoids touching those non-existent structures.
Signed-off-by: Jonathan Brassow <jbrassow@rehdat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

2863b9eb

md: avoid taking the mutex on some ioctls. · 1ca69c4b

NeilBrown authored Oct 11, 2012

Some ioctls don't need to take the mutex and doing so can cause
a delay as it is held during super-block update.
So move those ioctls out of the mutex and rely on rcu locking
to ensure we don't access stale data.
Signed-off-by: NeilBrown <neilb@suse.de>

1ca69c4b

MD: change the parameter of md thread · 4ed8731d

Shaohua Li authored Oct 11, 2012

Change the thread parameter, so the thread can carry extra info. Next patch
will use it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

4ed8731d

md/raid10: submit IO from originating thread instead of md thread. · 57c67df4

NeilBrown authored Oct 11, 2012

queuing writes to the md thread means that all requests go through the
one processor which may not be able to keep up with very high request
rates.

So use the plugging infrastructure to submit all requests on unplug.
If a 'schedule' is needed, we fall back on the old approach of handing
the requests to the thread for it to handle.

This is nearly identical to a recent patch which provided similar
functionality to RAID1.
Signed-off-by: NeilBrown <neilb@suse.de>

57c67df4

md: raid 10 supports TRIM · 532a2a3f

Shaohua Li authored Oct 11, 2012


This makes md raid 10 support TRIM.

If one disk supports discard and another not, or one has
discard_zero_data and another not, there could be inconsistent between
data from such disks. But this should not matter, discarded data is
useless. This will add extra copy in rebuild though.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

532a2a3f

md: raid 1 supports TRIM · 2ff8cc2c

Shaohua Li authored Oct 11, 2012

This makes md raid 1 support TRIM.
If one disk supports discard and another not, or one has discard_zero_data and
another not, there could be inconsistent between data from such disks. But this
should not matter, discarded data is useless. This will add extra copy in rebuild
though.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

2ff8cc2c

md: raid 0 supports TRIM · c83057a1

Shaohua Li authored Oct 11, 2012

This makes md raid 0 support TRIM.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

c83057a1

md: linear supports TRIM · f1cad2b6

Shaohua Li authored Oct 11, 2012

This makes md linear support TRIM.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

f1cad2b6

md/linear: rcu_dereference outside read-lock section · bc78c573

Denis Efremov authored Oct 11, 2012

According to the comment in linear_stop function
rcu_dereference in linear_start and linear_stop functions
occurs under reconfig_mutex. The patch represents this
agreement in code and prevents lockdep complaint.

Found by Linux Driver Verification project (linuxtesting.org)
Signed-off-by: Denis Efremov <yefremov.denis@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

bc78c573

28 Sep, 2012 2 commits

block: makes bio_split support bio without data · 02f3939e

Shaohua Li authored Sep 28, 2012

discard bio hasn't data attached. We hit a BUG_ON with such bio. This makes
bio_split works for such bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

02f3939e

scatterlist: refactor the sg_nents · 232f1b51

Maxim Levitsky authored Sep 28, 2012

Replace 'while' with 'for' as suggested by Tejun Heo
Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

232f1b51

27 Sep, 2012 2 commits

scatterlist: add sg_nents · 2e484610

Maxim Levitsky authored Sep 27, 2012

Useful helper to know the number of entries in scatterlist.
Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Alex Dubov <oakad@yahoo.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2e484610

fs: fix include/percpu-rwsem.h export error · c2b1ad80

Jens Axboe authored Sep 27, 2012

We get the following export error on the include file:

usr/include/linux/fs.h:13: included file 'linux/percpu-rwsem.h' is not exported

Move the include inside the __KERNEL__ section.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c2b1ad80

26 Sep, 2012 4 commits

percpu-rw-semaphore: fix documentation typos · e6b5c082

Mikulas Patocka authored Sep 26, 2012

One more patch for this thing, fixing some typos in the documentation.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e6b5c082

fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared · 3eab7315

Fengguang Wu authored Sep 26, 2012

blkdev_mmap() isn't used outside of fs/block_dev.c, mark it as
static.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3eab7315

blockdev: turn a rw semaphore into a percpu rw semaphore · 62ac665f

Mikulas Patocka authored Sep 26, 2012

This avoids cache line bouncing when many processes lock the semaphore
for read.

New percpu lock implementation

The lock consists of an array of percpu unsigned integers, a boolean
variable and a mutex.

When we take the lock for read, we enter rcu read section, check for a
"locked" variable. If it is false, we increase a percpu counter on the
current cpu and exit the rcu section. If "locked" is true, we exit the
rcu section, take the mutex and drop it (this waits until a writer
finished) and retry.

Unlocking for read just decreases percpu variable. Note that we can
unlock on a difference cpu than where we locked, in this case the
counter underflows. The sum of all percpu counters represents the number
of processes that hold the lock for read.

When we need to lock for write, we take the mutex, set "locked" variable
to true and synchronize rcu. Since RCU has been synchronized, no
processes can create new read locks. We wait until the sum of percpu
counters is zero - when it is, there are no readers in the critical
section.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

62ac665f

Fix a crash when block device is read and block size is changed at the same time · b87570f5

Mikulas Patocka authored Sep 26, 2012

The kernel may crash when block size is changed and I/O is issued
simultaneously.

Because some subsystems (udev or lvm) may read any block device anytime,
the bug actually puts any code that changes a block device size in
jeopardy.

The crash can be reproduced if you place "msleep(1000)" to
blkdev_get_blocks just before "bh->b_size = max_blocks <<
inode->i_blkbits;".
Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct"
While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0"
You get a BUG.

The direct and non-direct I/O is written with the assumption that block
size does not change. It doesn't seem practical to fix these crashes
one-by-one there may be many crash possibilities when block size changes
at a certain place and it is impossible to find them all and verify the
code.

This patch introduces a new rw-lock bd_block_size_semaphore. The lock is
taken for read during I/O. It is taken for write when changing block
size. Consequently, block size can't be changed while I/O is being
submitted.

For asynchronous I/O, the patch only prevents block size change while
the I/O is being submitted. The block size can change when the I/O is in
progress or when the I/O is being finished. This is acceptable because
there are no accesses to block size when asynchronous I/O is being
finished.

The patch prevents block size changing while the device is mapped with
mmap.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b87570f5

21 Sep, 2012 2 commits

block: fix request_queue->flags initialization · 60ea8226

Tejun Heo authored Sep 20, 2012

A queue newly allocated with blk_alloc_queue_node() has only
QUEUE_FLAG_BYPASS set.  For request-based drivers,
blk_init_allocated_queue() is called and q->queue_flags is overwritten
with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the
initial bypass is still in effect.

In blk_init_allocated_queue(), or QUEUE_FLAG_DEFAULT to q->queue_flags
instead of overwriting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

60ea8226

block: lift the initial queue bypass mode on blk_register_queue() instead of... · 749fefe6

Tejun Heo authored Sep 20, 2012

block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()

b82d4b19 ("blkcg: make request_queue bypassing on allocation") made
request_queues bypassed on allocation to avoid switching on and off
bypass mode on a queue being initialized.  Some drivers allocate and
then destroy a lot of queues without fully initializing them and
incurring bypass latency overhead on each of them could add upto
significant overhead.

Unfortunately, blk_init_allocated_queue() is never used by queues of
bio-based drivers, which means that all bio-based driver queues are in
bypass mode even after initialization and registration complete
successfully.

Due to the limited way request_queues are used by bio drivers, this
problem is hidden pretty well but it shows up when blk-throttle is
used in combination with a bio-based driver.  Trying to configure
(echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
indefinitely in blkg_conf_prep() waiting for bypass mode to end.

This patch moves the initial blk_queue_bypass_end() call from
blk_init_allocated_queue() to blk_register_queue() which is called for
any userland-visible queues regardless of its type.

I believe this is correct because I don't think there is any block
driver which needs or wants working elevator and blk-cgroup on a queue
which isn't visible to userland.  If there are such users, we need a
different solution.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
Cc: stable@vger.kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

749fefe6

20 Sep, 2012 2 commits

block: ioctl to zero block ranges · 66ba32dc

Martin K. Petersen authored Sep 18, 2012

Introduce a BLKZEROOUT ioctl which can be used to clear block ranges by
way of blkdev_issue_zeroout().
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

66ba32dc

block: Make blkdev_issue_zeroout use WRITE SAME · 579e8f3c

Martin K. Petersen authored Sep 18, 2012

If the device supports WRITE SAME, use that to optimize zeroing of
blocks. If the device does not support WRITE SAME or if the operation
fails, fall back to writing zeroes the old-fashioned way.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

579e8f3c