Commits · 65b51a009e29e64c0951f21ea17fdc66bbb0fbd7 · nexedi / linux

25 Sep, 2008 40 commits

btrfs_search_slot: reduce lock contention by cowing in two stages · 65b51a00

Chris Mason authored Aug 01, 2008

A btree block cow has two parts: the first is to allocate a destination
block and the second is to copy the old block over.

The first part needs locks in the extent allocation tree, and may need to
do IO. This changeset splits that into a separate function that can be
called without any tree locks held.

btrfs_search_slot is changed to drop its path and start over if it has
to COW a contended block. This often means that many writers will
pre-alloc a new destination for the same contended block, but they
cache their prealloc for later use on lower levels in the tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
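
The two-stage idea described above can be sketched in userspace C. This is a minimal illustration, not the kernel code: `search_ctx`, `prealloc_dest`, `cow_copy`, and `search_and_cow` are hypothetical names, the tree lock is simulated with a flag, and the sketch restarts whenever no preallocation is cached, whereas the commit only restarts for a contended block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BLOCK_SIZE 64   /* arbitrary size chosen for the sketch */

struct block {
    unsigned char data[DEMO_BLOCK_SIZE];
};

/* Hypothetical per-search state; all names are assumptions. */
struct search_ctx {
    struct block *prealloc;  /* destination allocated with no tree lock held */
    int tree_locked;         /* 1 while the simulated tree lock is held */
    int restarts;            /* how often the search dropped its path */
};

/* Stage 1: allocate the destination. Must run without the tree lock. */
static int prealloc_dest(struct search_ctx *ctx)
{
    assert(!ctx->tree_locked);
    if (!ctx->prealloc)
        ctx->prealloc = malloc(sizeof(struct block));
    return ctx->prealloc ? 0 : -1;
}

/* Stage 2: copy the old block into the cached destination, under the lock. */
static struct block *cow_copy(struct search_ctx *ctx, const struct block *old)
{
    struct block *dst = ctx->prealloc;

    assert(ctx->tree_locked && dst != NULL);
    memcpy(dst, old, sizeof(*dst));
    ctx->prealloc = NULL;    /* the preallocation has been consumed */
    return dst;
}

/* Search loop: drop the lock and start over if a destination is missing. */
static struct block *search_and_cow(struct search_ctx *ctx,
                                    const struct block *old)
{
    for (;;) {
        struct block *copy;

        ctx->tree_locked = 1;            /* take the simulated tree lock */
        if (!ctx->prealloc) {
            ctx->tree_locked = 0;        /* drop the path before allocating */
            if (prealloc_dest(ctx) != 0)
                return NULL;
            ctx->restarts++;
            continue;                    /* start the search over */
        }
        copy = cow_copy(ctx, old);
        ctx->tree_locked = 0;
        return copy;
    }
}
```

The expensive allocation happens only while `tree_locked` is 0, so the time spent holding the lock is limited to the copy itself.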


Btrfs: Throttle less often waiting for snapshots to delete · 18e35e0a
Chris Mason authored Aug 01, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Improve and cleanup locking done by walk_down_tree · f87f057b

Chris Mason authored Aug 01, 2008

While dropping snapshots, walk_down_tree does most of the work of checking
reference counts and limiting tree traversal to just the blocks that
we are freeing.

It dropped and held the allocation mutex in strange and confusing ways;
this commit changes it to hold the mutex only while actually freeing a block.

The rest of the checks around reference counts should be safe without the lock
because we only allow one process in btrfs_drop_snapshot at a time. Other
processes dropping reference counts should not drop it to 1 because
their tree roots already have an extra ref on the block.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Hold a reference on bios during submit_bio, add some extra bio checks · 492bb6de
Chris Mason authored Jul 31, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Drop some debugging around the extent_map pinned flag · 3ce7e67a
Chris Mason authored Jul 31, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Fix streaming read performance with checksumming on · 61b49440

Chris Mason authored Jul 31, 2008

Large streaming reads make for large bios, which means each entry on the
async work queue lists represents a large amount of data. IO
congestion throttling on the device was kicking in before the async
worker threads decided a single thread was busy and needed some help.

The end result was that a streaming read would result in a single CPU
running at 100% instead of balancing the work off to other CPUs.

This patch also changes the pre-IO checksum lookup done by reads to
work on a per-bio basis instead of a per-page basis. The per-page lookup
resulted in many extra btree lookups on large streaming reads. Doing the checksum lookup
right before bio submit allows us to reuse searches while processing
adjacent offsets.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Throttle tuning · 37d1aeee

Chris Mason authored Jul 31, 2008

This avoids waiting for transactions with pages locked by breaking out
the code to wait for the current transaction to close into a function
called by btrfs_throttle.

It also lowers the limits for where we start throttling.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Add missing hunk from Yan Zheng's cache reclaim patch · 47ac14fa
Chris Mason authored Jul 31, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Add compatibility for kernels >= 2.6.27-rc1 · 0ee0fda0

Sven Wegener authored Jul 30, 2008

Add a couple of #if's to follow API changes.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: implement memory reclaim for leaf reference cache · bcc63abb

Yan authored Jul 30, 2008

The memory reclaiming issue happens when a snapshot exists. In that
case, some cache entries may not be used while an old snapshot is dropped,
so they will remain in the cache until umount.

The patch adds a field to struct btrfs_leaf_ref to record its creation time. In
addition, the patch links all dead roots of a given snapshot together in order of
creation time. After an old snapshot is completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Fix verify_parent_transid · 33958dc6

Chris Mason authored Jul 30, 2008

It was incorrectly clearing the up to date flag on the buffer even
when the buffer properly verified.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Update and fix mount -o nodatacow · f321e491

Yan Zheng authored Jul 30, 2008

To check whether a given file extent is referenced by multiple snapshots, the
checker walks down the fs tree through the dead root and checks all tree blocks in
the path.

We can easily detect whether a given tree block is directly referenced by another
snapshot. We can also detect any indirect reference from another snapshot by
checking the reference's generation. The checker can always detect multiple
references, but can't reliably detect the case of a single reference. So btrfs may
do file data cow even when there is only one reference.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: async-thread: fix possible memory leak · 3bf10418

Li Zefan authored Jul 30, 2008

When kthread_run() returns failure, this worker hasn't been
added to the list, so btrfs_stop_workers() won't free it.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
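
The leak pattern described above can be illustrated with a small userspace sketch. All names here (`worker_pool`, `pool_add_worker`, `pool_stop_workers`, `start_ok`, `start_fail`) are hypothetical, and the `start_fn` callback merely stands in for `kthread_run()`; the point is that a worker that never reached the list must be freed on the failure path.

```c
#include <assert.h>
#include <stdlib.h>

struct worker {
    struct worker *next;
    int id;
};

struct worker_pool {
    struct worker *head;
    int count;
};

/* Demo start callbacks; they stand in for kthread_run() succeeding/failing. */
static int start_ok(struct worker *w)   { (void)w; return 0; }
static int start_fail(struct worker *w) { (void)w; return -1; }

/* Allocate a worker, start it, and link it into the pool on success. */
static int pool_add_worker(struct worker_pool *pool, int id,
                           int (*start_fn)(struct worker *))
{
    struct worker *w;

    assert(pool != NULL && start_fn != NULL);
    w = calloc(1, sizeof(*w));
    if (!w)
        return -1;
    w->id = id;

    if (start_fn(w) != 0) {
        /* Not on the list yet, so pool_stop_workers() would never see
         * this worker. Free it here to avoid the leak. */
        free(w);
        return -1;
    }

    w->next = pool->head;
    pool->head = w;
    pool->count++;
    return 0;
}

/* Tear down every worker that made it onto the list. */
static void pool_stop_workers(struct worker_pool *pool)
{
    while (pool->head) {
        struct worker *w = pool->head;

        pool->head = w->next;
        free(w);
        pool->count--;
    }
}
```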


Btrfs: Throttle operations if the reference cache gets too large · ab78c84d

Chris Mason authored Jul 29, 2008

A large reference cache is directly related to a lot of work pending
for the cleaner thread.  This throttles back new operations based on
the size of the reference cache so the cleaner thread will be able to keep
up.

Overall, this actually makes the FS faster because the cleaner thread will
be more likely to find things in cache.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Fix version.sh when used outside of an hg repo · 1a3f5d04
Chris Mason authored Jul 29, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Leaf reference cache update · 017e5369

Chris Mason authored Jul 28, 2008

This changes the reference cache to make a single cache per root
instead of one cache per transaction, and to key by the byte number
of the disk block instead of the keys inside.

This makes it much less likely to have cache misses if a snapshot
or something has an extra reference on a higher node or a leaf while
the first transaction that added the leaf into the cache is dropping.

Some throttling is added to functions that free blocks heavily so they
wait for old transactions to drop.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Add a leaf reference cache · 31153d81

Yan Zheng authored Jul 28, 2008

Much of the IO done while dropping snapshots is done looking up
leaves in the filesystem trees to see if they point to any extents and
to drop the references on any extents found.

This creates a cache so that IO isn't required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Rev the disk format magic · 3a115f52
Chris Mason authored Jul 24, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Null terminate strings passed in from userspace · 5516e595

Mark Fasheh authored Jul 24, 2008

The 'char name[BTRFS_PATH_NAME_MAX]' member of struct btrfs_ioctl_vol_args
is passed directly to strlen() after being copied from user. I haven't
verified this, but in theory a userspace program could pass in an
unterminated string and cause a kernel crash as strlen walks off the end of
the array.

This patch terminates the ->name string in all btrfs ioctl functions which
currently use a 'struct btrfs_ioctl_vol_args'. Since the string is now
properly terminated, its length will never be longer than
BTRFS_PATH_NAME_MAX so that error check has been removed.

By the way, it might be better overall to just have the ioctl pass an
unterminated string + length structure but I didn't bother with that since
it'd change the kernel/user interface.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
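
A minimal userspace sketch of the termination fix described above. `PATH_NAME_MAX` and `struct vol_args` are simplified stand-ins for `BTRFS_PATH_NAME_MAX` and `struct btrfs_ioctl_vol_args`, `copy_vol_args` is a hypothetical helper, and `memcpy()` plays the role of the copy from userspace; the actual array sizes in the kernel may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for BTRFS_PATH_NAME_MAX; the value is an assumption. */
#define PATH_NAME_MAX 16

/* Simplified stand-in for struct btrfs_ioctl_vol_args. */
struct vol_args {
    char name[PATH_NAME_MAX];
};

/*
 * Copy an untrusted, possibly unterminated buffer into args->name and
 * force termination. Returns the length of the resulting string.
 */
static size_t copy_vol_args(struct vol_args *args, const char *src,
                            size_t src_len)
{
    size_t n;

    assert(args != NULL && src != NULL);
    n = src_len < sizeof(args->name) ? src_len : sizeof(args->name);

    memset(args->name, 0, sizeof(args->name));
    memcpy(args->name, src, n);

    /* The fix: always terminate, so strlen() cannot walk off the array. */
    args->name[sizeof(args->name) - 1] = '\0';
    return strlen(args->name);
}
```

Because the last byte is always overwritten, the resulting length can never exceed `PATH_NAME_MAX - 1`, which is why a separate length check becomes redundant.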


Fix path slots selection in btrfs_search_forward · 9652480b

Yan authored Jul 24, 2008

We should decrease the found slot by one, as btrfs_search_slot does,
when bin_search returns 1 and node level > 0.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
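
The slot adjustment can be sketched as follows. This is an illustration under assumptions, not the btrfs code: keys are plain integers, and `bin_search` here is a simplified version that returns 0 on an exact match and 1 otherwise, leaving the slot at the first key greater than the target. `node_child_slot` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Binary search over sorted keys. On an exact hit, returns 0 and sets
 * *slot to the matching index. Otherwise returns 1 and sets *slot to the
 * index of the first key greater than `key` (the insertion point).
 */
static int bin_search(const unsigned long *keys, int nr, unsigned long key,
                      int *slot)
{
    int low = 0, high = nr;

    assert(keys != NULL && slot != NULL && nr >= 0);
    while (low < high) {
        int mid = low + (high - low) / 2;

        if (keys[mid] < key) {
            low = mid + 1;
        } else if (keys[mid] > key) {
            high = mid;
        } else {
            *slot = mid;
            return 0;
        }
    }
    *slot = low;
    return 1;
}

/*
 * Pick the child pointer to follow in an interior node (level > 0).
 * Each key in a node is the smallest key of its child, so after a miss
 * the target lives under the previous slot.
 */
static int node_child_slot(const unsigned long *keys, int nr,
                           unsigned long key)
{
    int slot;
    int ret = bin_search(keys, nr, key, &slot);

    if (ret && slot > 0)
        slot -= 1;
    return slot;
}
```

With node keys `{10, 20, 30}`, a search for 25 misses at slot 2, and the decrement selects slot 1, the child whose range starts at 20.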


Btrfs: Fix .. lookup corner case · 445dceb7

Yan authored Jul 24, 2008

The inode ref item can be in the next leaf when we find "path->slots[0] ==
btrfs_header_nritems(...)".
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Properly release lock in pin_down_bytes · 974e35a8

Yan authored Jul 24, 2008

When the buffer isn't uptodate, pin_down_bytes may leave the tree locked
after it returns.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Remove unused variable in fixup_tree_root_location · 45467261

Balaji Rao authored Jul 24, 2008

Remove an unused variable 'path' in fixup_tree_root_location.
Signed-off-by: Balaji Rao <balajirrao@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Fix a few functions that exit without stopping their transaction · 8e8a1e31
Josef Bacik authored Jul 24, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Create orphan inode records to prevent lost files after a crash · 7b128766
Josef Bacik authored Jul 24, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Add ACL support · 33268eaf
Josef Bacik authored Jul 24, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Remove unused xattr code · 6099afe8
Josef Bacik authored Jul 24, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Implement new dir index format · aec7477b
Josef Bacik authored Jul 24, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Fix the defragmentation code and the block relocation code for data=ordered · 3eaa2885

Chris Mason authored Jul 24, 2008

Before setting an extent to delalloc, the code needs to wait for
pending ordered extents.

Also, the relocation code needs to wait for ordered IO before scanning
the block group again.  This is because the extents are not removed
until the IO for the new extents is finished.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Use assert_spin_locked instead of spin_trylock · 64f26f74
David Woodhouse authored Jul 24, 2008
```
On UP systems spin_trylock always succeeds
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Add version strings on module load · b3c3da71
Chris Mason authored Jul 23, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Fix some build problems on 2.6.18 based enterprise kernels · 4881ee5a
Chris Mason authored Jul 24, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Search data ordered extents first for checksums on read · 89642229

Chris Mason authored Jul 24, 2008

Checksum items are not inserted into the tree until all of the io from a
given extent is complete. This means one dirty page from an extent may
be written, freed, and then read again before the entire extent is on disk
and the checksum item is inserted.

The checksums themselves are stored in the ordered extent so they can
be inserted in bulk when IO is complete. On read, if a checksum item isn't
found, the ordered extents were being searched for a checksum record.

This all worked most of the time, but the checksum insertion code tries
to reduce the number of tree operations by pre-inserting checksum items
based on i_size and a few other factors. This means the read code might
find a checksum item that hasn't yet really been filled in.

This commit changes things to check the ordered extents first and only
dive into the btree if nothing was found. This removes the need for
extra locking and is more reliable.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
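
The lookup order described above can be sketched like this. The flat arrays stand in for the in-memory ordered extents and the on-disk checksum tree, and `csum_entry`, `find_csum`, and `lookup_csum` are hypothetical names. The sketch only shows the ordering: in-flight checksums win over a tree item that may have been pre-inserted but not yet filled in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One checksum record keyed by file offset (simplified for the sketch). */
struct csum_entry {
    uint64_t offset;
    uint32_t csum;
};

/* Linear scan; returns the matching entry or NULL. */
static const struct csum_entry *find_csum(const struct csum_entry *tab,
                                          size_t n, uint64_t offset)
{
    for (size_t i = 0; i < n; i++) {
        if (tab[i].offset == offset)
            return &tab[i];
    }
    return NULL;
}

/*
 * Look in the ordered extents first and fall back to the tree only if
 * nothing was found there. Returns 0 on success, -1 if no checksum exists.
 */
static int lookup_csum(const struct csum_entry *ordered, size_t n_ordered,
                       const struct csum_entry *tree, size_t n_tree,
                       uint64_t offset, uint32_t *out)
{
    const struct csum_entry *e;

    assert(out != NULL);
    e = find_csum(ordered, n_ordered, offset);
    if (!e)
        e = find_csum(tree, n_tree, offset);
    if (!e)
        return -1;
    *out = e->csum;
    return 0;
}
```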


Btrfs: Fix 32 bit compiles by using an unsigned long byte count in the ordered extent · 9ba4611a
Chris Mason authored Jul 23, 2008
```
The ordered extents have to fit in memory, so an unsigned long is sufficient.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Take the csum mutex while reading checksums · ed98b56a
Chris Mason authored Jul 22, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: alloc_mutex latency reduction · c286ac48

Chris Mason authored Jul 22, 2008

This releases the alloc_mutex in a few places that hold it across long
operations.  btrfs_lookup_block_group is changed so that it doesn't need
the mutex at all.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Add some conditional schedules near the alloc_mutex · e34a5b4f

Chris Mason authored Jul 22, 2008

This helps prevent stalls, especially while the snapshot cleaner is
running hard.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Use mutex_lock_nested for tree locking · 6dddcbeb

Chris Mason authored Jul 22, 2008

Lockdep has the notion of locking subclasses so that you can identify
locks you expect to be taken after other locks of the same class. This
changes the per-extent buffer btree locking routines to use a subclass based
on the level in the tree.

Unfortunately, lockdep can only handle 8 total subclasses, and the btrfs
max level is also 8. So when lockdep is on, use a lower max level.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Fix some data=ordered related data corruptions · f421950f

Chris Mason authored Jul 22, 2008

Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree.  The tree was caching the last
pointer returned, and searches would check the last pointer first.

But, search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.

For now, the code to cache the last return value is just removed.  It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.

This commit also replaces do_sync_mapping_range with a local copy of the
related functions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Use a mutex in the extent buffer for tree block locking · a61e6f29

Chris Mason authored Jul 22, 2008

This replaces the use of the page cache lock bit for locking, which wasn't
suitable for block size < page size and couldn't be used recursively.

The mutexes alone don't fix either problem, but they are the first step.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
