Commits · 4008c04a07c73ec3cb1be4c1391d2159a8f75d6d · Kirill Smelkov / linux

12 Feb, 2009 10 commits

Btrfs: make a lockdep class for the extent buffer locks · 4008c04a

Chris Mason authored Feb 12, 2009

Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block. But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.

The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.

btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks. Otherwise, lockdep
gets upset about bad lock orderin.

The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

4008c04a

Btrfs: fs/btrfs/volumes.c: remove useless kzalloc · 3f3420df

Julia Lawall authored Feb 12, 2009

The call to kzalloc is followed by a kmalloc whose result is stored in the
same variable.

The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,l;
position p1,p2;
expression *ptr != NULL;
@@

(
if ((x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...)) == NULL) S
|
x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
)
<... when != x
     when != if (...) { <+...x...+> }
x->f = E
...>
(
 return \(0\|<+...x...+>\|ptr\);
|
 return@p2 ...;
)

@script:python@
p1 << r.p1;
p2 << r.p2;
@@

print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3f3420df

Btrfs: remove unused code in split_state() · a48ddf08

Qinghuang Feng authored Feb 12, 2009

These two lines are not used, remove them.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a48ddf08

Btrfs: remove btrfs_init_path · e00f7308

Jeff Mahoney authored Feb 12, 2009

btrfs_init_path was initially used when the path objects were on the
stack.  Now all the work is done by btrfs_alloc_path and btrfs_init_path
isn't required.

This patch removes it, and just uses kmem_cache_zalloc to zero out the object.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e00f7308

Btrfs: balance_level checks !child after access · 7951f3ce

Jeff Mahoney authored Feb 12, 2009

The BUG_ON() is in the wrong spot.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

7951f3ce

Btrfs: Avoid using __GFP_HIGHMEM with slab allocator · b335b003

Yan Zheng authored Feb 12, 2009

btrfs_releasepage may call kmem_cache_alloc indirectly,
and provide same GFP flags it gets to kmem_cache_alloc.
So it's possible to use __GFP_HIGHMEM with the slab
allocator.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

b335b003

Btrfs: don't clean old snapshots on sync(1) · e1df36d2

Chris Mason authored Feb 12, 2009

Cleaning old snapshots can make sync(1) somewhat slow, and some users
and applications still use it in a global fsync kind of workload.

This patch changes btrfs not to clean old snapshots during sync, which is
safe from a FS consistency point of view. The major downside is that it
makes it difficult to tell when old snapshots have been reaped and
the space they were using has been reclaimed. A new ioctl will be added
for this purpose instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e1df36d2

Btrfs: use larger metadata clusters in ssd mode · 536ac8ae

Chris Mason authored Feb 12, 2009

Larger metadata clusters can significantly improve writeback performance
on ssd drives with large erasure blocks.  The larger clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle.

On spinning media, lager metadata clusters end up spreading out the
metadata more over time, which makes fsck slower, so we don't want this
to be the default.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

536ac8ae

Btrfs: process mount options on mount -o remount, · b288052e

Chris Mason authored Feb 12, 2009

Btrfs wasn't parsing any new mount options during remount, making it
difficult to set mount options on a root drive.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

b288052e

Btrfs: make sure all pending extent operations are complete · eb099670

Josef Bacik authored Feb 12, 2009

Theres a slight problem with finish_current_insert, if we set all to 1 and then
go through and don't actually skip any of the extents on the pending list, we
could exit right after we've added new extents.

This is a problem because by inserting the new extents we could have gotten new
COW's to happen and such, so we may have some pending updates to do or even
more inserts to do after that.

So this patch will only exit if we have never skipped any of the extents in the
pending list, and we have no extents to insert, this will make sure that all of
the pending work is truly done before we return. I've been running with this
patch for a few days with all of my other testing and have not seen issues.
Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>

eb099670

09 Feb, 2009 1 commit

Btrfs: don't use spin_is_contended · 284b066a

Chris Mason authored Feb 09, 2009

Btrfs was using spin_is_contended to see if it should drop locks before
doing extent allocations during btrfs_search_slot.  The idea was to avoid
expensive searches in the tree unless the lock was actually contended.

But, spin_is_contended is specific to the ticket spinlocks on x86, so this
is causing compile errors everywhere else.

In practice, the contention could easily appear some time after we started
doing the extent allocation, and it makes more sense to always drop the lock
instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

284b066a

06 Feb, 2009 1 commit

Btrfs: Make sure dir is non-null before doing S_ISGID checks · 42f15d77

Chris Mason authored Feb 06, 2009

The S_ISGID check in btrfs_new_inode caused an oops during subvol creation
because sometimes the dir is null.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

42f15d77

05 Feb, 2009 1 commit

Btrfs: Fix memory leak in cache_drop_leaf_ref · 806638bc

Chris Mason authored Feb 05, 2009

The code wasn't doing a kfree on the sorted array
Signed-off-by: Chris Mason <chris.mason@oracle.com>

806638bc

04 Feb, 2009 17 commits

Btrfs: don't return congestion in write_cache_pages as often · 9b0d3ace

Chris Mason authored Feb 04, 2009

On fast devices that go from congested to uncongested very quickly, pdflush
is waiting too often in congestion_wait, and the FS is backing off to
easily in write_cache_pages.

For now, fix this on the btrfs side by only checking congestion after
some bios have already gone down.  Longer term a real fix is needed
for pdflush, but that is a larger project.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

9b0d3ace

Btrfs: Only prep for btree deletion balances when nodes are mostly empty · 7b78c170

Chris Mason authored Feb 04, 2009

Whenever an item deletion is done, we need to balance all the nodes
in the tree to make sure we don't end up with an empty node if a pointer
is deleted.  This balance prep happens from the root of the tree down
so we can drop our locks as we go.

reada_for_balance was triggering read-ahead on neighboring nodes even
when no balancing was required.  This adds an extra check to avoid
calling balance_level() and avoid reada_for_balance() when a balance
won't be required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

7b78c170

Btrfs: fix btrfs_unlock_up_safe to walk the entire path · 12f4dacc

Chris Mason authored Feb 04, 2009

btrfs_unlock_up_safe would break out at the first NULL node entry or
unlocked node it found in the path.

Some of the callers have missing nodes at the lower levels of the path, so this
commit fixes things to check all the nodes in the path before returning.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

12f4dacc

Btrfs: change btrfs_del_leaf to drop locks earlier · 4d081c41

Chris Mason authored Feb 04, 2009

btrfs_del_leaf does two things.  First it removes the pointer in the
parent, and then it frees the block that has the leaf.  It has the
parent node locked for both operations.

But, it only needs the parent locked while it is deleting the pointer.
After that it can safely free the block without the parent locked.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

4d081c41

Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode · 06d9a8d7

Chris Mason authored Feb 04, 2009

btrfs_truncate_inode_items is setup to stop doing btree searches when
it has finished removing the items for the inode.  It used to detect the
end of the inode by looking for an objectid that didn't match the
one we were searching for.

But, this would result in an extra search through the btree, which
adds extra balancing and cow costs to the operation.

This commit adds a check to see if we found the inode item, which means
we can stop searching early.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

06d9a8d7

Btrfs: Don't try to compress pages past i_size · f03d9301

Chris Mason authored Feb 04, 2009

The compression code had some checks to make sure we were only
compressing bytes inside of i_size, but it wasn't catching every
case.  To make things worse, some incorrect math about the number
of bytes remaining would make it try to compress more pages than the
file really had.

The fix used here is to fall back to the non-compression code in this
case, which does all the proper cleanup of delalloc and other accounting.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f03d9301

Btrfs: join the transaction in __btrfs_setxattr · 81144949

Josef Bacik authored Feb 04, 2009

With selinux on we end up calling __btrfs_setxattr when we create an inode,
which calls btrfs_start_transaction(). The problem is we've already called
that in btrfs_new_inode, and in btrfs_start_transaction we end up doing a
wait_current_trans(). If btrfs-transaction has started committing it will wait
for all handles to finish, while the other process is waiting for the
transaction to commit. This is fixed by using btrfs_join_transaction, which
won't wait for the transaction to commit. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>

81144949

Btrfs: Handle SGID bit when creating inodes · 8c087b51

Chris Ball authored Feb 04, 2009

Before this patch, new files/dirs would ignore the SGID bit on their
parent directory and always be owned by the creating user's uid/gid.
Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

8c087b51

Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks · bd56b302

Chris Mason authored Feb 04, 2009

Every transaction in btrfs creates a new snapshot, and then schedules the
snapshot from the last transaction for deletion.  Snapshot deletion
works by walking down the btree and dropping the reference counts
on each btree block during the walk.

If if a given leaf or node has a reference count greater than one,
the reference count is decremented and the subtree pointed to by that
node is ignored.

If the reference count is one, walking continues down into that node
or leaf, and the references of everything it points to are decremented.

The old code would try to work in small pieces, walking down the tree
until it found the lowest leaf or node to free and then returning.  This
was very friendly to the rest of the FS because it didn't have a huge
impact on other operations.

But it wouldn't always keep up with the rate that new commits added new
snapshots for deletion, and it wasn't very optimal for the extent
allocation tree because it wasn't finding leaves that were close together
on disk and processing them at the same time.

This changes things to walk down to a level 1 node and then process it
in bulk.  All the leaf pointers are sorted and the leaves are dropped
in order based on their extent number.

The extent allocation tree and commit code are now fast enough for
this kind of bulk processing to work without slowing the rest of the FS
down.  Overall it does less IO and is better able to keep up with
snapshot deletions under high load.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

bd56b302

Btrfs: Change btree locking to use explicit blocking points · b4ce94de

Chris Mason authored Feb 04, 2009

Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.

So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.

This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.

We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.

The basic idea is:

btrfs_tree_lock() returns with the spin lock held

btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock.  The buffer is
still considered locked by all of the btrfs code.

If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.

Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time.  So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.

btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.

btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.

ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

b4ce94de

Btrfs: hash_lock is no longer needed · c487685d

Chris Mason authored Feb 04, 2009

Before metadata is written to disk, it is updated to reflect that writeout
has begun.  Once this update is done, the block must be cow'd before it
can be modified again.

This update was originally synchronized by using a per-fs spinlock.  Today
the buffers for the metadata blocks are locked before writeout begins,
and everyone that tests the flag has the buffer locked as well.

So, the per-fs spinlock (called hash_lock for no good reason) is no
longer required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

c487685d

Btrfs: disable leak debugging checks in extent_io.c · 3935127c

Chris Mason authored Feb 04, 2009

extent_io.c has debugging code to report and free leaked extent_state
and extent_buffer objects at rmmod time.  This helps track down
leaks and it saves you from rebooting just to properly remove the
kmem_cache object.

But, the code runs under a fairly expensive spinlock and the checks to
see if it is currently enabled are not entirely consistent.  Some use
#ifdef and some #if.

This changes everything to #if and disables the leak checking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3935127c

Btrfs: sort references by byte number during btrfs_inc_ref · b7a9f29f

Chris Mason authored Feb 04, 2009

When a block goes through cow, we update the reference counts of
everything that block points to. The internal pointers of the block
can be in just about any order, and it is likely to have clusters of
things that are close together and clusters of things that are not.

To help reduce the seeks that come with updating all of these reference
counts, sort them by byte number before actual updates are done.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

b7a9f29f

Btrfs: async threads should try harder to find work · b51912c9

Chris Mason authored Feb 04, 2009

Tracing shows the delay between when an async thread goes to sleep
and when more work is added is often very short.  This commit adds
a little bit of delay and extra checking to the code right before
we schedule out.

It allows more work to be added to the worker
without requiring notifications from other procs.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

b51912c9

Btrfs: selinux support · 0279b4cd

Jim Owens authored Feb 04, 2009

Add call to LSM security initialization and save
resulting security xattr for new inodes.

Add xattr support to symlink inode ops.

Set inode->i_op for existing special files.
Signed-off-by: jim owens <jowens@hp.com>

0279b4cd

Btrfs: make btrfs acls selectable · bef62ef3

Christian Hesse authored Feb 04, 2009

This patch adds a menu entry to kconfig to enable acls for btrfs.
This allows you to enable FS_POSIX_ACL at kernel compile time.

(updated by Jeff Mahoney to make the changes in fs/btrfs/Kconfig instead)
Signed-off-by: Christian Hesse <mail@earthworm.de>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>

bef62ef3

Btrfs: Catch missed bios in the async bio submission thread · a6837051

Chris Mason authored Feb 04, 2009

The async bio submission thread was missing some bios that were
added after it had decided there was no work left to do.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a6837051

28 Jan, 2009 1 commit

Btrfs: fix readdir on 32 bit machines · 89f135d8

Chris Mason authored Jan 28, 2009

After btrfs_readdir has gone through all the directory items, it
sets the directory f_pos to the largest possible int.  This way
applications that mix readdir with creating new files don't
end up in an endless loop finding the new directory items as they go.

It was a workaround for a bug in git, but the assumption was that if git
could make this looping mistake than it would be a common problem.

The largest possible int chosen was INT_LIMIT(typeof(file->f_pos),
and it is possible for that to be a larger number than 32 bit glibc
expects to come out of readdir.

This patches switches that to INT_LIMIT(off_t), which should keep
applications happy on 32 and 64 bit machines.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

89f135d8

29 Jan, 2009 1 commit
- Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable · e4f722fa
  Chris Mason authored Jan 28, 2009
```
Fix fs/btrfs/super.c conflict around #includes
```
  e4f722fa
28 Jan, 2009 8 commits

Linux 2.6.29-rc3 · 18e352e4
Linus Torvalds authored Jan 28, 2009

18e352e4

Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc · c4568d6c

Linus Torvalds authored Jan 28, 2009

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc/mm: Fix handling of _PAGE_COHERENT in BAT setup code
  powerpc/pseries: Correct VIO bus accounting problem in CMO env.
  powerpc: More printing warning fixes for the l64 to ll64 conversion
  powerpc: Remove arch/ppc cruft from Kconfig
  powerpc: Printing fix for l64 to ll64 conversion: phyp_dump.c
  powerpc/embedded6xx: Update defconfigs
  powerpc/8xx: Update defconfigs
  powerpc/86xx: Update defconfigs
  powerpc/83xx: Update defconfigs
  powerpc/85xx: Update defconfigs
  powerpc/mpc8313erdb: fix kernel panic because mdio device is not probed
  powerpc/4xx: Update multi-board PowerPC 4xx defconfigs
  powerpc/44x: Update PowerPC 44x defconfigs
  powerpc/40x: Update PowerPC 40x defconfigs
  powerpc/85xx: Fix typo in mpc8572ds dts
  powerpc/44x: Warp patches for the new NDFC driver
  powerpc/4xx: DTS: Add Add'l SDRAM0 Compatible and Interrupt Info

c4568d6c

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu · 78a768b6

Linus Torvalds authored Jan 28, 2009

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
  m68knommu: fix 5329 ColdFire periphal addressing
  uclinux: add process name to allocation error message
  m68knommu: correct the mii calculations for 532x ColdFire FEC
  m68knommu: add ColdFire M532x to the FEC configuration options
  m68knommu: fix syscall restarting
  m68knommu: remove the obsolete and long unused comempci chip support
  m68knommu: remove the no longer used PCI support option
  m68knommu: remove obsolete and unused eLIA board
  m68knommu: set NO_DMA
  m68knommu: fix cache flushing for the 527x ColdFire processors
  m68knommu: fix ColdFire 5272 serial baud rates in mcf.c
  m68knommu: use one exist from execption

78a768b6

dmi: Fix build breakage · d8204ee2

Kumar Gala authored Jan 28, 2009

Commit d7b1956f ("DMI: Introduce
dmi_first_match to make the interface more flexible") introduced compile
errors like the following when !CONFIG_DMI

drivers/ata/sata_sil.c: In function 'sil_broken_system_poweroff':
drivers/ata/sata_sil.c:713: error: implicit declaration of function 'dmi_first_match'
drivers/ata/sata_sil.c:713: warning: initialization makes pointer from integer without a cast

We just need a dummy version of dmi_first_match() to fix this all up.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d8204ee2

drm: Rip out the racy, unused vblank signal code. · 30b23634

Eric Anholt authored Jan 27, 2009

Schedule a vblank signal, kill the process, and we'll go walking over freed
memory.  Given that no open-source userland exists using this, nor have I
ever heard of a consumer, just let this code die.
Signed-off-by: Eric Anholt <eric@anholt.net>
Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Dave Airlie <airlied@linux.ie>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

30b23634

powerpc/mm: Fix handling of _PAGE_COHERENT in BAT setup code · 4c456a67

Gerhard Pircher authored Jan 23, 2009

_PAGE_COHERENT is now always set in _PAGE_RAM resp. PAGE_KERNEL.
Thus it has to be masked out, if the BAT mapping should be non
cacheable or CPU_FTR_NEED_COHERENT is not set.

This will work on normal SMP setups because we force-set
CPU_FTR_NEED_COHERENT as part of CPU_FTR_COMMON on SMP.
Signed-off-by: Gerhard Pircher <gerhard_pircher@gmx.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

4c456a67

powerpc/pseries: Correct VIO bus accounting problem in CMO env. · 69b052e8

Robert Jennings authored Jan 22, 2009

In the VIO bus code the wrappers for dma alloc_coherent and free_coherent
calls are rounding to IOMMU_PAGE_SIZE. Taking a look at the underlying
calls, the actual mapping is promoted to PAGE_SIZE. Changing the
rounding in these two functions fixes under-reporting the entitlement
used by the system. Without this change, the system could run out of
entitlement before it believes it has and incur mapping failures at the
firmware level.

Also in the VIO bus code, the wrapper for dma map_sg is not exiting in
an error path where it should. Rather than fall through to code for the
success case, this patch adds the return that is needed in the error path.
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

69b052e8

powerpc: More printing warning fixes for the l64 to ll64 conversion · 4712fff9

Stephen Rothwell authored Jan 21, 2009

These are all powerpc specific drivers.

res.start in fsl_elbc_nand.c needs to be cast since it may be either 32
or 64 bit.  Thanks to Scott Wood for noticing.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Arnd Bergmann <arnd@arndb.de> call_edac bits in particular
Acked-by: Olof Johansson <olof@lixom.net> pasemi_nand peices
Acked-by: Scott Wood <scottwood@freescale.com> fsl_elbc fixes
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

4712fff9