Commits · 5cee5815d1564bbbd505fea86f4550f1efdb5cd0 · nexedi / linux

12 Jun, 2009 24 commits

vfs: Make sys_sync() use fsync_super() (version 4) · 5cee5815

Jan Kara authored Apr 27, 2009

It is unnecessarily fragile to have two places (fsync_super() and do_sync())
doing data integrity sync of the filesystem. Alter __fsync_super() to
accommodate needs of both callers and use it. So after this patch
__fsync_super() is the only place where we gather all the calls needed to
properly send all data on a filesystem to disk.

A nice bonus is that we get complete livelock avoidance, and write_supers()
is now used only for periodic writeback of superblocks.

sync_blockdevs() introduced a couple of patches ago is gone now.

[build fixes folded]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5cee5815

vfs: Make __fsync_super() a static function (version 4) · 429479f0

Jan Kara authored Apr 27, 2009

__fsync_super() does the same thing as fsync_super(). So change the only
caller to use fsync_super() and make __fsync_super() static. This removes
an unnecessarily duplicated call to sync_blockdev() and prepares the ground
for the changes to __fsync_super() in the following patches.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

429479f0

vfs: Call ->sync_fs() even if s_dirt is 0 (version 4) · bfe88125

Jan Kara authored Apr 27, 2009

sync_filesystems() has a condition that if wait == 0 and s_dirt == 0, then
->sync_fs() isn't called. This does not really make much sense since s_dirt is
generally used by a filesystem to mean that ->write_super() needs to be called.
But ->sync_fs() does different things. I even suspect that some filesystems
(btrfs?) set s_dirt just to fool this logic.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

bfe88125
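
The condition described in this commit can be sketched in user-space C. This is a simplified model, not the kernel code: `struct toy_sb`, its `has_sync_fs` field, and both predicate names are assumptions made up for illustration, and any other checks the real function performs are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy superblock; the fields are illustrative assumptions. */
struct toy_sb {
    bool s_dirt;       /* filesystem asked for ->write_super() */
    bool has_sync_fs;  /* filesystem provides a ->sync_fs() method */
};

/* Old rule as described above: skip ->sync_fs() when !wait && !s_dirt. */
bool toy_old_should_sync_fs(const struct toy_sb *sb, bool wait)
{
    if (!sb->has_sync_fs)
        return false;
    return wait || sb->s_dirt;
}

/* New rule: s_dirt no longer influences whether ->sync_fs() is called. */
bool toy_new_should_sync_fs(const struct toy_sb *sb, bool wait)
{
    (void)wait;  /* the wait flag only changes how ->sync_fs() behaves */
    return sb->has_sync_fs;
}

/* Small usage example: a clean superblock on a non-waiting pass. */
void toy_sync_fs_demo(void)
{
    struct toy_sb clean = { .s_dirt = false, .has_sync_fs = true };

    assert(!toy_old_should_sync_fs(&clean, false)); /* old: skipped */
    assert(toy_new_should_sync_fs(&clean, false));  /* new: called */
    assert(toy_old_should_sync_fs(&clean, true));   /* waiting pass always ran */
}
```

The non-waiting pass over a clean superblock is the only case where the two predicates disagree, which is the case the commit message argues about.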

vfs: Fix sys_sync() and fsync_super() reliability (version 4) · 5a3e5cb8

Jan Kara authored Apr 27, 2009

So far, do_sync() called:
  sync_inodes(0);
  sync_supers();
  sync_filesystems(0);
  sync_filesystems(1);
  sync_inodes(1);

This ordering makes it kind of hard for filesystems as sync_inodes(0) need not
submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing
transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has
not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2
and others are hit by this) when racing e.g. with background writeback. A
similar problem also hits other filesystems (e.g. ext2) because of
write_supers() being called before the sync_inodes(1).

Change the ordering of calls in do_sync() - this requires a new function
sync_blockdevs() to preserve the property that block devices are always synced
after write_super() / sync_fs() call.

The same issue is fixed in __fsync_super() function used on umount /
remount read-only.

[AV: build fixes]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5a3e5cb8

remove s_async_list · 876a9f76

Christoph Hellwig authored Apr 28, 2009

Remove the unused s_async_list in the superblock, a leftover of the
broken async inode deletion code that leaked into mainline.  Having this
in the middle of the sync/unmount path is not helpful for the following
cleanups.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

876a9f76

fs: move mark_files_ro into file_table.c · 864d7c4c

npiggin@suse.de authored Apr 26, 2009

This function walks the s_files list, and operates primarily on the
files in a superblock, so it better belongs here (e.g. see also
fs_may_remount_ro).

[AV: ... and it shouldn't be static after that move]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

864d7c4c

fs: introduce mnt_clone_write · 96029c4e

npiggin@suse.de authored Apr 26, 2009

This patch speeds up lmbench lat_mmap test by about another 2% after the
first patch.

Before:
 avg = 462.286
 std = 5.46106

After:
 avg = 453.12
 std = 9.58257

(50 runs of each, stddev gives a reasonable confidence)

It does this by introducing mnt_clone_write, which avoids some heavyweight
operations of mnt_want_write if called on a vfsmount which we know already
has a write count; and mnt_want_write_file, which can call mnt_clone_write
if the file is open for write.

After these two patches, mnt_want_write and mnt_drop_write go from 7% on
the profile down to 1.3% (including mnt_clone_write).

[AV: mnt_want_write_file() should take file alone and derive mnt from it;
not only all callers have that form, but that's the only mnt about which
we know that it's already held for write if file is opened for write]

Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

96029c4e
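
The fast-path idea in this commit can be sketched as a user-space model. All `toy_` names and fields are hypothetical, and the "heavyweight" work of the full path is abstracted into a simple counter; the real kernel functions do more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; every field is an assumption. */
struct toy_mount {
    bool readonly;     /* mount is read-only */
    int  writers;      /* write references currently held */
    int  slow_checks;  /* how many times the full path ran */
};

struct toy_file {
    struct toy_mount *mnt;
    bool opened_for_write;  /* an open-for-write file already holds a write count */
};

/* Full path: does the expensive checks before taking a reference. */
int toy_mnt_want_write(struct toy_mount *mnt)
{
    mnt->slow_checks++;  /* stands in for the heavyweight operations */
    if (mnt->readonly)
        return -1;       /* the kernel would return an errno such as -EROFS */
    mnt->writers++;
    return 0;
}

/* Fast path: only valid when the caller knows a write count is already held. */
int toy_mnt_clone_write(struct toy_mount *mnt)
{
    assert(mnt->writers > 0);
    mnt->writers++;
    return 0;
}

/* Takes the file alone and derives the mount from it. */
int toy_mnt_want_write_file(struct toy_file *file)
{
    if (file->opened_for_write)
        return toy_mnt_clone_write(file->mnt);
    return toy_mnt_want_write(file->mnt);
}

void toy_mnt_drop_write(struct toy_mount *mnt)
{
    assert(mnt->writers > 0);
    mnt->writers--;
}
```

In this model a file that is open for write never reaches the full path again, which mirrors why the helper takes the file alone: the file's own mount is the only one known to be held for write.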

fs: mnt_want_write speedup · d3ef3d73

npiggin@suse.de authored Apr 26, 2009

This patch speeds up lmbench lat_mmap test by about 8%. lat_mmap is set up
basically to mmap a 64MB file on tmpfs, fault in its pages, then unmap it.
A microbenchmark, yes, but it exercises some important paths in the mm.

Before:
 avg = 501.9
 std = 14.7773

After:
 avg = 462.286
 std = 5.46106

(50 runs of each, stddev gives a reasonable confidence, but there is quite
a bit of variation there still)

It does this by removing the complex per-cpu locking and counter-cache and
replacing them with a percpu counter in struct vfsmount. This makes the code
much simpler, and avoids spinlocks (although the msync is still pretty
costly, unfortunately). It results in about 900 bytes smaller code too. It
does increase the size of a vfsmount, however.

It should also give a speedup on large systems if CPUs are frequently operating
on different mounts (because the existing scheme has to operate on an atomic in
the struct vfsmount when switching between mounts). But I'm most interested in
the single threaded path performance for the moment.

[AV: minor cleanup]

Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d3ef3d73
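
A per-CPU counter embedded in the mount structure can be modelled roughly as follows. This is a single-threaded user-space sketch: `TOY_NR_CPUS`, `struct toy_vfsmount`, and the function names are assumptions, and the CPU index is passed explicitly instead of being read from the scheduler.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 4  /* assumed CPU count for the sketch */

/* One counter slot per CPU; this is why the structure grows. */
struct toy_vfsmount {
    long writers[TOY_NR_CPUS];
};

/* Fast path: touch only the calling CPU's slot, no lock and no shared atomic. */
void toy_inc_writers(struct toy_vfsmount *mnt, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    mnt->writers[cpu]++;
}

void toy_dec_writers(struct toy_vfsmount *mnt, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    mnt->writers[cpu]--;
}

/*
 * Slow path: sum every slot. A single slot can go negative when a
 * reference is taken on one CPU and dropped on another; only the sum
 * is meaningful.
 */
long toy_count_writers(const struct toy_vfsmount *mnt)
{
    long sum = 0;

    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        sum += mnt->writers[cpu];
    return sum;
}
```

The trade-off matches the commit message: increments and decrements become cheap, while reading the total costs a walk over all slots and each mount carries one slot per CPU.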

Move junk from proc_fs.h to fs/proc/internal.h · 3174c21b

Al Viro authored Apr 07, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3174c21b

switch lookup_mnt() · 1c755af4

Al Viro authored Apr 18, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1c755af4

switch follow_mount() · 79ed0226

Al Viro authored Apr 18, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

79ed0226

switch follow_down() · 9393bd07

Al Viro authored Apr 18, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9393bd07

Switch collect_mounts() to struct path · 589ff870

Al Viro authored Apr 18, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

589ff870

switch follow_up() to struct path · bab77ebf

Al Viro authored Apr 18, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

bab77ebf

switch rqst_exp_parent() · e64c390c

Al Viro authored Apr 18, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

e64c390c

switch rqst_exp_get_by_name() · 91c9fa8f

Al Viro authored Apr 18, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

91c9fa8f

switch exp_parent() to struct path · 5bf3bd2b

Al Viro authored Apr 18, 2009

... and lose the always-NULL last argument (non-NULL case had been
split off a while ago).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5bf3bd2b

nfsd struct path use: exp_get_by_name() · 55430e2e

Al Viro authored Apr 18, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

55430e2e

Don't bother with check_mnt() in do_add_mount() on shrinkable ones · dd5cae6e

Al Viro authored Apr 07, 2009

These guys are what we add as submounts; checks for "is that attached in
our namespace" are simply irrelevant for those and counterproductive for
use of private vfsmount trees a-la what NFS folks want.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

dd5cae6e

Make vfs_path_lookup() use starting point as root · 5b857119

Al Viro authored Apr 07, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5b857119

Cache root in nameidata · 2a737871

Al Viro authored Apr 07, 2009

New field: nd->root. When pathname resolution wants to know the root,
check if nd->root.mnt is non-NULL; use nd->root if it is, otherwise
copy current->fs->root there. After path_walk() is finished, we check
if we'd got a cached value in nd->root and drop it. Before calling
path_walk() we should either set nd->root.mnt to NULL *or* copy (and
pin down) some path to nd->root. In the latter case we won't be
looking at current->fs->root at all.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2a737871
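
The caching protocol described in this commit can be sketched in user-space C. The `toy_` types and helpers are hypothetical stand-ins for nameidata, struct path, and current->fs; reference counting is reduced to a plain integer and locking is ignored.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mount { int refcount; };
struct toy_path { struct toy_mount *mnt; };

/* Stand-in for current->fs; only the root is modelled. */
struct toy_fs { struct toy_path root; };

/* Stand-in for nameidata; root.mnt == NULL means "nothing cached yet". */
struct toy_nameidata { struct toy_path root; };

void toy_path_get(struct toy_path *p) { p->mnt->refcount++; }

void toy_path_put(struct toy_path *p)
{
    assert(p->mnt->refcount > 0);
    p->mnt->refcount--;
}

/* Called whenever pathname resolution wants to know the root. */
struct toy_path *toy_get_root(struct toy_nameidata *nd, struct toy_fs *fs)
{
    if (nd->root.mnt == NULL) {
        nd->root = fs->root;      /* copy the process root ... */
        toy_path_get(&nd->root);  /* ... and pin it down */
    }
    return &nd->root;
}

/* Called after the walk finishes: drop the cached root if there is one. */
void toy_put_root(struct toy_nameidata *nd)
{
    if (nd->root.mnt != NULL) {
        toy_path_put(&nd->root);
        nd->root.mnt = NULL;
    }
}
```

A caller that wants a different root would fill in `nd->root` and take its own reference before the walk; in that case `toy_get_root()` never looks at the process root, matching the "latter case" in the message.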

Preparations to caching root in path_walk() · 9b4a9b14

Al Viro authored Apr 07, 2009

Split do_path_lookup(), opencode the call from do_filp_open().
do_filp_open() is the only caller of do_path_lookup() that
cares about root afterwards (it keeps resolving symlinks on
O_CREAT path after it'd done LOOKUP_PARENT walk).  So when
we start caching fs->root in path_walk(), it'll need a different
treatment.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9b4a9b14

Get rid of path_lookup in autofs4 · 4e44b685

Al Viro authored Apr 07, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4e44b685

reiserfs: allow exposing privroot w/ xattrs enabled · 73422811

Jeff Mahoney authored May 10, 2009

This patch adds an -oexpose_privroot option to allow access to the privroot.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

73422811

11 Jun, 2009 16 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable · a525890c

Linus Torvalds authored Jun 11, 2009

* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (23 commits)
  Btrfs: fix extent_buffer leak during tree log replay
  Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir
  Btrfs: fix -o nodatasum printk spelling
  Btrfs: check duplicate backrefs for both data and metadata
  Btrfs: init worker struct fields before kthread-run
  Btrfs: pin buffers during write_dev_supers
  Btrfs: avoid races between super writeout and device list updates
  Fix btrfs when ACLs are configured out
  Btrfs: fdatasync should skip metadata writeout
  Btrfs: remove crc32c.h and use libcrc32c directly.
  Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
  Btrfs: autodetect SSD devices
  Btrfs: add mount -o ssd_spread to spread allocations out
  Btrfs: avoid allocation clusters that are too spread out
  Btrfs: Add mount -o nossd
  Btrfs: avoid IO stalls behind congested devices in a multi-device FS
  Btrfs: don't allow WRITE_SYNC bios to starve out regular writes
  Btrfs: fix metadata dirty throttling limits
  Btrfs: reduce mount -o ssd CPU usage
  Btrfs: balance btree more often
  ...

a525890c

Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify · 3bb66d7f

Linus Torvalds authored Jun 11, 2009

* 'for-linus' of git://git.infradead.org/users/eparis/notify:
  fsnotify: allow groups to set freeing_mark to null
  inotify/dnotify: should_send_event shouldn't match on FS_EVENT_ON_CHILD
  dnotify: do not bother to lock entry->lock when reading mask
  dnotify: do not use ?true:false when assigning to a bool
  fsnotify: move events should indicate the event was on a child
  inotify: reimplement inotify using fsnotify
  fsnotify: handle filesystem unmounts with fsnotify marks
  fsnotify: fsnotify marks on inodes pin them in core
  fsnotify: allow groups to add private data to events
  fsnotify: add correlations between events
  fsnotify: include pathnames with entries when possible
  fsnotify: generic notification queue and waitq
  dnotify: reimplement dnotify using fsnotify
  fsnotify: parent event notification
  fsnotify: add marks to inodes so groups can interpret how to handle those inodes
  fsnotify: unified filesystem notification backend

3bb66d7f

Merge branch 'for-linus' of git://linux-arm.org/linux-2.6 · 512626a0

Linus Torvalds authored Jun 11, 2009

* 'for-linus' of git://linux-arm.org/linux-2.6:
  kmemleak: Add the corresponding MAINTAINERS entry
  kmemleak: Simple testing module for kmemleak
  kmemleak: Enable the building of the memory leak detector
  kmemleak: Remove some of the kmemleak false positives
  kmemleak: Add modules support
  kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash
  kmemleak: Add the vmalloc memory allocation/freeing hooks
  kmemleak: Add the slub memory allocation/freeing hooks
  kmemleak: Add the slob memory allocation/freeing hooks
  kmemleak: Add the slab memory allocation/freeing hooks
  kmemleak: Add documentation on the memory leak detector
  kmemleak: Add the base support

Manual conflict resolution (with the slab/earlyboot changes) in:
	drivers/char/vt.c
	init/main.c
	mm/slab.c

512626a0

Merge branch 'perfcounters-for-linus' of... · 8a1ca8ce

Linus Torvalds authored Jun 11, 2009

Merge branch 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
  perf_counter: Turn off by default
  perf_counter: Add counter->id to the throttle event
  perf_counter: Better align code
  perf_counter: Rename L2 to LL cache
  perf_counter: Standardize event names
  perf_counter: Rename enums
  perf_counter tools: Clean up u64 usage
  perf_counter: Rename perf_counter_limit sysctl
  perf_counter: More paranoia settings
  perf_counter: powerpc: Implement generalized cache events for POWER processors
  perf_counters: powerpc: Add support for POWER7 processors
  perf_counter: Accurate period data
  perf_counter: Introduce struct for sample data
  perf_counter tools: Normalize data using per sample period data
  perf_counter: Annotate exit ctx recursion
  perf_counter tools: Propagate signals properly
  perf_counter tools: Small frequency related fixes
  perf_counter: More aggressive frequency adjustment
  perf_counter/x86: Fix the model number of Intel Core2 processors
  perf_counter, x86: Correct some event and umask values for Intel processors
  ...

8a1ca8ce

Merge branch 'topic/slab/earlyboot' of... · b640f042

Linus Torvalds authored Jun 11, 2009

Merge branch 'topic/slab/earlyboot' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6

* 'topic/slab/earlyboot' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  vgacon: use slab allocator instead of the bootmem allocator
  irq: use kcalloc() instead of the bootmem allocator
  sched: use slab in cpupri_init()
  sched: use alloc_cpumask_var() instead of alloc_bootmem_cpumask_var()
  memcg: don't use bootmem allocator in setup code
  irq/cpumask: make memoryless node zero happy
  x86: remove some alloc_bootmem_cpumask_var calling
  vt: use kzalloc() instead of the bootmem allocator
  sched: use kzalloc() instead of the bootmem allocator
  init: introduce mm_init()
  vmalloc: use kzalloc() instead of alloc_bootmem()
  slab: setup allocators earlier in the boot sequence
  bootmem: fix slab fallback on numa
  bootmem: use slab if bootmem is no longer available

b640f042

fsnotify: allow groups to set freeing_mark to null · a092ee20

Eric Paris authored Jun 11, 2009

Most fsnotify listeners (all but inotify) do not care about marks being
freed.  Allow groups to set freeing_mark to null and do not call any
function if it is set that way.
Signed-off-by: Eric Paris <eparis@redhat.com>

a092ee20
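
An optional callback of this kind is usually handled with a NULL check at the call site. The sketch below assumes a made-up ops table (`struct toy_ops`) and helper names; it only illustrates the pattern, not the fsnotify structures themselves.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mark { int id; };

/* Hypothetical per-group operations table with an optional callback. */
struct toy_ops {
    void (*freeing_mark)(struct toy_mark *mark);  /* may be NULL */
};

struct toy_group { const struct toy_ops *ops; };

/* Example callback for a listener that does care about freed marks. */
int toy_freed_count;

void toy_count_freeing(struct toy_mark *mark)
{
    (void)mark;
    toy_freed_count++;
}

/* Returns 1 if the group's callback ran, 0 if the group opted out. */
int toy_notify_freeing(const struct toy_group *group, struct toy_mark *mark)
{
    assert(group->ops != NULL);
    if (group->ops->freeing_mark == NULL)
        return 0;  /* listener does not care: call nothing */
    group->ops->freeing_mark(mark);
    return 1;
}
```

Groups that do not care simply leave the pointer unset instead of supplying an empty stub function.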

inotify/dnotify: should_send_event shouldn't match on FS_EVENT_ON_CHILD · e42e2773

Eric Paris authored Jun 11, 2009

inotify and dnotify will both indicate that they want any event which came
from a child inode. The fix is to mask off FS_EVENT_ON_CHILD when deciding
if inotify or dnotify is interested in a given event.
Signed-off-by: Eric Paris <eparis@redhat.com>

e42e2773
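
Why the flag has to be masked off can be shown with a small bitmask model. The `TOY_` constants and function names are assumptions chosen for illustration and should not be read as the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values; treat them as assumptions, not kernel ABI. */
#define TOY_FS_ACCESS         0x00000001u
#define TOY_FS_MODIFY         0x00000002u
#define TOY_FS_EVENT_ON_CHILD 0x08000000u

/* Naive test: any shared bit counts, including the "on child" flag. */
bool toy_should_send_unmasked(uint32_t watch_mask, uint32_t event_mask)
{
    return (watch_mask & event_mask) != 0;
}

/* Fixed test: ignore the "on child" flag when comparing event types. */
bool toy_should_send_masked(uint32_t watch_mask, uint32_t event_mask)
{
    event_mask &= ~TOY_FS_EVENT_ON_CHILD;
    return (watch_mask & event_mask) != 0;
}

/* Usage example: a watch for MODIFY receives an ACCESS event from a child. */
void toy_mask_demo(void)
{
    uint32_t watch = TOY_FS_MODIFY | TOY_FS_EVENT_ON_CHILD;
    uint32_t event = TOY_FS_ACCESS | TOY_FS_EVENT_ON_CHILD;

    assert(toy_should_send_unmasked(watch, event));  /* false positive */
    assert(!toy_should_send_masked(watch, event));   /* correctly ignored */
    assert(toy_should_send_masked(watch,
                                  TOY_FS_MODIFY | TOY_FS_EVENT_ON_CHILD));
}
```

Without the masking, the shared flag bit alone makes every child event look interesting to a listener that asked for child events.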

dnotify: do not bother to lock entry->lock when reading mask · ce61856b

Eric Paris authored Jun 11, 2009

entry->lock is needed to make sure entry->mask does not change while
manipulating it.  In dnotify_should_send_event() we don't care if we get an
old or a new mask value out of this entry so there is no point in taking
the lock.
Signed-off-by: Eric Paris <eparis@redhat.com>

ce61856b

dnotify: do not use ?true:false when assigning to a bool · 5ac697b7

Eric Paris authored Jun 11, 2009

dnotify_should_send_event() assigned a bool using ?true:false when computing
a bit operation. This is pointless and the bool type does this for us.
Signed-off-by: Eric Paris <eparis@redhat.com>

5ac697b7
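
The point about bool can be checked with a short C99 sketch. The function names are made up for illustration; the behaviour relied on is that converting any nonzero value to bool (C99 _Bool) yields true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Redundant form: the ternary only repeats what the conversion does. */
bool toy_send_verbose(uint32_t entry_mask, uint32_t event_mask)
{
    bool send = (entry_mask & event_mask) ? true : false;
    return send;
}

/* Plain form: assigning to bool converts any nonzero value to true. */
bool toy_send_plain(uint32_t entry_mask, uint32_t event_mask)
{
    bool send = (entry_mask & event_mask);
    return send;
}

/* Usage example: both forms agree, even when only a high bit matches. */
void toy_bool_demo(void)
{
    assert(toy_send_plain(0x100u, 0x100u) == true);
    assert(toy_send_plain(0x100u, 0x001u) == false);
    assert(toy_send_plain(0x80000000u, 0x80000000u) ==
           toy_send_verbose(0x80000000u, 0x80000000u));
}
```

The same shortcut would not be safe with a narrow integer type such as unsigned char, where high bits are truncated rather than collapsed to 1.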

fsnotify: move events should indicate the event was on a child · ff52cc21

Eric Paris authored Jun 11, 2009

fsnotify tells its listeners explicitly when an event happened on the given
inode versus on the child of the given inode (see __fsnotify_parent()).
However, the semantics of fsnotify_move() are such that we deliver events
directly to the two parent directories in question (old_dir and new_dir)
without using the __fsnotify_parent() call. fsnotify should be
adding FS_EVENT_ON_CHILD for the notifications to these parents.
Signed-off-by: Eric Paris <eparis@redhat.com>

ff52cc21

inotify: reimplement inotify using fsnotify · 63c882a0