Commit 1b3d7c93 authored by Alexander Viro's avatar Alexander Viro Committed by Linus Torvalds

[PATCH] (2.5.4) death of ->i_zombie

	Rediffed to 2.5.4, documentation added.  This variant grabs
->s_vfs_rename_sem only for cross-directory renames.
parent 9c73428c
...@@ -48,28 +48,30 @@ prototypes: ...@@ -48,28 +48,30 @@ prototypes:
locking rules: locking rules:
all may block all may block
BKL i_sem(inode) i_zombie(inode) BKL i_sem(inode)
lookup: yes yes no lookup: yes yes
create: yes yes yes create: yes yes
link: yes yes yes link: yes yes
mknod: yes yes yes mknod: yes yes
mkdir: yes yes yes mkdir: yes yes
unlink: yes yes yes unlink: yes yes (both)
rmdir: yes yes yes (see below) rmdir: yes yes (both) (see below)
rename: yes yes (both) yes (both) (see below) rename: yes yes (all) (see below)
readlink: no no no readlink: no no
follow_link: no no no follow_link: no no
truncate: yes yes no (see below) truncate: yes yes (see below)
setattr: yes if ATTR_SIZE no setattr: yes if ATTR_SIZE
permssion: yes no no permssion: yes no
getattr: (see below) getattr: (see below)
revalidate: no (see below) revalidate: no (see below)
Additionally, ->rmdir() has i_zombie on victim and so does ->rename() setxattr: DOCUMENT_ME
in case when target exists and is a directory. getxattr: DOCUMENT_ME
->rename() on directories has (per-superblock) ->s_vfs_rename_sem. removexattr: DOCUMENT_ME
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on
victim.
cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
->revalidate(), it may be called both with and without the i_sem ->revalidate(), it may be called both with and without the i_sem
on dentry->d_inode. VFS never calls it with i_zombie on dentry->d_inode, on dentry->d_inode.
but watch for other methods directly calling this one...
->truncate() is never called directly - it's a callback, not a ->truncate() is never called directly - it's a callback, not a
method. It's called by vmtruncate() - library function normally used by method. It's called by vmtruncate() - library function normally used by
->setattr(). Locking information above applies to that call (i.e. is ->setattr(). Locking information above applies to that call (i.e. is
...@@ -77,6 +79,9 @@ inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been ...@@ -77,6 +79,9 @@ inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
passed). passed).
->getattr() is currently unused. ->getattr() is currently unused.
See Documentation/filesystems/directory-locking for more detailed discussion
of the locking scheme for directory operations.
--------------------------- super_operations --------------------------- --------------------------- super_operations ---------------------------
prototypes: prototypes:
void (*read_inode) (struct inode *); void (*read_inode) (struct inode *);
......
Locking scheme used for directory operations is based on two
kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem).
For our purposes all operations fall in 5 classes:
1) read access. Locking rules: caller locks directory we are accessing.
2) object creation. Locking rules: same as above.
3) object removal. Locking rules: caller locks parent, finds victim,
locks victim and calls the method.
4) rename() that is _not_ cross-directory. Locking rules: caller locks
the parent, finds source and target, if target already exists - locks it
and then calls the method.
5) cross-directory rename. The trickiest in the whole bunch. Locking
rules:
* lock the filesystem
* lock parents in "ancestors first" order.
* find source and target.
* if old parent is equal to or is a descendent of target
fail with -ENOTEMPTY
* if new parent is equal to or is a descendent of source
fail with -ELOOP
* if target exists - lock it.
* call the method.
The rules above obviously guarantee that all directories that are going to be
read, modified or removed by method will be locked by caller.
If no directory is its own ancestor, the scheme above is deadlock-free.
Proof:
First of all, at any moment we have a partial ordering of the
objects - A < B iff A is an ancestor of B.
That ordering can change. However, the following is true:
(1) if operation different from cross-directory rename holds lock on A and
attempts to acquire lock on B, A will remain the parent of B until we
acquire the lock on B. (Proof: only cross-directory rename can change
the parent of object and it would have to lock the parent).
(2) if cross-directory rename holds the lock on filesystem, order will not
change until rename acquires all locks. (Proof: other cross-directory
renames will be blocked on filesystem lock and we don't start changing
the order until we had acquired all locks).
Now consider the minimal deadlock. Each process is blocked on
attempt to acquire some lock and already holds at least one lock. Let's
consider the set of contended locks. First of all, filesystem lock is
not contended, since any process blocked on it is not holding any locks.
Thus all processes are blocked on ->i_sem.
Any contended object is either held by cross-directory rename or
has a child that is also contended. Indeed, suppose that it is held by
operation other than cross-directory rename. Then the lock this operation
is blocked on belongs to child of that object due to (1).
It means that one of the operations is cross-directory rename.
Otherwise the set of contended objects would be infinite - each of them
would have a contended child and we had assumed that no object is its
own descendent. Moreover, there is exactly one cross-directory rename
(see above).
Consider the object blocking the cross-directory rename. One of
its descendents is locked by cross-directory rename (otherwise we would again
have an infinite set of of contended objects). But that means that means
that cross-directory rename is taking locks out of order. Due to (2) the
order hadn't changed since we had acquired filesystem lock. But locking
rules for cross-directory rename guarantee that we do not try to acquire
lock on descendent before the lock on ancestor. Contradiction. I.e.
deadlock is impossible. Q.E.D.
These operations are guaranteed to avoid loop creation. Indeed,
the only operation that could introduce loops is cross-directory rename.
Since the only new (parent, child) pair added by rename() is (new parent,
source), such loop would have to contain these objects and the rest of it
would have to exist before rename(). I.e. at the moment of loop creation
rename() responsible for that would be holding filesystem lock and new parent
would have to be equal to or a descendent of source. But that means that
new parent had been equal to or a descendent of source since the moment when
we had acquired filesystem lock and rename() would fail with -ELOOP in that
case.
While this locking scheme works for arbitrary DAGs, it relies on
ability to check that directory is a descendent of another object. Current
implementation assumes that directory graph is a tree. This assumption is
also preserved by all operations (cross-directory rename on a tree that would
not introduce a cycle will leave it a tree and link() fails for directories).
Notice that "directory" in the above == "anything that might have
children", so if we are going to introduce hybrid objects we will need
either to make sure that link(2) doesn't work for them or to make changes
in is_subdir() that would make it work even in presense of such beasts.
...@@ -472,11 +472,9 @@ static ssize_t bm_entry_write(struct file *file, const char *buffer, ...@@ -472,11 +472,9 @@ static ssize_t bm_entry_write(struct file *file, const char *buffer,
break; break;
case 3: root = dget(file->f_vfsmnt->mnt_sb->s_root); case 3: root = dget(file->f_vfsmnt->mnt_sb->s_root);
down(&root->d_inode->i_sem); down(&root->d_inode->i_sem);
down(&root->d_inode->i_zombie);
kill_node(e); kill_node(e);
up(&root->d_inode->i_zombie);
up(&root->d_inode->i_sem); up(&root->d_inode->i_sem);
dput(root); dput(root);
break; break;
...@@ -516,8 +514,6 @@ static ssize_t bm_register_write(struct file *file, const char *buffer, ...@@ -516,8 +514,6 @@ static ssize_t bm_register_write(struct file *file, const char *buffer,
if (IS_ERR(dentry)) if (IS_ERR(dentry))
goto out; goto out;
down(&root->d_inode->i_zombie);
err = -EEXIST; err = -EEXIST;
if (dentry->d_inode) if (dentry->d_inode)
goto out2; goto out2;
...@@ -556,7 +552,6 @@ static ssize_t bm_register_write(struct file *file, const char *buffer, ...@@ -556,7 +552,6 @@ static ssize_t bm_register_write(struct file *file, const char *buffer,
mntput(mnt); mntput(mnt);
err = 0; err = 0;
out2: out2:
up(&root->d_inode->i_zombie);
dput(dentry); dput(dentry);
out: out:
up(&root->d_inode->i_sem); up(&root->d_inode->i_sem);
...@@ -605,12 +600,10 @@ static ssize_t bm_status_write(struct file * file, const char * buffer, ...@@ -605,12 +600,10 @@ static ssize_t bm_status_write(struct file * file, const char * buffer,
case 2: enabled = 1; break; case 2: enabled = 1; break;
case 3: root = dget(file->f_vfsmnt->mnt_sb->s_root); case 3: root = dget(file->f_vfsmnt->mnt_sb->s_root);
down(&root->d_inode->i_sem); down(&root->d_inode->i_sem);
down(&root->d_inode->i_zombie);
while (!list_empty(&entries)) while (!list_empty(&entries))
kill_node(list_entry(entries.next, Node, list)); kill_node(list_entry(entries.next, Node, list));
up(&root->d_inode->i_zombie);
up(&root->d_inode->i_sem); up(&root->d_inode->i_sem);
dput(root); dput(root);
default: return res; default: return res;
......
...@@ -143,7 +143,6 @@ void inode_init_once(struct inode *inode) ...@@ -143,7 +143,6 @@ void inode_init_once(struct inode *inode)
INIT_LIST_HEAD(&inode->i_dirty_data_buffers); INIT_LIST_HEAD(&inode->i_dirty_data_buffers);
INIT_LIST_HEAD(&inode->i_devices); INIT_LIST_HEAD(&inode->i_devices);
sema_init(&inode->i_sem, 1); sema_init(&inode->i_sem, 1);
sema_init(&inode->i_zombie, 1);
spin_lock_init(&inode->i_data.i_shared_lock); spin_lock_init(&inode->i_data.i_shared_lock);
} }
......
This diff is collapsed.
...@@ -466,7 +466,7 @@ static int graft_tree(struct vfsmount *mnt, struct nameidata *nd) ...@@ -466,7 +466,7 @@ static int graft_tree(struct vfsmount *mnt, struct nameidata *nd)
return -ENOTDIR; return -ENOTDIR;
err = -ENOENT; err = -ENOENT;
down(&nd->dentry->d_inode->i_zombie); down(&nd->dentry->d_inode->i_sem);
if (IS_DEADDIR(nd->dentry->d_inode)) if (IS_DEADDIR(nd->dentry->d_inode))
goto out_unlock; goto out_unlock;
...@@ -481,7 +481,7 @@ static int graft_tree(struct vfsmount *mnt, struct nameidata *nd) ...@@ -481,7 +481,7 @@ static int graft_tree(struct vfsmount *mnt, struct nameidata *nd)
} }
spin_unlock(&dcache_lock); spin_unlock(&dcache_lock);
out_unlock: out_unlock:
up(&nd->dentry->d_inode->i_zombie); up(&nd->dentry->d_inode->i_sem);
return err; return err;
} }
...@@ -577,7 +577,7 @@ static int do_move_mount(struct nameidata *nd, char *old_name) ...@@ -577,7 +577,7 @@ static int do_move_mount(struct nameidata *nd, char *old_name)
goto out; goto out;
err = -ENOENT; err = -ENOENT;
down(&nd->dentry->d_inode->i_zombie); down(&nd->dentry->d_inode->i_sem);
if (IS_DEADDIR(nd->dentry->d_inode)) if (IS_DEADDIR(nd->dentry->d_inode))
goto out1; goto out1;
...@@ -607,7 +607,7 @@ static int do_move_mount(struct nameidata *nd, char *old_name) ...@@ -607,7 +607,7 @@ static int do_move_mount(struct nameidata *nd, char *old_name)
out2: out2:
spin_unlock(&dcache_lock); spin_unlock(&dcache_lock);
out1: out1:
up(&nd->dentry->d_inode->i_zombie); up(&nd->dentry->d_inode->i_sem);
out: out:
up_write(&current->namespace->sem); up_write(&current->namespace->sem);
if (!err) if (!err)
...@@ -949,7 +949,7 @@ asmlinkage long sys_pivot_root(const char *new_root, const char *put_old) ...@@ -949,7 +949,7 @@ asmlinkage long sys_pivot_root(const char *new_root, const char *put_old)
user_nd.dentry = dget(current->fs->root); user_nd.dentry = dget(current->fs->root);
read_unlock(&current->fs->lock); read_unlock(&current->fs->lock);
down_write(&current->namespace->sem); down_write(&current->namespace->sem);
down(&old_nd.dentry->d_inode->i_zombie); down(&old_nd.dentry->d_inode->i_sem);
error = -EINVAL; error = -EINVAL;
if (!check_mnt(user_nd.mnt)) if (!check_mnt(user_nd.mnt))
goto out2; goto out2;
...@@ -992,7 +992,7 @@ asmlinkage long sys_pivot_root(const char *new_root, const char *put_old) ...@@ -992,7 +992,7 @@ asmlinkage long sys_pivot_root(const char *new_root, const char *put_old)
path_release(&root_parent); path_release(&root_parent);
path_release(&parent_nd); path_release(&parent_nd);
out2: out2:
up(&old_nd.dentry->d_inode->i_zombie); up(&old_nd.dentry->d_inode->i_sem);
up_write(&current->namespace->sem); up_write(&current->namespace->sem);
path_release(&user_nd); path_release(&user_nd);
path_release(&old_nd); path_release(&old_nd);
......
...@@ -1226,7 +1226,7 @@ int ...@@ -1226,7 +1226,7 @@ int
nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen, nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
struct svc_fh *tfhp, char *tname, int tlen) struct svc_fh *tfhp, char *tname, int tlen)
{ {
struct dentry *fdentry, *tdentry, *odentry, *ndentry; struct dentry *fdentry, *tdentry, *odentry, *ndentry, *trap;
struct inode *fdir, *tdir; struct inode *fdir, *tdir;
int err; int err;
...@@ -1253,7 +1253,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen, ...@@ -1253,7 +1253,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
/* cannot use fh_lock as we need deadlock protective ordering /* cannot use fh_lock as we need deadlock protective ordering
* so do it by hand */ * so do it by hand */
double_down(&tdir->i_sem, &fdir->i_sem); trap = lock_rename(tdentry, fdentry);
ffhp->fh_locked = tfhp->fh_locked = 1; ffhp->fh_locked = tfhp->fh_locked = 1;
fill_pre_wcc(ffhp); fill_pre_wcc(ffhp);
fill_pre_wcc(tfhp); fill_pre_wcc(tfhp);
...@@ -1266,12 +1266,17 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen, ...@@ -1266,12 +1266,17 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
err = -ENOENT; err = -ENOENT;
if (!odentry->d_inode) if (!odentry->d_inode)
goto out_dput_old; goto out_dput_old;
err = -EINVAL;
if (odentry == trap)
goto out_dput_old;
ndentry = lookup_one_len(tname, tdentry, tlen); ndentry = lookup_one_len(tname, tdentry, tlen);
err = PTR_ERR(ndentry); err = PTR_ERR(ndentry);
if (IS_ERR(ndentry)) if (IS_ERR(ndentry))
goto out_dput_old; goto out_dput_old;
err = -ENOTEMPTY;
if (ndentry == trap)
goto out_dput_new;
#ifdef MSNFS #ifdef MSNFS
if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) && if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
...@@ -1287,6 +1292,8 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen, ...@@ -1287,6 +1292,8 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
} }
dput(ndentry); dput(ndentry);
out_dput_new:
dput(ndentry);
out_dput_old: out_dput_old:
dput(odentry); dput(odentry);
out_nfserr: out_nfserr:
...@@ -1299,7 +1306,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen, ...@@ -1299,7 +1306,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
*/ */
fill_post_wcc(ffhp); fill_post_wcc(ffhp);
fill_post_wcc(tfhp); fill_post_wcc(tfhp);
double_up(&tdir->i_sem, &fdir->i_sem); unlock_rename(tdentry, fdentry);
ffhp->fh_locked = tfhp->fh_locked = 0; ffhp->fh_locked = tfhp->fh_locked = 0;
out: out:
......
...@@ -21,14 +21,12 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf) ...@@ -21,14 +21,12 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf)
if (!file->f_op || !file->f_op->readdir) if (!file->f_op || !file->f_op->readdir)
goto out; goto out;
down(&inode->i_sem); down(&inode->i_sem);
down(&inode->i_zombie);
res = -ENOENT; res = -ENOENT;
if (!IS_DEADDIR(inode)) { if (!IS_DEADDIR(inode)) {
lock_kernel(); lock_kernel();
res = file->f_op->readdir(file, buf, filler); res = file->f_op->readdir(file, buf, filler);
unlock_kernel(); unlock_kernel();
} }
up(&inode->i_zombie);
up(&inode->i_sem); up(&inode->i_sem);
out: out:
return res; return res;
......
...@@ -425,7 +425,6 @@ struct inode { ...@@ -425,7 +425,6 @@ struct inode {
unsigned long i_blocks; unsigned long i_blocks;
unsigned long i_version; unsigned long i_version;
struct semaphore i_sem; struct semaphore i_sem;
struct semaphore i_zombie;
struct inode_operations *i_op; struct inode_operations *i_op;
struct file_operations *i_fop; /* former ->i_op->default_file_ops */ struct file_operations *i_fop; /* former ->i_op->default_file_ops */
struct super_block *i_sb; struct super_block *i_sb;
...@@ -759,6 +758,9 @@ extern int vfs_rmdir(struct inode *, struct dentry *); ...@@ -759,6 +758,9 @@ extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *); extern int vfs_unlink(struct inode *, struct dentry *);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *); extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
extern struct dentry *lock_rename(struct dentry *, struct dentry *);
extern void unlock_rename(struct dentry *, struct dentry *);
/* /*
* File types * File types
*/ */
...@@ -1505,131 +1507,6 @@ extern int generic_osync_inode(struct inode *, int); ...@@ -1505,131 +1507,6 @@ extern int generic_osync_inode(struct inode *, int);
extern int inode_change_ok(struct inode *, struct iattr *); extern int inode_change_ok(struct inode *, struct iattr *);
extern int inode_setattr(struct inode *, struct iattr *); extern int inode_setattr(struct inode *, struct iattr *);
/*
* Common dentry functions for inclusion in the VFS
* or in other stackable file systems. Some of these
* functions were in linux/fs/ C (VFS) files.
*
*/
/*
* Locking the parent is needed to:
* - serialize directory operations
* - make sure the parent doesn't change from
* under us in the middle of an operation.
*
* NOTE! Right now we'd rather use a "struct inode"
* for this, but as I expect things to move toward
* using dentries instead for most things it is
* probably better to start with the conceptually
* better interface of relying on a path of dentries.
*/
static inline struct dentry *lock_parent(struct dentry *dentry)
{
struct dentry *dir = dget(dentry->d_parent);
down(&dir->d_inode->i_sem);
return dir;
}
static inline struct dentry *get_parent(struct dentry *dentry)
{
return dget(dentry->d_parent);
}
static inline void unlock_dir(struct dentry *dir)
{
up(&dir->d_inode->i_sem);
dput(dir);
}
/*
* Whee.. Deadlock country. Happily there are only two VFS
* operations that does this..
*/
static inline void double_down(struct semaphore *s1, struct semaphore *s2)
{
if (s1 != s2) {
if ((unsigned long) s1 < (unsigned long) s2) {
struct semaphore *tmp = s2;
s2 = s1; s1 = tmp;
}
down(s1);
}
down(s2);
}
/*
* Ewwwwwwww... _triple_ lock. We are guaranteed that the 3rd argument is
* not equal to 1st and not equal to 2nd - the first case (target is parent of
* source) would be already caught, the second is plain impossible (target is
* its own parent and that case would be caught even earlier). Very messy.
* I _think_ that it works, but no warranties - please, look it through.
* Pox on bloody lusers who mandated overwriting rename() for directories...
*/
static inline void triple_down(struct semaphore *s1,
struct semaphore *s2,
struct semaphore *s3)
{
if (s1 != s2) {
if ((unsigned long) s1 < (unsigned long) s2) {
if ((unsigned long) s1 < (unsigned long) s3) {
struct semaphore *tmp = s3;
s3 = s1; s1 = tmp;
}
if ((unsigned long) s1 < (unsigned long) s2) {
struct semaphore *tmp = s2;
s2 = s1; s1 = tmp;
}
} else {
if ((unsigned long) s1 < (unsigned long) s3) {
struct semaphore *tmp = s3;
s3 = s1; s1 = tmp;
}
if ((unsigned long) s2 < (unsigned long) s3) {
struct semaphore *tmp = s3;
s3 = s2; s2 = tmp;
}
}
down(s1);
} else if ((unsigned long) s2 < (unsigned long) s3) {
struct semaphore *tmp = s3;
s3 = s2; s2 = tmp;
}
down(s2);
down(s3);
}
static inline void double_up(struct semaphore *s1, struct semaphore *s2)
{
up(s1);
if (s1 != s2)
up(s2);
}
static inline void triple_up(struct semaphore *s1,
struct semaphore *s2,
struct semaphore *s3)
{
up(s1);
if (s1 != s2)
up(s2);
up(s3);
}
static inline void double_lock(struct dentry *d1, struct dentry *d2)
{
double_down(&d1->d_inode->i_sem, &d2->d_inode->i_sem);
}
static inline void double_unlock(struct dentry *d1, struct dentry *d2)
{
double_up(&d1->d_inode->i_sem,&d2->d_inode->i_sem);
dput(d1);
dput(d2);
}
#endif /* __KERNEL__ */ #endif /* __KERNEL__ */
#endif /* _LINUX_FS_H */ #endif /* _LINUX_FS_H */
...@@ -253,6 +253,8 @@ EXPORT_SYMBOL(vfs_statfs); ...@@ -253,6 +253,8 @@ EXPORT_SYMBOL(vfs_statfs);
EXPORT_SYMBOL(vfs_fstat); EXPORT_SYMBOL(vfs_fstat);
EXPORT_SYMBOL(vfs_stat); EXPORT_SYMBOL(vfs_stat);
EXPORT_SYMBOL(vfs_lstat); EXPORT_SYMBOL(vfs_lstat);
EXPORT_SYMBOL(lock_rename);
EXPORT_SYMBOL(unlock_rename);
EXPORT_SYMBOL(generic_read_dir); EXPORT_SYMBOL(generic_read_dir);
EXPORT_SYMBOL(generic_file_llseek); EXPORT_SYMBOL(generic_file_llseek);
EXPORT_SYMBOL(remote_llseek); EXPORT_SYMBOL(remote_llseek);
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment