Commit 1428cc0e authored by NeilBrown's avatar NeilBrown Committed by Jonathan Corbet

Documentation: update path-lookup.md for parallel lookups

Since this document was written, i_mutex has been replace with
i_rwsem, and shared locks are utilized to allow lookups in the one
directory to happen in parallel.

So replace i_mutex with i_rwsem, and explain how this is used for
parallel lookups.
Signed-off-by: default avatarNeilBrown <neilb@suse.com>
Signed-off-by: default avatarJonathan Corbet <corbet@lwn.net>
parent 806654a9
...@@ -12,6 +12,10 @@ This write-up is based on three articles published at lwn.net: ...@@ -12,6 +12,10 @@ This write-up is based on three articles published at lwn.net:
- <https://lwn.net/Articles/650786/> A walk among the symlinks - <https://lwn.net/Articles/650786/> A walk among the symlinks
Written by Neil Brown with help from Al Viro and Jon Corbet. Written by Neil Brown with help from Al Viro and Jon Corbet.
It has subsequently been updated to reflect changes in the kernel
including:
- per-directory parallel name lookup.
Introduction Introduction
------------ ------------
...@@ -231,37 +235,80 @@ renamed. If `d_lookup` finds that a rename happened while it ...@@ -231,37 +235,80 @@ renamed. If `d_lookup` finds that a rename happened while it
unsuccessfully scanned a chain in the hash table, it simply tries unsuccessfully scanned a chain in the hash table, it simply tries
again. again.
### inode->i_mutex ### ### inode->i_rwsem ###
`i_mutex` is a mutex that serializes all changes to a particular `i_rwsem` is a read/write semaphore that serializes all changes to a particular
directory. This ensures that, for example, an `unlink()` and a `rename()` directory. This ensures that, for example, an `unlink()` and a `rename()`
cannot both happen at the same time. It also keeps the directory cannot both happen at the same time. It also keeps the directory
stable while the filesystem is asked to look up a name that is not stable while the filesystem is asked to look up a name that is not
currently in the dcache. currently in the dcache or, optionally, when the list of entries in a
directory is being retrieved with `readdir()`.
This has a complementary role to that of `d_lock`: `i_mutex` on a This has a complementary role to that of `d_lock`: `i_rwsem` on a
directory protects all of the names in that directory, while `d_lock` directory protects all of the names in that directory, while `d_lock`
on a name protects just one name in a directory. Most changes to the on a name protects just one name in a directory. Most changes to the
dcache hold `i_mutex` on the relevant directory inode and briefly take dcache hold `i_rwsem` on the relevant directory inode and briefly take
`d_lock` on one or more the dentries while the change happens. One `d_lock` on one or more the dentries while the change happens. One
exception is when idle dentries are removed from the dcache due to exception is when idle dentries are removed from the dcache due to
memory pressure. This uses `d_lock`, but `i_mutex` plays no role. memory pressure. This uses `d_lock`, but `i_rwsem` plays no role.
The mutex affects pathname lookup in two distinct ways. Firstly it The semaphore affects pathname lookup in two distinct ways. Firstly it
serializes lookup of a name in a directory. `walk_component()` uses prevents changes during lookup of a name in a directory. `walk_component()` uses
`lookup_fast()` first which, in turn, checks to see if the name is in the cache, `lookup_fast()` first which, in turn, checks to see if the name is in the cache,
using only `d_lock` locking. If the name isn't found, then `walk_component()` using only `d_lock` locking. If the name isn't found, then `walk_component()`
falls back to `lookup_slow()` which takes `i_mutex`, checks again that falls back to `lookup_slow()` which takes a shared lock on `i_rwsem`, checks again that
the name isn't in the cache, and then calls in to the filesystem to get a the name isn't in the cache, and then calls in to the filesystem to get a
definitive answer. A new dentry will be added to the cache regardless of definitive answer. A new dentry will be added to the cache regardless of
the result. the result.
Secondly, when pathname lookup reaches the final component, it will Secondly, when pathname lookup reaches the final component, it will
sometimes need to take `i_mutex` before performing the last lookup so sometimes need to take an exclusive lock on `i_rwsem` before performing the last lookup so
that the required exclusion can be achieved. How path lookup chooses that the required exclusion can be achieved. How path lookup chooses
to take, or not take, `i_mutex` is one of the to take, or not take, `i_rwsem` is one of the
issues addressed in a subsequent section. issues addressed in a subsequent section.
If two threads attempt to look up the same name at the same time - a
name that is not yet in the dcache - the shared lock on `i_rwsem` will
not prevent them both adding new dentries with the same name. As this
would result in confusion an extra level of interlocking is used,
based around a secondary hash table (`in_lookup_hashtable`) and a
per-dentry flag bit (`DCACHE_PAR_LOOKUP`).
To add a new dentry to the cache while only holding a shared lock on
`i_rwsem`, a thread must call `d_alloc_parallel()`. This allocates a
dentry, stores the required name and parent in it, checks if there
is already a matching dentry in the primary or secondary hash
tables, and if not, stores the newly allocated dentry in the secondary
hash table, with `DCACHE_PAR_LOOKUP` set.
If a matching dentry was found in the primary hash table then that is
returned and the caller can know that it lost a race with some other
thread adding the entry. If no matching dentry is found in either
cache, the newly allocated dentry is returned and the caller can
detect this from the presence of `DCACHE_PAR_LOOKUP`. In this case it
knows that it has won any race and now is responsible for asking the
filesystem to perform the lookup and find the matching inode. When
the lookup is complete, it must call `d_lookup_done()` which clears
the flag and does some other house keeping, including removing the
dentry from the secondary hash table - it will normally have been
added to the primary hash table already. Note that a `struct
waitqueue_head` is passed to `d_alloc_parallel()`, and
`d_lookup_done()` must be called while this `waitqueue_head` is still
in scope.
If a matching dentry is found in the secondary hash table,
`d_alloc_parallel()` has a little more work to do. It first waits for
`DCACHE_PAR_LOOKUP` to be cleared, using a wait_queue that was passed
to the instance of `d_alloc_parallel()` that won the race and that
will be woken by the call to `d_lookup_done()`. It then checks to see
if the dentry has now been added to the primary hash table. If it
has, the dentry is returned and the caller just sees that it lost any
race. If it hasn't been added to the primary hash table, the most
likely explanation is that some other dentry was added instead using
`d_splice_alias()`. In any case, `d_alloc_parallel()` repeats all the
look ups from the start and will normally return something from the
primary hash table.
### mnt->mnt_count ### ### mnt->mnt_count ###
`mnt_count` is a per-CPU reference counter on "`mount`" structures. `mnt_count` is a per-CPU reference counter on "`mount`" structures.
...@@ -376,7 +423,7 @@ described. If it finds a `LAST_NORM` component it first calls ...@@ -376,7 +423,7 @@ described. If it finds a `LAST_NORM` component it first calls
"`lookup_fast()`" which only looks in the dcache, but will ask the "`lookup_fast()`" which only looks in the dcache, but will ask the
filesystem to revalidate the result if it is that sort of filesystem. filesystem to revalidate the result if it is that sort of filesystem.
If that doesn't get a good result, it calls "`lookup_slow()`" which If that doesn't get a good result, it calls "`lookup_slow()`" which
takes the `i_mutex`, rechecks the cache, and then asks the filesystem takes `i_rwsem`, rechecks the cache, and then asks the filesystem
to find a definitive answer. Each of these will call to find a definitive answer. Each of these will call
`follow_managed()` (as described below) to handle any mount points. `follow_managed()` (as described below) to handle any mount points.
...@@ -408,7 +455,7 @@ of housekeeping around `link_path_walk()` and returns the parent ...@@ -408,7 +455,7 @@ of housekeeping around `link_path_walk()` and returns the parent
directory and final component to the caller. The caller will be either directory and final component to the caller. The caller will be either
aiming to create a name (via `filename_create()`) or remove or rename aiming to create a name (via `filename_create()`) or remove or rename
a name (in which case `user_path_parent()` is used). They will use a name (in which case `user_path_parent()` is used). They will use
`i_mutex` to exclude other changes while they validate and then `i_rwsem` to exclude other changes while they validate and then
perform their operation. perform their operation.
`path_lookupat()` is nearly as simple - it is used when an existing `path_lookupat()` is nearly as simple - it is used when an existing
...@@ -429,7 +476,7 @@ complexity needed to handle the different subtleties of O_CREAT (with ...@@ -429,7 +476,7 @@ complexity needed to handle the different subtleties of O_CREAT (with
or without O_EXCL), final "`/`" characters, and trailing symbolic or without O_EXCL), final "`/`" characters, and trailing symbolic
links. We will revisit this in the final part of this series, which links. We will revisit this in the final part of this series, which
focuses on those symbolic links. "`do_last()`" will sometimes, but focuses on those symbolic links. "`do_last()`" will sometimes, but
not always, take `i_mutex`, depending on what it finds. not always, take `i_rwsem`, depending on what it finds.
Each of these, or the functions which call them, need to be alert to Each of these, or the functions which call them, need to be alert to
the possibility that the final component is not `LAST_NORM`. If the the possibility that the final component is not `LAST_NORM`. If the
...@@ -728,12 +775,12 @@ checking the `seq` number of the old exactly mirrors the process of ...@@ -728,12 +775,12 @@ checking the `seq` number of the old exactly mirrors the process of
getting a counted reference to the new dentry before dropping that for getting a counted reference to the new dentry before dropping that for
the old dentry which we saw in REF-walk. the old dentry which we saw in REF-walk.
### No `inode->i_mutex` or even `rename_lock` ### ### No `inode->i_rwsem` or even `rename_lock` ###
A mutex is a fairly heavyweight lock that can only be taken when it is A semaphore is a fairly heavyweight lock that can only be taken when it is
permissible to sleep. As `rcu_read_lock()` forbids sleeping, permissible to sleep. As `rcu_read_lock()` forbids sleeping,
`inode->i_mutex` plays no role in RCU-walk. If some other thread does `inode->i_rwsem` plays no role in RCU-walk. If some other thread does
take `i_mutex` and modifies the directory in a way that RCU-walk needs take `i_rwsem` and modifies the directory in a way that RCU-walk needs
to notice, the result will be either that RCU-walk fails to find the to notice, the result will be either that RCU-walk fails to find the
dentry that it is looking for, or it will find a dentry which dentry that it is looking for, or it will find a dentry which
`read_seqretry()` won't validate. In either case it will drop down to `read_seqretry()` won't validate. In either case it will drop down to
...@@ -1134,7 +1181,7 @@ and `do_last()`, each of which use the same convention as ...@@ -1134,7 +1181,7 @@ and `do_last()`, each of which use the same convention as
to be followed. to be followed.
Of these, `do_last()` is the most interesting as it is used for Of these, `do_last()` is the most interesting as it is used for
opening a file. Part of `do_last()` runs with `i_mutex` held and this opening a file. Part of `do_last()` runs with `i_rwsem` held and this
part is in a separate function: `lookup_open()`. part is in a separate function: `lookup_open()`.
Explaining `do_last()` completely is beyond the scope of this article, Explaining `do_last()` completely is beyond the scope of this article,
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment