Commits · d489cf9a3e88aafc6c9f55c36ff7bbca2c25d54a · Kirill Smelkov / linux

02 Apr, 2020 20 commits

atomic_open(): no need to pass struct open_flags anymore · d489cf9a

Al Viro authored Mar 11, 2020

argument had been unused since 1643b43f (lookup_open(): lift the
"fallback to !O_CREAT" logics from atomic_open()) back in 2016
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d489cf9a

open_last_lookups(): move complete_walk() into do_open() · ff326a32
Al Viro authored Mar 10, 2020
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
ff326a32

open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open() · b94e0b32

Al Viro authored Mar 10, 2020

Currently path_openat() has "EEXIST on O_EXCL|O_CREAT" checks done on one
of the ways out of open_last_lookups().  There are 4 cases:
	1) the last component is . or ..; check is not done.
	2) we had FMODE_OPENED or FMODE_CREATED set while in lookup_open();
check is not done.
	3) symlink to be traversed is found; check is not done (nor
should it be)
	4) everything else: check done (before complete_walk(), even).

In case (1) O_EXCL|O_CREAT ends up failing with -EISDIR - that's
	open("/tmp/.", O_CREAT|O_EXCL, 0600)
Note that in the same conditions
	open("/tmp", O_CREAT|O_EXCL, 0600)
would have yielded EEXIST.  Either error is allowed; switching to -EEXIST
in these cases would've been more consistent.

Case (2) is more subtle; first of all, if we have FMODE_CREATED set, the
object hadn't existed prior to the call.  The check should not be done in
such a case.  The rest is problematic, though - we have
	FMODE_OPENED set (i.e. it went through ->atomic_open() and got
successfully opened there)
	FMODE_CREATED is *NOT* set
	O_CREAT and O_EXCL are both set.
Any such case is a bug - either we failed to set FMODE_CREATED when we
had, in fact, created an object (no such instances in the tree) or
we have opened a pre-existing file despite having had both O_CREAT and
O_EXCL passed.  One of those was, in fact, caught (and fixed) while
sorting out this mess (gfs2 on cold dcache).  And in such situations
we should fail with EEXIST.

Note that for (1) and (4) FMODE_CREATED is not set - for (1) there's nothing
in handle_dots() to set it, for (4) we'd explicitly checked that.

And (1), (2) and (4) are exactly the cases when we leave the loop in
the caller, with do_open() called immediately after that loop.  IOW, we
can move the check over there, and make it

	If we have O_CREAT|O_EXCL and after successful pathname resolution
FMODE_CREATED is *not* set, we must have run into a preexisting file and
should fail with EEXIST.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b94e0b32
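
The consolidated rule from b94e0b32 can be written down as a small userspace model. This is only a sketch: `excl_check()` and `MODEL_FMODE_CREATED` are hypothetical names standing in for the kernel-internal check and flag, and the real code works on a struct file rather than on bare integers.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel-internal FMODE_CREATED bit. */
#define MODEL_FMODE_CREATED 0x1u

/*
 * Model of the rule: with O_CREAT|O_EXCL, a successful pathname
 * resolution that did not create the object means the object already
 * existed, so the open has to fail with EEXIST.
 */
static int excl_check(int open_flag, unsigned f_mode)
{
	bool excl_create =
		(open_flag & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL);

	if (excl_create && !(f_mode & MODEL_FMODE_CREATED))
		return -EEXIST;
	return 0;
}
```

With only one of the two flags set the check never fires, which matches the "both O_CREAT and O_EXCL" wording above.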

open_last_lookups(): don't abuse complete_walk() when all we want is unlazy · 72287417
Al Viro authored Mar 10, 2020
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
72287417
open_last_lookups(): consolidate fsnotify_create() calls · f7bb959d
Al Viro authored Mar 05, 2020
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
f7bb959d

take post-lookup part of do_last() out of loop · c5971b8c

Al Viro authored Mar 05, 2020

now we can call open_last_lookups() directly from the loop in
path_openat() - the rest of do_last() never returns a symlink
to follow, so we can bloody well leave the loop first.

Rename the rest of that thing from do_last() to do_open() and
make it return an int.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c5971b8c

link_path_walk(): sample parent's i_uid and i_mode for the last component · 0f705953
Al Viro authored Mar 05, 2020
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
0f705953

__nd_alloc_stack(): make it return bool · 60ef60c7

Al Viro authored Mar 03, 2020

... and adjust the caller (reserve_stack()).  Rename to nd_alloc_stack(),
while we are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

60ef60c7

reserve_stack(): switch to __nd_alloc_stack() · 4542576b

Al Viro authored Mar 03, 2020

expand the call of nd_alloc_stack() into it (and don't
recheck the depth on the second call)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4542576b

pick_link(): take reserving space on stack into a new helper · 49055906
Al Viro authored Mar 03, 2020
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
49055906

pick_link(): more straightforward handling of allocation failures · aef9404d

Al Viro authored Mar 02, 2020

pick_link() needs to push onto stack; we start with using two-element
array embedded into struct nameidata and the first time we need
more than that we switch to separately allocated array.

Allocation can fail, of course, and handling of that would be simple
enough - we need to drop 'link' and bugger off.  However, the things
get more complicated in RCU mode.  There we must do GFP_ATOMIC
allocation.  If that fails, we try to switch to non-RCU mode and
repeat the allocation.

To switch to non-RCU mode we need to grab references to 'link' and
to everything in nameidata.  The latter is done by unlazy_walk();
the former - legitimize_path().  'link' must go first - after
unlazy_walk() we are out of RCU-critical period and it's too
late to call legitimize_path() since the references in link->mnt
and link->dentry might be pointing to freed and reused memory.

So we do legitimize_path(), then unlazy_walk().  And that's where
it gets too subtle: what to do if the former fails?  We MUST
do path_put(link) to avoid leaks.  And we can't do that under
rcu_read_lock().  Solution in mainline was to empty the nameidata
manually, drop out of RCU mode and then do path_put().

In effect, we open-code the things eventual terminate_walk()
would've done on error in RCU mode.  That looks badly out of place
and confusing.  We could add a comment along the lines of the
explanation above, but... there's a simpler solution.  Call
unlazy_walk() even if legitimize_path() fails.  It will take
us out of RCU mode, so we'll be able to do path_put(link).

Yes, it will do unnecessary work - attempt to grab references
on the stuff in nameidata, only to have them dropped as soon
as we return the error to upper layer and get terminate_walk()
called there.  So what?  We are thoroughly off the fast path
by that point - we had GFP_ATOMIC allocation fail, we had
->d_seq or mount_lock mismatch and we are about to try walking
the same path from scratch in non-RCU mode.  Which will need
to do the same allocation, this time with GFP_KERNEL, so it will
be able to apply memory pressure for blocking stuff.

Compared to that the cost of several lockref_get_not_dead()
is noise.  And the logics become much easier to understand
that way.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

aef9404d
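
The ordering argued for in aef9404d (grab 'link' first, always leave RCU mode, only then decide whether to fail) can be sketched with a toy model. Everything below is an assumption made for illustration: the `toy_*` types and helpers are hypothetical, reference counts are plain integers, and the put is done in the same function for brevity rather than wherever the kernel does it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-ins; none of these are the kernel's real types or helpers. */
struct toy_link { int refs; bool stale; };
struct toy_nd   { bool rcu; int refs; bool stale; };

/* Try to pin 'link'; only legal while we are still in RCU mode. */
static bool toy_legitimize_path(struct toy_nd *nd, struct toy_link *link)
{
	assert(nd->rcu);
	link->refs++;		/* a reference may be held even on failure */
	return !link->stale;
}

/* Leave RCU mode, pinning what nameidata refers to; 0 or -ECHILD. */
static int toy_unlazy_walk(struct toy_nd *nd)
{
	nd->rcu = false;	/* out of RCU mode whether or not pinning worked */
	nd->refs++;		/* dropped later by the caller's cleanup */
	return nd->stale ? -ECHILD : 0;
}

static void toy_path_put(struct toy_nd *nd, struct toy_link *link)
{
	assert(!nd->rcu);	/* must not be done under rcu_read_lock() */
	link->refs--;
}

/* Failure path in the simplified shape: always unlazy, then decide. */
static int toy_leave_rcu_for_alloc(struct toy_nd *nd, struct toy_link *link)
{
	bool grabbed_link = toy_legitimize_path(nd, link); /* 'link' goes first */

	if (toy_unlazy_walk(nd) != 0 || !grabbed_link) {
		toy_path_put(nd, link);	/* safe now: we are out of RCU mode */
		return -ECHILD;
	}
	return 0;	/* caller may retry the allocation, blocking allowed */
}
```

The point of the model is that `toy_path_put()` is reached only after `toy_unlazy_walk()` has run, so no special-case cleanup under RCU is needed.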

fold path_to_nameidata() into its only remaining caller · c99687a0
Al Viro authored Mar 03, 2020
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
c99687a0

pick_link(): pass it struct path already with normal refcounting rules · 84f0cd9e

Al Viro authored Mar 03, 2020

step_into() tries to avoid grabbing and dropping mount references
on the steps that do not involve crossing mountpoints (which is
obviously the majority of cases).  So it uses a local struct path
with unusual refcounting rules - path.mnt is pinned if and only if
it's not equal to nd->path.mnt.

We used to have similar beasts all over the place and we had quite
a few bugs crop up in their handling - it's easy to get confused
when changing e.g. cleanup on failure exits (or adding a new check,
etc.)

Now that's mostly gone - the step_into() instance (which is what
we need them for) is the only one left.  It is exposed to mount
traversal and it's (shortly) seen by pick_link().  Since pick_link()
needs to store it in link stack, where the normal rules apply,
it has to make sure that mount is pinned regardless of nd->path.mnt
value.  That's done on all calls of pick_link() and very early
in those.  Let's do that in the caller (step_into()) instead -
that way fewer places need to be aware of such struct path
instances.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

84f0cd9e
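
The refcounting rule described in 84f0cd9e can be illustrated with a toy mount that carries a bare counter. This is a sketch under assumptions: `toy_mnt` and `toy_pin_for_link_stack()` are hypothetical names, and the real code deals with struct path and mount references rather than an int.

```c
#include <assert.h>

/* Toy mount with a bare reference count; not the kernel's struct vfsmount. */
struct toy_mnt { int refs; };

/*
 * A step_into()-style local path pins its mount only when that mount
 * differs from the one the walk currently stands in.  Before such a path
 * is stored in the link stack, take the missing reference so that the
 * normal rule (every holder owns its reference) applies.
 */
static void toy_pin_for_link_stack(struct toy_mnt *nd_mnt,
				   struct toy_mnt *path_mnt)
{
	assert(nd_mnt->refs > 0);	/* the walk always pins its own mount */
	if (path_mnt == nd_mnt)
		path_mnt->refs++;	/* was borrowed from nd: pin it now */
	/* otherwise it was already pinned when the mountpoint was crossed */
}
```

Doing this once in the caller means the callee only ever sees paths that follow the normal rules.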

fs/namei.c: kill follow_mount() · 19f6028a

Al Viro authored Feb 26, 2020

The only remaining caller (path_pts()) should be using follow_down()
anyway.  And clean path_pts() a bit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

19f6028a

non-RCU analogue of the previous commit · 2aa38470

Al Viro authored Feb 26, 2020

new helper: choose_mountpoint(). Wrapper around choose_mountpoint_rcu(),
similar to lookup_mnt() vs. __lookup_mnt(). follow_dotdot() switched to
it. Now we don't grab mount_lock exclusive anymore; note that the
primitive used for non-RCU mount traversals in the other direction (lookup_mnt())
doesn't bother with that either - it uses mount_lock seqcount instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2aa38470

helper for mount rootwards traversal · 7ef482fa

Al Viro authored Feb 26, 2020

The loops in follow_dotdot{_rcu()} are doing the same thing:
we have a mount and we want to find out how far up the chain
of mounts do we need to go.

We follow the chain of mounts until we find one that is not
directly overmounting the root of another mount.  If such
a mount is found, we want the location it's mounted upon.
If we run out of chain (i.e. get to a mount that is not
mounted on anything else) or run into process' root, we
report failure.

On success, we want (in RCU case) d_seq of resulting location
sampled or (in non-RCU case) references to that location
acquired.

This commit introduces such primitive for RCU case and
switches follow_dotdot_rcu() to it; non-RCU case will
go in the next commit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7ef482fa
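
The rootwards traversal described in 7ef482fa can be modelled in userspace. The sketch below assumes toy structures (`toy_mount`, `toy_path`, `mp_dentry` are hypothetical) that only mimic the relationships named in the message; it leaves out d_seq sampling and reference acquisition entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model; these structs only mimic the relationships described above. */
struct mp_dentry { const char *name; };

struct toy_mount {
	struct toy_mount *parent;	/* NULL: not mounted on anything */
	struct mp_dentry *mountpoint;	/* location in parent we sit upon */
	struct mp_dentry *root;		/* root dentry of this mount */
};

struct toy_path { struct toy_mount *mnt; struct mp_dentry *dentry; };

/*
 * Walk rootwards from 'm'.  On success fill *out with the location that
 * the first non-overmounting mount in the chain is mounted upon.
 */
static bool toy_choose_mountpoint(struct toy_mount *m,
				  const struct toy_path *proc_root,
				  struct toy_path *out)
{
	assert(m && proc_root && out);
	while (m->parent) {
		struct mp_dentry *mountpoint = m->mountpoint;

		m = m->parent;
		if (proc_root->mnt == m && proc_root->dentry == mountpoint)
			break;			/* ran into the process' root */
		if (mountpoint != m->root) {	/* not overmounting a root */
			out->mnt = m;
			out->dentry = mountpoint;
			return true;
		}
	}
	return false;				/* ran out of chain */
}
```

Both failure cases from the message (running out of chain, hitting the process' root) end up in the same `return false`.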

follow_dotdot(): be lazy about changing nd->path · 165200d6

Al Viro authored Feb 28, 2020

Change nd->path only after the loop is done and only in case we hadn't
ended up finding ourselves in root.  Same for NO_XDEV check.

That separates the "check how far back do we need to go through the
mount stack" logics from the rest of .. traversal.

NOTE: path_get/path_put introduced here are temporary.  They will
go away later in the series.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

165200d6

follow_dotdot_rcu(): be lazy about changing nd->path · efe772d6

Al Viro authored Feb 28, 2020

Change nd->path only after the loop is done and only in case we hadn't
ended up finding ourselves in root.  Same for NO_XDEV check.  Don't
recheck mount_lock on each step either.

That separates the "check how far back do we need to go through the
mount stack" logics from the rest of .. traversal.

Note that the sequence for d_seq/d_inode here is
	* sample mount_lock seqcount
...
	* sample d_seq
	* fetch d_inode
	* verify mount_lock seqcount
The last step makes sure that d_inode value we'd got matches d_seq -
the dentry is guaranteed to have been a mountpoint through the
entire thing, so its d_inode must have been stable.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

efe772d6
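
The sampling order listed in efe772d6 can be shown with a toy sequence counter. This is a single-threaded sketch with no memory barriers or atomics, so it illustrates only the order of the steps; `toy_seqcount`, `seq_dentry` and the helpers are hypothetical and are not the kernel's seqcount API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sequence counter: a writer bumps it before and after each update. */
struct toy_seqcount { unsigned sequence; };

static unsigned toy_read_begin(const struct toy_seqcount *s)
{
	return s->sequence & ~1u;	/* odd value: a write is in progress */
}

static bool toy_read_retry(const struct toy_seqcount *s, unsigned start)
{
	return s->sequence != start;
}

static void toy_write(struct toy_seqcount *s, int *field, int value)
{
	assert((s->sequence & 1) == 0);	/* no nested writers in this model */
	s->sequence++;			/* odd: update in progress */
	*field = value;
	s->sequence++;			/* even again: update complete */
}

struct seq_dentry { struct toy_seqcount d_seq; int d_inode; };

/*
 * The order from the text: sample mount_lock, sample d_seq, fetch d_inode,
 * verify mount_lock.  Returns false if the caller has to start over.
 */
static bool toy_sample(const struct toy_seqcount *mount_lock,
		       const struct seq_dentry *d,
		       unsigned *seqp, int *inodep)
{
	unsigned m_seq = toy_read_begin(mount_lock);

	*seqp = toy_read_begin(&d->d_seq);
	*inodep = d->d_inode;
	return !toy_read_retry(mount_lock, m_seq);
}
```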

follow_dotdot{,_rcu}(): massage loops · 12487f30

Al Viro authored Feb 26, 2020

The logics in both of them is the same:
	while true
		if in process' root	// uncommon
			break
		if *not* in mount root	// normal case
			find the parent
			return
		if at absolute root	// very uncommon
			break
		move to underlying mountpoint
	report that we are in root

Pull the common path out of the loop:
	if in process' root		// uncommon
		goto in_root
	if unlikely(in mount root)
		while true
			if at absolute root
				goto in_root
			move to underlying mountpoint
			if in process' root
				goto in_root
			if *not* in mount root
				break;
	find the parent	// we are not in mount root
	return
in_root:
	report that we are in root

The reason for that transformation is that we get to keep the
common path straight *and* get a separate block for "move
through underlying mountpoints", which will allow us to sanitize
NO_XDEV handling there.  What's more, the pared-down loops
will be easier to deal with - in particular, non-RCU case
has no need to grab mount_lock and rewriting it to the
form that wouldn't do that is a non-trivial change.  Better
do that with less stuff getting in the way...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

12487f30
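
The two loop shapes from 12487f30 can be compared on a toy mount table. The sketch assumes a deliberately crude model: a location is a (mount, depth below that mount's root) pair and "find the parent" is a decrement. The names `toy_tree`, `toy_pos`, `dotdot_loop()` and `dotdot_lifted()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_PARENT (-1)

/* Toy mount table: mount i sits at depth mp_depth[i] inside parent[i]. */
struct toy_tree { const int *parent; const int *mp_depth; };

/* A location: which mount we are in and how deep below its root. */
struct toy_pos { int mnt; int depth; };

static bool same_pos(struct toy_pos a, struct toy_pos b)
{
	return a.mnt == b.mnt && a.depth == b.depth;
}

/* Original shape: every check inside the loop.  True means "in root". */
static bool dotdot_loop(const struct toy_tree *t, struct toy_pos root,
			struct toy_pos *p)
{
	for (;;) {
		if (same_pos(*p, root))			/* process' root */
			break;
		if (p->depth != 0) {			/* not in mount root */
			p->depth--;			/* find the parent */
			return false;
		}
		if (t->parent[p->mnt] == NO_PARENT)	/* absolute root */
			break;
		p->depth = t->mp_depth[p->mnt];		/* underlying mountpoint */
		p->mnt = t->parent[p->mnt];
	}
	return true;
}

/* Transformed shape: common path straight, mount climbing on its own. */
static bool dotdot_lifted(const struct toy_tree *t, struct toy_pos root,
			  struct toy_pos *p)
{
	if (same_pos(*p, root))
		goto in_root;
	if (p->depth == 0) {				/* in mount root */
		for (;;) {
			if (t->parent[p->mnt] == NO_PARENT)
				goto in_root;
			p->depth = t->mp_depth[p->mnt];
			p->mnt = t->parent[p->mnt];
			if (same_pos(*p, root))
				goto in_root;
			if (p->depth != 0)		/* not in mount root */
				break;
		}
	}
	assert(p->depth > 0);
	p->depth--;					/* find the parent */
	return false;
in_root:
	return true;
}
```

Running both functions from the same start and comparing the result and the final position is a cheap way to convince oneself that the transformation preserves behaviour in this model.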

lift all calls of step_into() out of follow_dotdot/follow_dotdot_rcu · c2df1968

Al Viro authored Feb 26, 2020

lift step_into() into handle_dots() (where they merge with each other);
have follow_... return dentry and pass inode/seq to the caller.

[braino fix folded; kudos to Qian Cai <cai@lca.pw> for reporting it]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c2df1968

14 Mar, 2020 20 commits

follow_dotdot{,_rcu}(): switch to use of step_into() · 6dfd9fe5

Al Viro authored Feb 26, 2020

gets the regular mount crossing on result of ..
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

6dfd9fe5

handle_dots(), follow_dotdot{,_rcu}(): preparation to switch to step_into() · 7521f22b

Al Viro authored Feb 26, 2020

Right now the tail ends of follow_dotdot{,_rcu}() are pretty
much the open-coded analogues of step_into().  The differences:
	* the lack of proper LOOKUP_NO_XDEV handling in non-RCU case
(arguably a bug)
	* the lack of ->d_manage() handling (again, arguably a bug)

Adjust the calling conventions so that on the next step we could
just switch those functions to returning step_into().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7521f22b

move handle_dots(), follow_dotdot() and follow_dotdot_rcu() past step_into() · 957dd41d
Al Viro authored Feb 26, 2020
```
pure move; we are going to have step_into() called by that bunch.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
957dd41d

follow_dotdot{,_rcu}(): lift LOOKUP_BENEATH checks out of loop · c9a0f75d

Al Viro authored Feb 26, 2020

Behaviour change: LOOKUP_BENEATH lookup of .. in absolute root
yields an error even if it's not the process' root.  That's
possible only if you'd managed to escape chroot jail by way of
procfs symlinks, but IMO the resulting behaviour is not worse -
more consistent and easier to describe:
	".." in root is "stay where you are", unless LOOKUP_BENEATH
	has been given, in which case it's "fail with EXDEV".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c9a0f75d

follow_dotdot{,_rcu}(): lift switching nd->path to parent out of loop · abc2c632
Al Viro authored Feb 26, 2020
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
abc2c632
expand path_parent_directory() in its callers · a6a7eb76
Al Viro authored Mar 11, 2020
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
a6a7eb76

path_parent_directory(): leave changing path->dentry to callers · 63b27720

Al Viro authored Feb 24, 2020

Instead of returning 0, return new dentry; instead of returning
-ENOENT, return NULL.  Adjust the callers accordingly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

63b27720

path_connected(): pass mount and dentry separately · 6b03f7ed

Al Viro authored Feb 24, 2020

eventually we'll want to do that check *before* mangling
nd->path.dentry...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

6b03f7ed

split the lookup-related parts of do_last() into a separate helper · c981a482
Al Viro authored Jan 26, 2020
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
c981a482
do_last(): rejoin the common path even earlier in FMODE_{OPENED,CREATED} case · 973d4b73
Al Viro authored Jan 26, 2020
```
... getting may_create_in_sticky() checks in FMODE_OPENED case as well.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
973d4b73

do_last(): simplify the liveness analysis past finish_open_created · 8795e7d4

Al Viro authored Jan 26, 2020

Don't mess with got_write there - it is guaranteed to be false on
entry and it will be set true if and only if we decide to go for
truncation and manage to get write access for that.

Don't carry acc_mode through the entire thing - it's only used
in that part.  And don't bother with gotos in there - compiler is
quite capable of optimizing that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8795e7d4

do_last(): rejoin the common path earlier in FMODE_{OPENED,CREATED} case · 5a2d3edd
Al Viro authored Jan 26, 2020
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
5a2d3edd

do_last(): don't bother with keeping got_write in FMODE_OPENED case · 59e96e65

Al Viro authored Jan 26, 2020

it's easier to drop it right after lookup_open() and regain if
needed (i.e. if we will need to truncate).  On the non-FMODE_OPENED
path we do that anyway.  In case of FMODE_CREATED we won't be
needing it.  And it's easier to prove correctness that way,
especially since the initial failure to get write access is not
always fatal; proving that we'll never end up truncating in that
case is rather convoluted.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

59e96e65

do_last(): merge the may_open() calls · 3ad5615a

Al Viro authored Jan 26, 2020

have FMODE_OPENED case rejoin the main path at earlier point
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3ad5615a

atomic_open(): lift the call of may_open() into do_last() · 7be219b4

Al Viro authored Jan 26, 2020

there we'll be able to merge it with its counterparts in other
cases, and there's no reason to do it before the parent has
been unlocked
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7be219b4

atomic_open(): return the right dentry in FMODE_OPENED case · 6fb968cd

Al Viro authored Jan 26, 2020

->atomic_open() might have used a different alias than the one we'd
passed to it; in "not opened" case we take care of that, in "opened"
one we don't.  Currently we don't care downstream of "opened" case
which alias to return; however, that will change shortly when we
get to unifying may_open() calls.

It's not hard to get right in all cases, anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

6fb968cd

new helper: traverse_mounts() · 9deed3eb

Al Viro authored Jan 17, 2020

common guts of follow_down() and follow_managed() taken to a new
helper - traverse_mounts().  The remnants of follow_managed()
are folded into its sole remaining caller (handle_mounts()).
Calling conventions of handle_mounts() slightly sanitized -
instead of the weird "1 for success, -E... for failure" that used
to be imposed by the calling conventions of walk_component() et al.
we can use the normal "0 for success, -E... for failure".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9deed3eb

massage __follow_mount_rcu() a bit · ea936aeb

Al Viro authored Jan 16, 2020

make the loop more similar to that in follow_managed(), with
explicit tracking of flags, etc.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ea936aeb

namei: have link_path_walk() maintain LOOKUP_PARENT · c108837e

Al Viro authored Mar 05, 2020

set on entry, clear when we get to the last component.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c108837e

link_path_walk(): simplify stack handling · d8d4611a

Al Viro authored Feb 23, 2020

We use nd->stack to store two things: pinning down the symlinks
we are resolving and resuming the name traversal when a nested
symlink is finished.

Currently, nd->depth is used to keep track of both.  It's 0 when
we call link_path_walk() for the first time (for the pathname
itself) and 1 on all subsequent calls (for trailing symlinks,
if any).  That's fine, as far as pinning symlinks goes - when
handling a trailing symlink, the string we are interpreting
is the body of symlink pinned down in nd->stack[0].  It's
rather inconvenient with respect to handling nested symlinks,
though - when we run out of a string we are currently interpreting,
we need to decide whether it's a nested symlink (in which case
we need to pick the string saved back when we started to interpret
that nested symlink and resume its traversal) or not (in which
case we are done with link_path_walk()).

Current solution is a bit of a kludge - in handling of trailing symlink
(in lookup_last() and open_last_lookups()) we clear nd->stack[0].name.
That allows link_path_walk() to use the following rules when
running out of a string to interpret:
	* if nd->depth is zero, we are at the end of pathname itself.
	* if nd->depth is positive, check the saved string; for
nested symlink it will be non-NULL, for trailing symlink - NULL.

It works, but it's rather non-obvious.  Note that we have two sets:
the set of symlinks currently being traversed and the set of postponed
pathname tails.  The former is stored in nd->stack[0..nd->depth-1].link
and it's valid throughout the pathname resolution; the latter is valid only
during an individual call of link_path_walk() and it occupies
nd->stack[0..nd->depth-1].name for the first call of link_path_walk() and
nd->stack[1..nd->depth-1].name for subsequent ones.  The kludge is basically
a way to recognize the second set becoming empty.

The things get simpler if we keep track of the second set's size
explicitly and always store it in nd->stack[0..depth-1].name.
We access the second set only inside link_path_walk(), so its
size can live in a local variable; that way the check becomes
trivial without the need of that kludge.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d8d4611a
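
The idea in d8d4611a of tracking the number of postponed pathname tails in a local variable can be sketched with a toy walker. This is an illustration under assumptions: symlinks come from a hypothetical in-memory table, every symlink is treated as nested (trailing symlinks are not modelled), and `toy_walk()` simply writes out the expanded components.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_DEPTH 8

/* Hypothetical symlink table: a component name and the body it expands to. */
struct toy_symlink { const char *name; const char *body; };

static const char *toy_lookup(const struct toy_symlink *tab, size_t n,
			      const char *comp, size_t len)
{
	for (size_t i = 0; i < n; i++)
		if (strlen(tab[i].name) == len && !memcmp(tab[i].name, comp, len))
			return tab[i].body;
	return NULL;
}

/*
 * Expand 'name' into 'out' (components joined by '/').  Postponed tails
 * live in saved[0..depth-1]; running out of string with depth == 0 means
 * the whole walk is done.  Returns 0, or -1 on overflow.
 */
static int toy_walk(const struct toy_symlink *tab, size_t n,
		    const char *name, char *out, size_t outsz)
{
	const char *saved[TOY_MAX_DEPTH];
	int depth = 0;		/* size of the set of postponed tails */
	size_t used = 0;

	assert(outsz > 0);
	out[0] = '\0';
	for (;;) {
		while (*name == '/')
			name++;
		if (!*name) {
			if (!depth)
				return 0;	/* end of the pathname itself */
			name = saved[--depth];	/* resume the postponed tail */
			continue;
		}
		size_t len = strcspn(name, "/");
		const char *body = toy_lookup(tab, n, name, len);

		if (body) {
			if (depth == TOY_MAX_DEPTH)
				return -1;	/* nested too deeply */
			saved[depth++] = name + len;	/* postpone the rest */
			name = body;
			continue;
		}
		if (used + len + 2 > outsz)
			return -1;		/* output buffer too small */
		if (used)
			out[used++] = '/';
		memcpy(out + used, name, len);
		used += len;
		out[used] = '\0';
		name += len;
	}
}
```

Because `depth` counts only postponed tails, the "are we done?" test is a plain comparison with zero and needs no marker stored in the stack itself.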