Commit c0a572d9 authored by Linus Torvalds

Merge tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount updates from Christian Brauner:
 "This contains the work to extend move_mount() to allow adding a mount
  beneath the topmost mount of a mount stack.

  There are two LWN articles about this. One covers the original patch
  series in [1]. The other in [2] summarizes the session and roughly the
  discussion between Al and me at LSFMM. The second article also goes
  into some good questions from attendees.

  Since all details are found in the relevant commit, with a technical
  dive into semantics and locking at the end, I'm only adding the
  motivation and core functionality from the commit message and leaving
  out the invasive details. The code is also heavily commented and
  annotated, which was explicitly requested.

  TL;DR:

    > mount -t ext4 /dev/sda /mnt
      |
      └─/mnt    /dev/sda    ext4

    > mount --beneath -t xfs /dev/sdb /mnt
      |
      └─/mnt    /dev/sdb    xfs
        └─/mnt  /dev/sda    ext4

    > umount /mnt
      |
      └─/mnt    /dev/sdb    xfs

  The longer motivation is that various distributions are adding or are
  in the process of adding support for system extensions and in the
  future configuration extensions through various tools. A more detailed
  explanation on system and configuration extensions can be found on the
  manpage which is listed below at [3].

  System extension images may, dynamically at runtime, extend the
  /usr/ and /opt/ directory hierarchies with additional files. This is
  particularly useful on immutable system images where a /usr/ and/or
  /opt/ hierarchy residing on a read-only file system shall be extended
  temporarily at runtime without making any persistent modifications.

  When one or more system extension images are activated, their /usr/
  and /opt/ hierarchies are combined via overlayfs with the same
  hierarchies of the host OS, and the host /usr/ and /opt/ overmounted
  with it ("merging"). When they are deactivated, the mount point is
  disassembled — again revealing the unmodified original host version of
  the hierarchy ("unmerging"). Merging thus makes the extension's
  resources suddenly appear below the /usr/ and /opt/ hierarchies as if
  they were included in the base OS image itself. Unmerging makes them
  disappear again, leaving in place only the files that were shipped
  with the base OS image itself.

  System configuration images are similar but operate on directories
  containing system or service configuration.

  On nearly all modern distributions mount propagation plays a crucial
  role and the rootfs of the OS is a shared mount in a peer group
  (usually with peer group id 1):

     TARGET  SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
     /       /       ext4    shared:1     29      1

  On such systems all services and containers run in a separate mount
  namespace and are pivot_root()ed into their rootfs. A separate mount
  namespace is almost always used as it is the minimal isolation
  mechanism services have. But usually they are even much more isolated
  up to the point where they almost become indistinguishable from
  containers.

  Mount propagation again plays a crucial role here. The rootfs of all
  these services is a slave mount to the peer group of the host rootfs.
  This is done so the service will receive mount propagation events from
  the host when certain files or directories are updated.

  In addition, the rootfs of each service, container, and sandbox is
  also a shared mount in its separate peer group:

     TARGET  SOURCE  FSTYPE  PROPAGATION         MNT_ID  PARENT_ID
     /       /       ext4    shared:24 master:1  71      47

  For people not too familiar with mount propagation, the master:1 means
  that this is a slave mount to peer group 1, which, as one can see, is
  the host rootfs as indicated by shared:1 above. The shared:24
  indicates that the service rootfs is a shared mount in a separate peer
  group with peer group id 24.
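
To make the PROPAGATION column concrete, here is a minimal sketch of a parser for strings such as "shared:24 master:1". The names `struct propagation` and `parse_propagation()` are hypothetical, and using 0 to mean "tag not present" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Result of parsing a PROPAGATION column; 0 means "tag not present". */
struct propagation {
	int shared;	/* id of the peer group this mount is shared in */
	int master;	/* id of the peer group this mount is a slave to */
};

/* Parse strings such as "shared:24 master:1", "shared:1" or "master:5". */
static struct propagation parse_propagation(const char *field)
{
	struct propagation p = { 0, 0 };
	const char *s;

	s = strstr(field, "shared:");
	if (s)
		sscanf(s, "shared:%d", &p.shared);
	s = strstr(field, "master:");
	if (s)
		sscanf(s, "master:%d", &p.master);
	return p;
}

/* The service rootfs from the table above: own peer group 24, slave to 1. */
static void service_rootfs_example(void)
{
	struct propagation p = parse_propagation("shared:24 master:1");

	assert(p.shared == 24);
	assert(p.master == 1);
}
```

Applied to the host rootfs entry ("shared:1"), the same function reports peer group 1 and no master, matching the explanation above.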

  A service may run other services. Such nested services will also have
  a rootfs mount that is a slave to the peer group of the outer service
  rootfs mount.

  For containers things are just slightly different. A container's rootfs
  isn't a slave to the service's or host rootfs' peer group. The rootfs
  mount of a container is simply a shared mount in its own peer group:

     TARGET                    SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
     /home/ubuntu/debian-tree  /       ext4    shared:99    61      60

  So whereas services are isolated OS components a container is treated
  like a separate world and mount propagation into it is restricted to a
  single well known mount that is a slave to the peer group of the
  shared mount /run on the host:

     TARGET                  SOURCE              FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
     /propagate/debian-tree  /run/host/incoming  tmpfs   master:5     71      68

  Here, the master:5 indicates that this mount is a slave to the peer
  group with peer group id 5. This allows mounts to be propagated into
  the container and served as a workaround for not being able to insert
  mounts into mount namespaces directly. But the new mount api does
  support inserting mounts directly. For the interested reader the
  blogpost in [4] might be worth reading where I explain the old and the
  new approach to inserting mounts into mount namespaces.

  Containers, of course, can themselves be run as services. They often
  run full systems themselves which means they again run services and
  containers with the exact same propagation settings explained above.

  The whole system is designed so that it can be easily updated,
  including all services in various fine-grained ways without having to
  enter every single service's mount namespace which would be
  prohibitively expensive. The mount propagation layout has been
  carefully chosen so it is possible to propagate updates for system
  extensions and configurations from the host into all services.

  The simplest model to update the whole system is to mount on top of
  /usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
  will then propagate into every service. This works cleanly the first
  time. However, when the system is updated multiple times it becomes
  necessary to unmount the first update on /opt, /usr, /etc and then
  propagate the new update. But this means there's an interval where
  the old base system is accessible. This has to be avoided to protect
  against downgrade attacks.

  The vfs already exposes a mechanism to userspace whereby mounts can be
  mounted beneath an existing mount. Such mounts are internally referred
  to as "tucked". The patch series exposes the ability to mount beneath
  a top mount through the new MOVE_MOUNT_BENEATH flag for the
  move_mount() system call. This allows userspace to seamlessly upgrade
  mounts. After this series the only thing that will have changed is
  that mounting beneath an existing mount can be done explicitly instead
  of just implicitly.

  The crux is that the proposed mechanism already exists and that it is
  so powerful as to cover cases where mounts are supposed to be updated
  with new versions. Crucially, it offers an important flexibility:
  updates to a system may either be forced or be delayed, with the
  umount of the top mount left to a service if it is a cooperative
  one."
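
From userspace, the new flag would be passed to move_mount() roughly as sketched below. The wrapper name `mount_beneath()` is hypothetical. The fallback constants are assumptions for build environments whose headers predate the flag: 429 is the syscall number on most architectures (for example x86-64 and arm64), and the flag values are the ones from the kernel's uapi mount header. `mnt_fd` is assumed to refer to a mount, for example one obtained from fsmount() or open_tree() with OPEN_TREE_CLONE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; the values are assumptions of this sketch. */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200
#endif

/*
 * Mount the mount referred to by @mnt_fd beneath the top mount at @target.
 * Returns 0 on success or a negative errno value on failure.
 */
static int mount_beneath(int mnt_fd, const char *target)
{
	long ret;

	assert(target != NULL);
	ret = syscall(SYS_move_mount, mnt_fd, "", AT_FDCWD, target,
		      MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH);
	return ret < 0 ? -errno : 0;
}
```

A later umount of the target then reveals the mount that was placed beneath it. Per the diff in this commit, the kernel rejects MOVE_MOUNT_BENEATH combined with MOVE_MOUNT_SET_GROUP with -EINVAL, and the call is expected to fail on kernels that do not know the flag.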

Link: https://lwn.net/Articles/927491 [1]
Link: https://lwn.net/Articles/934094 [2]
Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [3]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [4]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

* tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: allow to mount beneath top mount
  fs: use a for loop when locking a mount
  fs: properly document __lookup_mnt()
  fs: add path_mounted()
parents 1f2300a7 6ac39281
@@ -665,9 +665,25 @@ static bool legitimize_mnt(struct vfsmount *bastard, unsigned seq)
 	return false;
 }
 
-/*
- * find the first mount at @dentry on vfsmount @mnt.
- * call under rcu_read_lock()
+/**
+ * __lookup_mnt - find first child mount
+ * @mnt: parent mount
+ * @dentry: mountpoint
+ *
+ * If @mnt has a child mount @c mounted @dentry find and return it.
+ *
+ * Note that the child mount @c need not be unique. There are cases
+ * where shadow mounts are created. For example, during mount
+ * propagation when a source mount @mnt whose root got overmounted by a
+ * mount @o after path lookup but before @namespace_sem could be
+ * acquired gets copied and propagated. So @mnt gets copied including
+ * @o. When @mnt is propagated to a destination mount @d that already
+ * has another mount @n mounted at the same mountpoint then the source
+ * mount @mnt will be tucked beneath @n, i.e., @n will be mounted on
+ * @mnt and @mnt mounted on @d. Now both @n and @o are mounted at @mnt
+ * on @dentry.
+ *
+ * Return: The first child of @mnt mounted @dentry or NULL.
  */
 struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
 {
@@ -917,6 +933,33 @@ void mnt_set_mountpoint(struct mount *mnt,
 	hlist_add_head(&child_mnt->mnt_mp_list, &mp->m_list);
 }
 
+/**
+ * mnt_set_mountpoint_beneath - mount a mount beneath another one
+ *
+ * @new_parent: the source mount
+ * @top_mnt: the mount beneath which @new_parent is mounted
+ * @new_mp: the new mountpoint of @top_mnt on @new_parent
+ *
+ * Remove @top_mnt from its current mountpoint @top_mnt->mnt_mp and
+ * parent @top_mnt->mnt_parent and mount it on top of @new_parent at
+ * @new_mp. And mount @new_parent on the old parent and old
+ * mountpoint of @top_mnt.
+ *
+ * Context: This function expects namespace_lock() and lock_mount_hash()
+ * to have been acquired in that order.
+ */
+static void mnt_set_mountpoint_beneath(struct mount *new_parent,
+				       struct mount *top_mnt,
+				       struct mountpoint *new_mp)
+{
+	struct mount *old_top_parent = top_mnt->mnt_parent;
+	struct mountpoint *old_top_mp = top_mnt->mnt_mp;
+
+	mnt_set_mountpoint(old_top_parent, old_top_mp, new_parent);
+	mnt_change_mountpoint(new_parent, new_mp, top_mnt);
+}
+
 static void __attach_mnt(struct mount *mnt, struct mount *parent)
 {
 	hlist_add_head_rcu(&mnt->mnt_hash,
@@ -924,15 +967,42 @@ static void __attach_mnt(struct mount *mnt, struct mount *parent)
 	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
 }
 
-/*
- * vfsmount lock must be held for write
+/**
+ * attach_mnt - mount a mount, attach to @mount_hashtable and parent's
+ * list of child mounts
+ * @parent: the parent
+ * @mnt: the new mount
+ * @mp: the new mountpoint
+ * @beneath: whether to mount @mnt beneath or on top of @parent
+ *
+ * If @beneath is false, mount @mnt at @mp on @parent. Then attach @mnt
+ * to @parent's child mount list and to @mount_hashtable.
+ *
+ * If @beneath is true, remove @mnt from its current parent and
+ * mountpoint and mount it on @mp on @parent, and mount @parent on the
+ * old parent and old mountpoint of @mnt. Finally, attach @parent to
+ * @mnt_hashtable and @parent->mnt_parent->mnt_mounts.
+ *
+ * Note, when __attach_mnt() is called @mnt->mnt_parent already points
+ * to the correct parent.
+ *
+ * Context: This function expects namespace_lock() and lock_mount_hash()
+ * to have been acquired in that order.
  */
-static void attach_mnt(struct mount *mnt,
-			struct mount *parent,
-			struct mountpoint *mp)
+static void attach_mnt(struct mount *mnt, struct mount *parent,
+		       struct mountpoint *mp, bool beneath)
 {
-	mnt_set_mountpoint(parent, mp, mnt);
-	__attach_mnt(mnt, parent);
+	if (beneath)
+		mnt_set_mountpoint_beneath(mnt, parent, mp);
+	else
+		mnt_set_mountpoint(parent, mp, mnt);
+	/*
+	 * Note, @mnt->mnt_parent has to be used. If @mnt was mounted
+	 * beneath @parent then @mnt will need to be attached to
+	 * @parent's old parent, not @parent. IOW, @mnt->mnt_parent
+	 * isn't the same mount as @parent.
+	 */
+	__attach_mnt(mnt, mnt->mnt_parent);
 }
 
 void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
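
The re-parenting described in the attach_mnt() comment can be illustrated with a userspace toy. This is a sketch only: `struct toy_mount` and `toy_attach()` are hypothetical and model nothing but the parent link, ignoring mountpoints, hashing and locking.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct mount: only the parent link is modelled. */
struct toy_mount {
	const char *name;
	struct toy_mount *parent;
};

/*
 * Mirror of the attach_mnt() decision: attach @mnt on top of @parent, or,
 * if @beneath is set, between @parent and @parent's old parent.
 */
static void toy_attach(struct toy_mount *mnt, struct toy_mount *parent,
		       int beneath)
{
	if (beneath) {
		mnt->parent = parent->parent;	/* take over the old slot */
		parent->parent = mnt;		/* former top sits on @mnt */
	} else {
		mnt->parent = parent;
	}
}

/* After a beneath attach, @mnt's parent is not the mount that was passed in. */
static void beneath_example(void)
{
	struct toy_mount host = { "host", NULL };
	struct toy_mount top = { "top", &host };
	struct toy_mount new_mnt = { "new", NULL };

	toy_attach(&new_mnt, &top, 1);
	assert(new_mnt.parent == &host);
	assert(top.parent == &new_mnt);
}
```

This is why the real function passes `mnt->mnt_parent` rather than `parent` to __attach_mnt(): in the beneath case the two differ.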
@@ -944,7 +1014,7 @@ void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct m
 	hlist_del_init(&mnt->mnt_mp_list);
 	hlist_del_init_rcu(&mnt->mnt_hash);
 
-	attach_mnt(mnt, parent, mp);
+	attach_mnt(mnt, parent, mp, false);
 
 	put_mountpoint(old_mp);
 	mnt_add_count(old_parent, -1);
@@ -1774,6 +1844,19 @@ bool may_mount(void)
 	return ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN);
 }
 
+/**
+ * path_mounted - check whether path is mounted
+ * @path: path to check
+ *
+ * Determine whether @path refers to the root of a mount.
+ *
+ * Return: true if @path is the root of a mount, false if not.
+ */
+static inline bool path_mounted(const struct path *path)
+{
+	return path->mnt->mnt_root == path->dentry;
+}
+
 static void warn_mandlock(void)
 {
 	pr_warn_once("=======================================================\n"
@@ -1789,7 +1872,7 @@ static int can_umount(const struct path *path, int flags)
 
 	if (!may_mount())
 		return -EPERM;
-	if (path->dentry != path->mnt->mnt_root)
+	if (!path_mounted(path))
 		return -EINVAL;
 	if (!check_mnt(mnt))
 		return -EINVAL;
@@ -1932,7 +2015,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
 					goto out;
 				lock_mount_hash();
 				list_add_tail(&q->mnt_list, &res->mnt_list);
-				attach_mnt(q, parent, p->mnt_mp);
+				attach_mnt(q, parent, p->mnt_mp, false);
 				unlock_mount_hash();
 			}
 		}
@@ -2141,12 +2224,17 @@ int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
 	return 0;
 }
 
-/*
- * @source_mnt : mount tree to be attached
- * @nd : place the mount tree @source_mnt is attached
- * @parent_nd : if non-null, detach the source_mnt from its parent and
- * store the parent mount and mountpoint dentry.
- * (done when source_mnt is moved)
+enum mnt_tree_flags_t {
+	MNT_TREE_MOVE = BIT(0),
+	MNT_TREE_BENEATH = BIT(1),
+};
+
+/**
+ * attach_recursive_mnt - attach a source mount tree
+ * @source_mnt: mount tree to be attached
+ * @top_mnt: mount that @source_mnt will be mounted on or mounted beneath
+ * @dest_mp: the mountpoint @source_mnt will be mounted at
+ * @flags: modify how @source_mnt is supposed to be attached
  *
  * NOTE: in the table below explains the semantics when a source mount
  * of a given type is attached to a destination mount of a given type.
@@ -2203,22 +2291,28 @@ int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
  * applied to each mount in the tree.
  * Must be called without spinlocks held, since this function can sleep
  * in allocations.
+ *
+ * Context: The function expects namespace_lock() to be held.
+ * Return: If @source_mnt was successfully attached 0 is returned.
+ * Otherwise a negative error code is returned.
  */
 static int attach_recursive_mnt(struct mount *source_mnt,
-			struct mount *dest_mnt,
+			struct mount *top_mnt,
 			struct mountpoint *dest_mp,
-			bool moving)
+			enum mnt_tree_flags_t flags)
 {
 	struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns;
 	HLIST_HEAD(tree_list);
-	struct mnt_namespace *ns = dest_mnt->mnt_ns;
+	struct mnt_namespace *ns = top_mnt->mnt_ns;
 	struct mountpoint *smp;
-	struct mount *child, *p;
+	struct mount *child, *dest_mnt, *p;
 	struct hlist_node *n;
-	int err;
+	int err = 0;
+	bool moving = flags & MNT_TREE_MOVE, beneath = flags & MNT_TREE_BENEATH;
 
-	/* Preallocate a mountpoint in case the new mounts need
-	 * to be tucked under other mounts.
+	/*
+	 * Preallocate a mountpoint in case the new mounts need to be
+	 * mounted beneath mounts on the same mountpoint.
 	 */
 	smp = get_mountpoint(source_mnt->mnt.mnt_root);
 	if (IS_ERR(smp))
...@@ -2231,29 +2325,41 @@ static int attach_recursive_mnt(struct mount *source_mnt, ...@@ -2231,29 +2325,41 @@ static int attach_recursive_mnt(struct mount *source_mnt,
goto out; goto out;
} }
if (beneath)
dest_mnt = top_mnt->mnt_parent;
else
dest_mnt = top_mnt;
if (IS_MNT_SHARED(dest_mnt)) { if (IS_MNT_SHARED(dest_mnt)) {
err = invent_group_ids(source_mnt, true); err = invent_group_ids(source_mnt, true);
if (err) if (err)
goto out; goto out;
err = propagate_mnt(dest_mnt, dest_mp, source_mnt, &tree_list); err = propagate_mnt(dest_mnt, dest_mp, source_mnt, &tree_list);
lock_mount_hash(); }
if (err) lock_mount_hash();
goto out_cleanup_ids; if (err)
goto out_cleanup_ids;
if (IS_MNT_SHARED(dest_mnt)) {
for (p = source_mnt; p; p = next_mnt(p, source_mnt)) for (p = source_mnt; p; p = next_mnt(p, source_mnt))
set_mnt_shared(p); set_mnt_shared(p);
} else {
lock_mount_hash();
} }
if (moving) { if (moving) {
if (beneath)
dest_mp = smp;
unhash_mnt(source_mnt); unhash_mnt(source_mnt);
attach_mnt(source_mnt, dest_mnt, dest_mp); attach_mnt(source_mnt, top_mnt, dest_mp, beneath);
touch_mnt_namespace(source_mnt->mnt_ns); touch_mnt_namespace(source_mnt->mnt_ns);
} else { } else {
if (source_mnt->mnt_ns) { if (source_mnt->mnt_ns) {
/* move from anon - the caller will destroy */ /* move from anon - the caller will destroy */
list_del_init(&source_mnt->mnt_ns->list); list_del_init(&source_mnt->mnt_ns->list);
} }
mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt); if (beneath)
mnt_set_mountpoint_beneath(source_mnt, top_mnt, smp);
else
mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
commit_tree(source_mnt); commit_tree(source_mnt);
} }
...@@ -2293,33 +2399,101 @@ static int attach_recursive_mnt(struct mount *source_mnt, ...@@ -2293,33 +2399,101 @@ static int attach_recursive_mnt(struct mount *source_mnt,
return err; return err;
} }
static struct mountpoint *lock_mount(struct path *path) /**
* do_lock_mount - lock mount and mountpoint
* @path: target path
* @beneath: whether the intention is to mount beneath @path
*
* Follow the mount stack on @path until the top mount @mnt is found. If
* the initial @path->{mnt,dentry} is a mountpoint lookup the first
* mount stacked on top of it. Then simply follow @{mnt,mnt->mnt_root}
* until nothing is stacked on top of it anymore.
*
* Acquire the inode_lock() on the top mount's ->mnt_root to protect
* against concurrent removal of the new mountpoint from another mount
* namespace.
*
* If @beneath is requested, acquire inode_lock() on @mnt's mountpoint
* @mp on @mnt->mnt_parent must be acquired. This protects against a
* concurrent unlink of @mp->mnt_dentry from another mount namespace
* where @mnt doesn't have a child mount mounted @mp. A concurrent
* removal of @mnt->mnt_root doesn't matter as nothing will be mounted
* on top of it for @beneath.
*
* In addition, @beneath needs to make sure that @mnt hasn't been
* unmounted or moved from its current mountpoint in between dropping
* @mount_lock and acquiring @namespace_sem. For the !@beneath case @mnt
* being unmounted would be detected later by e.g., calling
* check_mnt(mnt) in the function it's called from. For the @beneath
* case however, it's useful to detect it directly in do_lock_mount().
* If @mnt hasn't been unmounted then @mnt->mnt_mountpoint still points
* to @mnt->mnt_mp->m_dentry. But if @mnt has been unmounted it will
* point to @mnt->mnt_root and @mnt->mnt_mp will be NULL.
*
* Return: Either the target mountpoint on the top mount or the top
* mount's mountpoint.
*/
static struct mountpoint *do_lock_mount(struct path *path, bool beneath)
{ {
struct vfsmount *mnt; struct vfsmount *mnt = path->mnt;
struct dentry *dentry = path->dentry; struct dentry *dentry;
retry: struct mountpoint *mp = ERR_PTR(-ENOENT);
inode_lock(dentry->d_inode);
if (unlikely(cant_mount(dentry))) { for (;;) {
inode_unlock(dentry->d_inode); struct mount *m;
return ERR_PTR(-ENOENT);
} if (beneath) {
namespace_lock(); m = real_mount(mnt);
mnt = lookup_mnt(path); read_seqlock_excl(&mount_lock);
if (likely(!mnt)) { dentry = dget(m->mnt_mountpoint);
struct mountpoint *mp = get_mountpoint(dentry); read_sequnlock_excl(&mount_lock);
if (IS_ERR(mp)) { } else {
dentry = path->dentry;
}
inode_lock(dentry->d_inode);
if (unlikely(cant_mount(dentry))) {
inode_unlock(dentry->d_inode);
goto out;
}
namespace_lock();
if (beneath && (!is_mounted(mnt) || m->mnt_mountpoint != dentry)) {
namespace_unlock(); namespace_unlock();
inode_unlock(dentry->d_inode); inode_unlock(dentry->d_inode);
return mp; goto out;
} }
return mp;
mnt = lookup_mnt(path);
if (likely(!mnt))
break;
namespace_unlock();
inode_unlock(dentry->d_inode);
if (beneath)
dput(dentry);
path_put(path);
path->mnt = mnt;
path->dentry = dget(mnt->mnt_root);
} }
namespace_unlock();
inode_unlock(path->dentry->d_inode); mp = get_mountpoint(dentry);
path_put(path); if (IS_ERR(mp)) {
path->mnt = mnt; namespace_unlock();
dentry = path->dentry = dget(mnt->mnt_root); inode_unlock(dentry->d_inode);
goto retry; }
out:
if (beneath)
dput(dentry);
return mp;
}
static inline struct mountpoint *lock_mount(struct path *path)
{
return do_lock_mount(path, false);
} }
static void unlock_mount(struct mountpoint *where) static void unlock_mount(struct mountpoint *where)
@@ -2343,7 +2517,7 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
 	      d_is_dir(mnt->mnt.mnt_root))
 		return -ENOTDIR;
 
-	return attach_recursive_mnt(mnt, p, mp, false);
+	return attach_recursive_mnt(mnt, p, mp, 0);
 }
 
 /*
@@ -2374,7 +2548,7 @@ static int do_change_type(struct path *path, int ms_flags)
 	int type;
 	int err = 0;
 
-	if (path->dentry != path->mnt->mnt_root)
+	if (!path_mounted(path))
 		return -EINVAL;
 
 	type = flags_to_propagation_type(ms_flags);
@@ -2650,7 +2824,7 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
 	if (!check_mnt(mnt))
 		return -EINVAL;
 
-	if (path->dentry != mnt->mnt.mnt_root)
+	if (!path_mounted(path))
 		return -EINVAL;
 
 	if (!can_change_locked_flags(mnt, mnt_flags))
@@ -2689,7 +2863,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
 	if (!check_mnt(mnt))
 		return -EINVAL;
 
-	if (path->dentry != path->mnt->mnt_root)
+	if (!path_mounted(path))
 		return -EINVAL;
 
 	if (!can_change_locked_flags(mnt, mnt_flags))
@@ -2779,9 +2953,9 @@ static int do_set_group(struct path *from_path, struct path *to_path)
 
 	err = -EINVAL;
 	/* To and From paths should be mount roots */
-	if (from_path->dentry != from_path->mnt->mnt_root)
+	if (!path_mounted(from_path))
 		goto out;
-	if (to_path->dentry != to_path->mnt->mnt_root)
+	if (!path_mounted(to_path))
 		goto out;
 
 	/* Setting sharing groups is only allowed across same superblock */
@@ -2825,7 +2999,110 @@ static int do_set_group(struct path *from_path, struct path *to_path)
 	return err;
 }
 
-static int do_move_mount(struct path *old_path, struct path *new_path)
+/**
+ * path_overmounted - check if path is overmounted
+ * @path: path to check
+ *
+ * Check if path is overmounted, i.e., if there's a mount on top of
+ * @path->mnt with @path->dentry as mountpoint.
+ *
+ * Context: This function expects namespace_lock() to be held.
+ * Return: If path is overmounted true is returned, false if not.
+ */
+static inline bool path_overmounted(const struct path *path)
+{
+	rcu_read_lock();
+	if (unlikely(__lookup_mnt(path->mnt, path->dentry))) {
+		rcu_read_unlock();
+		return true;
+	}
+	rcu_read_unlock();
+	return false;
+}
+
+/**
+ * can_move_mount_beneath - check that we can mount beneath the top mount
+ * @from: mount to mount beneath
+ * @to: mount under which to mount
+ *
+ * - Make sure that @to->dentry is actually the root of a mount under
+ * which we can mount another mount.
+ * - Make sure that nothing can be mounted beneath the caller's current
+ * root or the rootfs of the namespace.
+ * - Make sure that the caller can unmount the topmost mount ensuring
+ * that the caller could reveal the underlying mountpoint.
+ * - Ensure that nothing has been mounted on top of @from before we
+ * grabbed @namespace_sem to avoid creating pointless shadow mounts.
+ * - Prevent mounting beneath a mount if the propagation relationship
+ * between the source mount, parent mount, and top mount would lead to
+ * nonsensical mount trees.
+ *
+ * Context: This function expects namespace_lock() to be held.
+ * Return: On success 0, and on error a negative error code is returned.
+ */
+static int can_move_mount_beneath(const struct path *from,
+				  const struct path *to,
+				  const struct mountpoint *mp)
+{
+	struct mount *mnt_from = real_mount(from->mnt),
+		     *mnt_to = real_mount(to->mnt),
+		     *parent_mnt_to = mnt_to->mnt_parent;
+
+	if (!mnt_has_parent(mnt_to))
+		return -EINVAL;
+
+	if (!path_mounted(to))
+		return -EINVAL;
+
+	if (IS_MNT_LOCKED(mnt_to))
+		return -EINVAL;
+
+	/* Avoid creating shadow mounts during mount propagation. */
+	if (path_overmounted(from))
+		return -EINVAL;
+
+	/*
+	 * Mounting beneath the rootfs only makes sense when the
+	 * semantics of pivot_root(".", ".") are used.
+	 */
+	if (&mnt_to->mnt == current->fs->root.mnt)
+		return -EINVAL;
+	if (parent_mnt_to == current->nsproxy->mnt_ns->root)
+		return -EINVAL;
+
+	for (struct mount *p = mnt_from; mnt_has_parent(p); p = p->mnt_parent)
+		if (p == mnt_to)
+			return -EINVAL;
+
+	/*
+	 * If the parent mount propagates to the child mount this would
+	 * mean mounting @mnt_from on @mnt_to->mnt_parent and then
+	 * propagating a copy @c of @mnt_from on top of @mnt_to. This
+	 * defeats the whole purpose of mounting beneath another mount.
+	 */
+	if (propagation_would_overmount(parent_mnt_to, mnt_to, mp))
+		return -EINVAL;
+
+	/*
+	 * If @mnt_to->mnt_parent propagates to @mnt_from this would
+	 * mean propagating a copy @c of @mnt_from on top of @mnt_from.
+	 * Afterwards @mnt_from would be mounted on top of
+	 * @mnt_to->mnt_parent and @mnt_to would be unmounted from
+	 * @mnt->mnt_parent and remounted on @mnt_from. But since @c is
+	 * already mounted on @mnt_from, @mnt_to would ultimately be
+	 * remounted on top of @c. Afterwards, @mnt_from would be
+	 * covered by a copy @c of @mnt_from and @c would be covered by
+	 * @mnt_from itself. This defeats the whole purpose of mounting
+	 * @mnt_from beneath @mnt_to.
+	 */
+	if (propagation_would_overmount(parent_mnt_to, mnt_from, mp))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int do_move_mount(struct path *old_path, struct path *new_path,
+			 bool beneath)
 {
 	struct mnt_namespace *ns;
 	struct mount *p;
@@ -2834,8 +3111,9 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	struct mountpoint *mp, *old_mp;
 	int err;
 	bool attached;
+	enum mnt_tree_flags_t flags = 0;
 
-	mp = lock_mount(new_path);
+	mp = do_lock_mount(new_path, beneath);
 	if (IS_ERR(mp))
 		return PTR_ERR(mp);
 
@@ -2843,6 +3121,8 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	p = real_mount(new_path->mnt);
 	parent = old->mnt_parent;
 	attached = mnt_has_parent(old);
+	if (attached)
+		flags |= MNT_TREE_MOVE;
 	old_mp = old->mnt_mp;
 	ns = old->mnt_ns;
 
@@ -2862,7 +3142,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	if (old->mnt.mnt_flags & MNT_LOCKED)
 		goto out;
 
-	if (old_path->dentry != old_path->mnt->mnt_root)
+	if (!path_mounted(old_path))
 		goto out;
 
 	if (d_is_dir(new_path->dentry) !=
@@ -2873,6 +3153,17 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	 */
 	if (attached && IS_MNT_SHARED(parent))
 		goto out;
+
+	if (beneath) {
+		err = can_move_mount_beneath(old_path, new_path, mp);
+		if (err)
+			goto out;
+
+		err = -EINVAL;
+		p = p->mnt_parent;
+		flags |= MNT_TREE_BENEATH;
+	}
+
 	/*
 	 * Don't move a mount tree containing unbindable mounts to a destination
 	 * mount which is shared.
@@ -2886,8 +3177,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 		if (p == old)
 			goto out;
 
-	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
-				   attached);
+	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp, flags);
 	if (err)
 		goto out;
 
@@ -2919,7 +3209,7 @@ static int do_move_mount_old(struct path *path, const char *old_name)
 	if (err)
 		return err;
 
-	err = do_move_mount(&old_path, path);
+	err = do_move_mount(&old_path, path, false);
 	path_put(&old_path);
 	return err;
 }
@@ -2944,8 +3234,7 @@ static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
 	}
 	/* Refuse the same filesystem on the same mount point */
-	if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
-	    path->mnt->mnt_root == path->dentry)
+	if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && path_mounted(path))
 		return -EBUSY;
 	if (d_is_symlink(newmnt->mnt.mnt_root))
@@ -3086,13 +3375,10 @@ int finish_automount(struct vfsmount *m, const struct path *path)
 		err = -ENOENT;
 		goto discard_locked;
 	}
-	rcu_read_lock();
-	if (unlikely(__lookup_mnt(path->mnt, dentry))) {
-		rcu_read_unlock();
+	if (path_overmounted(path)) {
 		err = 0;
 		goto discard_locked;
 	}
-	rcu_read_unlock();
 	mp = get_mountpoint(dentry);
 	if (IS_ERR(mp)) {
 		err = PTR_ERR(mp);
@@ -3784,6 +4070,10 @@ SYSCALL_DEFINE5(move_mount,
 	if (flags & ~MOVE_MOUNT__MASK)
 		return -EINVAL;
+	if ((flags & (MOVE_MOUNT_BENEATH | MOVE_MOUNT_SET_GROUP)) ==
+	    (MOVE_MOUNT_BENEATH | MOVE_MOUNT_SET_GROUP))
+		return -EINVAL;
 	/* If someone gives a pathname, they aren't permitted to move
 	 * from an fd that requires unmount as we can't get at the flag
 	 * to clear it afterwards.
@@ -3813,7 +4103,8 @@ SYSCALL_DEFINE5(move_mount,
 	if (flags & MOVE_MOUNT_SET_GROUP)
 		ret = do_set_group(&from_path, &to_path);
 	else
-		ret = do_move_mount(&from_path, &to_path);
+		ret = do_move_mount(&from_path, &to_path,
+				    (flags & MOVE_MOUNT_BENEATH));
 out_to:
 	path_put(&to_path);
@@ -3924,11 +4215,11 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 	if (new_mnt == root_mnt || old_mnt == root_mnt)
 		goto out4; /* loop, on the same file system */
 	error = -EINVAL;
-	if (root.mnt->mnt_root != root.dentry)
+	if (!path_mounted(&root))
 		goto out4; /* not a mountpoint */
 	if (!mnt_has_parent(root_mnt))
 		goto out4; /* not attached */
-	if (new.mnt->mnt_root != new.dentry)
+	if (!path_mounted(&new))
 		goto out4; /* not a mountpoint */
 	if (!mnt_has_parent(new_mnt))
 		goto out4; /* not attached */
@@ -3946,9 +4237,9 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 		root_mnt->mnt.mnt_flags &= ~MNT_LOCKED;
 	}
 	/* mount old root on put_old */
-	attach_mnt(root_mnt, old_mnt, old_mp);
+	attach_mnt(root_mnt, old_mnt, old_mp, false);
 	/* mount new_root on / */
-	attach_mnt(new_mnt, root_parent, root_mp);
+	attach_mnt(new_mnt, root_parent, root_mp, false);
 	mnt_add_count(root_parent, -1);
 	touch_mnt_namespace(current->nsproxy->mnt_ns);
 	/* A moved mount should not expire automatically */
@@ -4131,7 +4422,7 @@ static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
 	struct mount *mnt = real_mount(path->mnt);
 	int err = 0;
-	if (path->dentry != mnt->mnt.mnt_root)
+	if (!path_mounted(path))
 		return -EINVAL;
 	if (kattr->mnt_userns) {
......
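Several hunks above replace an open-coded comparison of a path's dentry with its mount's root by a call to path_mounted(). The helper's definition lies outside this excerpt; judging only from the expressions it replaces, it can be modelled in userspace roughly as follows. The model_* types and names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
struct model_dentry { int id; };
struct model_vfsmount { const struct model_dentry *mnt_root; };
struct model_path {
	const struct model_vfsmount *mnt;
	const struct model_dentry *dentry;
};

/*
 * Assumed semantics, inferred from the replaced expressions:
 * true when the path refers to the root dentry of its mount.
 */
static bool model_path_mounted(const struct model_path *path)
{
	assert(path && path->mnt);
	return path->mnt->mnt_root == path->dentry;
}
```

Under this reading, `!model_path_mounted(p)` is the same test as the removed `p->dentry != p->mnt->mnt_root`, so the converted call sites should behave as before.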
@@ -216,7 +216,7 @@ static struct mount *next_group(struct mount *m, struct mount *origin)
 static struct mount *last_dest, *first_source, *last_source, *dest_master;
 static struct hlist_head *list;
-static inline bool peers(struct mount *m1, struct mount *m2)
+static inline bool peers(const struct mount *m1, const struct mount *m2)
 {
 	return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
 }
@@ -354,6 +354,46 @@ static inline int do_refcount_check(struct mount *mnt, int count)
 	return mnt_get_count(mnt) > count;
 }
+/**
+ * propagation_would_overmount - check whether propagation from @from
+ *                               would overmount @to
+ * @from: shared mount
+ * @to:   mount to check
+ * @mp:   future mountpoint of @to on @from
+ *
+ * If @from propagates mounts to @to, @from and @to must either be peers
+ * or one of the masters in the hierarchy of masters of @to must be a
+ * peer of @from.
+ *
+ * If the root of the @to mount is equal to the future mountpoint @mp of
+ * the @to mount on @from then @to will be overmounted by whatever is
+ * propagated to it.
+ *
+ * Context: This function expects namespace_lock() to be held and that
+ *          @mp is stable.
+ * Return: If @from overmounts @to, true is returned, false if not.
+ */
+bool propagation_would_overmount(const struct mount *from,
+				 const struct mount *to,
+				 const struct mountpoint *mp)
+{
+	if (!IS_MNT_SHARED(from))
+		return false;
+	if (IS_MNT_NEW(to))
+		return false;
+	if (to->mnt.mnt_root != mp->m_dentry)
+		return false;
+	for (const struct mount *m = to; m; m = m->mnt_master) {
+		if (peers(from, m))
+			return true;
+	}
+	return false;
+}
 /*
  * check if the mount 'mnt' can be unmounted successfully.
  * @mnt: the mount to be checked for unmount
......
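The kernel-doc for propagation_would_overmount() describes a walk up the chain of masters of @to looking for a peer of @from. A minimal userspace sketch of that walk may make the rule easier to experiment with. Everything named model_* is a hypothetical stand-in: the struct fields only approximate what IS_MNT_SHARED(), IS_MNT_NEW(), mnt.mnt_root and mnt_master provide in the kernel, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct mount. */
struct model_mount {
	int group_id;                     /* 0 means "not in a peer group" */
	bool shared;                      /* stands in for IS_MNT_SHARED() */
	bool is_new;                      /* stands in for IS_MNT_NEW() */
	const void *root;                 /* stands in for mnt.mnt_root */
	const struct model_mount *master; /* stands in for mnt_master */
};

/* Same rule as peers() in the diff: equal, non-zero group ids. */
static bool model_peers(const struct model_mount *a,
			const struct model_mount *b)
{
	return a->group_id == b->group_id && a->group_id != 0;
}

/* Mirrors the checks of propagation_would_overmount() on the model. */
static bool model_would_overmount(const struct model_mount *from,
				  const struct model_mount *to,
				  const void *mp_dentry)
{
	assert(from && to);
	if (!from->shared)
		return false;
	if (to->is_new)
		return false;
	if (to->root != mp_dentry)
		return false;
	/* Walk @to and its masters looking for a peer of @from. */
	for (const struct model_mount *m = to; m; m = m->master) {
		if (model_peers(from, m))
			return true;
	}
	return false;
}
```

In this model a mount that is not itself a peer of `from` still counts as overmounted when one of its masters shares `from`'s group id, which is the second case the kernel-doc comment mentions.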
@@ -53,4 +53,7 @@ struct mount *copy_tree(struct mount *, struct dentry *, int);
 bool is_path_reachable(struct mount *, struct dentry *,
 			 const struct path *root);
 int count_mounts(struct mnt_namespace *ns, struct mount *mnt);
+bool propagation_would_overmount(const struct mount *from,
+				 const struct mount *to,
+				 const struct mountpoint *mp);
 #endif /* _LINUX_PNODE_H */
@@ -74,7 +74,8 @@
 #define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
 #define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
 #define MOVE_MOUNT_SET_GROUP		0x00000100 /* Set sharing group instead */
-#define MOVE_MOUNT__MASK		0x00000177
+#define MOVE_MOUNT_BENEATH		0x00000200 /* Mount beneath top mount */
+#define MOVE_MOUNT__MASK		0x00000377
 /*
  * fsopen() flags.
......
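The new flag bit, the widened mask, and the mutual-exclusion check added to the move_mount() syscall can be exercised together in a small userspace model. This is only a sketch: the MODEL_* constants copy the values from the header hunk, and model_check_move_mount_flags() is a hypothetical name that reproduces just the two flag checks visible in the diff, nothing else the syscall does.

```c
#include <assert.h>
#include <errno.h>

/* Values copied from the header hunk; the MODEL_ prefix is ours. */
#define MODEL_MOVE_MOUNT_SET_GROUP 0x00000100u
#define MODEL_MOVE_MOUNT_BENEATH   0x00000200u
#define MODEL_MOVE_MOUNT__MASK     0x00000377u

/* The old mask (0x177) would have rejected the new bit outright. */
static_assert((MODEL_MOVE_MOUNT__MASK & MODEL_MOVE_MOUNT_BENEATH) != 0,
	      "mask must cover the new flag");

/* Returns 0 if the flag word passes both checks, -EINVAL otherwise. */
static int model_check_move_mount_flags(unsigned int flags)
{
	const unsigned int exclusive =
		MODEL_MOVE_MOUNT_BENEATH | MODEL_MOVE_MOUNT_SET_GROUP;

	/* Unknown bits are rejected. */
	if (flags & ~MODEL_MOVE_MOUNT__MASK)
		return -EINVAL;
	/* BENEATH and SET_GROUP may not be combined. */
	if ((flags & exclusive) == exclusive)
		return -EINVAL;
	return 0;
}
```

Either flag alone passes the model; setting both, or setting any bit outside the mask, yields -EINVAL.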