• Christian Brauner's avatar
    fs: add FSCONFIG_CMD_CREATE_EXCL · 22ed7ecd
    Christian Brauner authored
    Summary
    =======
    
    This introduces FSCONFIG_CMD_CREATE_EXCL which will allows userspace to
    implement something like mount -t ext4 --exclusive /dev/sda /B which
    fails if a superblock for the requested filesystem does already exist:
    
    Before this patch
    -----------------
    
    $ sudo ./move-mount -f xfs -o source=/dev/sda4 /A
    Requesting filesystem type xfs
    Mount options requested: source=/dev/sda4
    Attaching mount at /A
    Moving single attached mount
    Setting key(source) with val(/dev/sda4)
    
    $ sudo ./move-mount -f xfs -o source=/dev/sda4 /B
    Requesting filesystem type xfs
    Mount options requested: source=/dev/sda4
    Attaching mount at /B
    Moving single attached mount
    Setting key(source) with val(/dev/sda4)
    
    After this patch with --exclusive as a switch for FSCONFIG_CMD_CREATE_EXCL
    --------------------------------------------------------------------------
    
    $ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /A
    Requesting filesystem type xfs
    Request exclusive superblock creation
    Mount options requested: source=/dev/sda4
    Attaching mount at /A
    Moving single attached mount
    Setting key(source) with val(/dev/sda4)
    
    $ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /B
    Requesting filesystem type xfs
    Request exclusive superblock creation
    Mount options requested: source=/dev/sda4
    Attaching mount at /B
    Moving single attached mount
    Setting key(source) with val(/dev/sda4)
    Device or resource busy | move-mount.c: 300: do_fsconfig: i xfs: reusing existing filesystem not allowed
    
    Details
    =======
    
    As mentioned on the list (cf. [1]-[3]) mount requests like
    mount -t ext4 /dev/sda /A are ambigous for userspace. Either a new
    superblock has been created and mounted or an existing superblock has
    been reused and a bind-mount has been created.
    
    This becomes clear in the following example where two processes create
    the same mount for the same block device:
    
    P1                                                              P2
    fd_fs = fsopen("ext4");                                         fd_fs = fsopen("ext4");
    fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");     fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");
    fsconfig(fd_fs, FSCONFIG_SET_STRING, "dax", "always");          fsconfig(fd_fs, FSCONFIG_SET_STRING, "resuid", "1000");
    
    // wins and creates superblock
    fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)
                                                                    // finds compatible superblock of P1
                                                                    // spins until P1 sets SB_BORN and grabs a reference
                                                                    fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)
    
    fd_mnt1 = fsmount(fd_fs);                                       fd_mnt2 = fsmount(fd_fs);
    move_mount(fd_mnt1, "/A")                                       move_mount(fd_mnt2, "/B")
    
    Not just does P2 get a bind-mount but the mount options that P2
    requestes are silently ignored. The VFS itself doesn't, can't and
    shouldn't enforce filesystem specific mount option compatibility. It
    only enforces incompatibility for read-only <-> read-write transitions:
    
    mount -t ext4       /dev/sda /A
    mount -t ext4 -o ro /dev/sda /B
    
    The read-only request will fail with EBUSY as the VFS can't just
    silently transition a superblock from read-write to read-only or vica
    versa without risking security issues.
    
    To userspace this silent superblock reuse can become a security issue in
    because there is currently no straightforward way for userspace to know
    that they did indeed manage to create a new superblock and didn't just
    reuse an existing one.
    
    This adds a new FSCONFIG_CMD_CREATE_EXCL command to fsconfig() that
    returns EBUSY if an existing superblock would be reused. Userspace that
    needs to be sure that it did create a new superblock with the requested
    mount options can request superblock creation using this command. If the
    command succeeds they can be sure that they did create a new superblock
    with the requested mount options.
    
    This requires the new mount api. With the old mount api it would be
    necessary to plumb this through every legacy filesystem's
    file_system_type->mount() method. If they want this feature they are
    most welcome to switch to the new mount api.
    
    Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on
    each high-level superblock creation helper:
    
    (1) get_tree_nodev()
    
        Always allocate new superblock. Hence, FSCONFIG_CMD_CREATE and
        FSCONFIG_CMD_CREATE_EXCL are equivalent.
    
        The binderfs or overlayfs filesystems are examples.
    
    (4) get_tree_keyed()
    
        Finds an existing superblock based on sb->s_fs_info. Hence,
        FSCONFIG_CMD_CREATE would reuse an existing superblock whereas
        FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY.
    
        The mqueue or nfsd filesystems are examples.
    
    (2) get_tree_bdev()
    
        This effectively works like get_tree_keyed().
    
        The ext4 or xfs filesystems are examples.
    
    (3) get_tree_single()
    
        Only one superblock of this filesystem type can ever exist.
        Hence, FSCONFIG_CMD_CREATE would reuse an existing superblock
        whereas FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY.
    
        The securityfs or configfs filesystems are examples.
    
        Note that some single-instance filesystems never destroy the
        superblock once it has been created during the first mount. For
        example, if securityfs has been mounted at least onces then the
        created superblock will never be destroyed again as long as there is
        still an LSM making use it. Consequently, even if securityfs is
        unmounted and the superblock seemingly destroyed it really isn't
        which means that FSCONFIG_CMD_CREATE_EXCL will continue rejecting
        reusing an existing superblock.
    
        This is acceptable thugh since special purpose filesystems such as
        this shouldn't have a need to use FSCONFIG_CMD_CREATE_EXCL anyway
        and if they do it's probably to make sure that mount options aren't
        ignored.
    
    Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on
    filesystems that make use of the low-level sget_fc() helper directly.
    They're all effectively variants on get_tree_keyed(), get_tree_bdev(),
    or get_tree_nodev():
    
    (5) mtd_get_sb()
    
        Similar logic to get_tree_keyed().
    
    (6) afs_get_tree()
    
        Similar logic to get_tree_keyed().
    
    (7) ceph_get_tree()
    
        Similar logic to get_tree_keyed().
    
        Already explicitly allows forcing the allocation of a new superblock
        via CEPH_OPT_NOSHARE. This turns it into get_tree_nodev().
    
    (8) fuse_get_tree_submount()
    
        Similar logic to get_tree_nodev().
    
    (9) fuse_get_tree()
    
        Forces reuse of existing FUSE superblock.
    
        Forces reuse of existing superblock if passed in file refers to an
        existing FUSE connection.
        If FSCONFIG_CMD_CREATE_EXCL is specified together with an fd
        referring to an existing FUSE connections this would cause the
        superblock reusal to fail. If reusing is the intent then
        FSCONFIG_CMD_CREATE_EXCL shouldn't be specified.
    
    (10) fuse_get_tree()
         -> get_tree_nodev()
    
        Same logic as in get_tree_nodev().
    
    (11) fuse_get_tree()
         -> get_tree_bdev()
    
        Same logic as in get_tree_bdev().
    
    (12) virtio_fs_get_tree()
    
         Same logic as get_tree_keyed().
    
    (13) gfs2_meta_get_tree()
    
         Forces reuse of existing gfs2 superblock.
    
         Mounting gfs2meta enforces that a gf2s superblock must already
         exist. If not, it will error out. Consequently, mounting gfs2meta
         with FSCONFIG_CMD_CREATE_EXCL would always fail. If reusing is the
         intent then FSCONFIG_CMD_CREATE_EXCL shouldn't be specified.
    
    (14) kernfs_get_tree()
    
         Similar logic to get_tree_keyed().
    
    (15) nfs_get_tree_common()
    
        Similar logic to get_tree_keyed().
    
        Already explicitly allows forcing the allocation of a new superblock
        via NFS_MOUNT_UNSHARED. This effectively turns it into
        get_tree_nodev().
    
    Link: [1] https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner
    Link: [2] https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner
    Link: [3] https://lore.kernel.org/linux-fsdevel/20230725-einnahmen-warnschilder-17779aec0a97@braunerReviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
    Reviewed-by: default avatarJan Kara <jack@suse.cz>
    Reviewed-by: default avatarAleksa Sarai <cyphar@cyphar.com>
    Message-Id: <20230802-vfs-super-exclusive-v2-4-95dc4e41b870@kernel.org>
    Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
    22ed7ecd
fsopen.c 11.6 KB