1. 23 Aug, 2021 40 commits
    • btrfs: allow idmapped INO_LOOKUP_USER ioctl · 6623d9a0
      Christian Brauner authored
      INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl.
      The main difference between the two is that INO_LOOKUP is a
      filesystem wide operation whereas INO_LOOKUP_USER is scoped beneath
      the file descriptor passed with the ioctl. Specifically,
      INO_LOOKUP_USER must adhere to the following restrictions:
      
      - The caller must be privileged over each inode of each path component
        of the path they are trying to look up.
      
      - The path of the subvolume the caller is trying to look up must be
        reachable from the inode associated with the file descriptor passed
        with the ioctl.
      
      The second condition makes it possible to scope the lookup of the path
      to the mount identified by the file descriptor passed with the ioctl.
      This allows us to enable this ioctl on idmapped mounts.
      
      Specifically, this is possible because all child subvolumes of a parent
      subvolume are reachable when the parent subvolume is mounted. So a user
      who was able to open the parent subvolume, or who has been given the
      fd, can look up the path, provided they are privileged over each path
      component.
      
      Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name
      of a subvolume even though they would otherwise be restricted from doing
      so via regular VFS-based lookup.
      
      So think about a parent subvolume with multiple child subvolumes.
      Someone could mount the parent subvolume and restrict access to the child
      subvolumes by overmounting them with empty directories. At this point
      the user can't traverse the child subvolumes and they can't open files
      in the child subvolumes.  However, they can still learn the path of
      child subvolumes as long as they have access to the parent subvolume by
      using the INO_LOOKUP_USER ioctl.
      
      The underlying assumption here is that it's ok that the lookup ioctls
      can't really take mounts into account other than the original mount the
      fd belongs to during lookup. Since this assumption is baked into the
      original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts.
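As a rough userspace illustration of the scoping described above, the sketch below issues the ioctl against a directory fd. The struct and ioctl names come from the `linux/btrfs.h` UAPI header; the helper name `lookup_subvol_path` is hypothetical, and the assumption that `args.path` is either empty or ends in `/` is labeled in the code.

```c
#include <errno.h>
#include <linux/btrfs.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: resolve the name of subvolume `treeid` and the path
 * leading to it, relative to the directory behind `fd`. `dirid` is the inode
 * number of the directory that contains the subvolume.
 * Returns 0 on success, -errno on failure.
 */
static int lookup_subvol_path(int fd, uint64_t treeid, uint64_t dirid,
			      char *out, size_t outlen)
{
	struct btrfs_ioctl_ino_lookup_user_args args;

	memset(&args, 0, sizeof(args));
	args.treeid = treeid;
	args.dirid = dirid;

	if (ioctl(fd, BTRFS_IOC_INO_LOOKUP_USER, &args) < 0)
		return -errno;

	/*
	 * Assumption: args.path is empty or ends in '/', so it can be
	 * joined with the subvolume name directly.
	 */
	snprintf(out, outlen, "%s%s", args.path, args.name);
	return 0;
}
```

The lookup is relative to `fd`, so the same call through a different mount of the same filesystem may fail even though the subvolume exists.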
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped SUBVOL_SETFLAGS ioctl · 39e1674f
      Christian Brauner authored
      Setting flags on subvolumes or snapshots is a core feature of btrfs. The
      SUBVOL_SETFLAGS ioctl is especially important as it allows making
      subvolumes and snapshots read-only or read-write. Allow setting flags on
      btrfs subvolumes and snapshots on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
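For readers unfamiliar with the ioctl, here is a minimal userspace sketch of toggling the read-only flag. `BTRFS_IOC_SUBVOL_GETFLAGS`, `BTRFS_IOC_SUBVOL_SETFLAGS` and `BTRFS_SUBVOL_RDONLY` come from `linux/btrfs.h`; the helper name `set_subvol_readonly` is hypothetical.

```c
#include <errno.h>
#include <linux/btrfs.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: set or clear the read-only flag of the subvolume
 * behind `fd`. Returns 0 on success, -errno on failure.
 */
static int set_subvol_readonly(int fd, bool readonly)
{
	uint64_t flags;

	/* Read the current flags so unrelated bits are preserved. */
	if (ioctl(fd, BTRFS_IOC_SUBVOL_GETFLAGS, &flags) < 0)
		return -errno;

	if (readonly)
		flags |= BTRFS_SUBVOL_RDONLY;
	else
		flags &= ~BTRFS_SUBVOL_RDONLY;

	if (ioctl(fd, BTRFS_IOC_SUBVOL_SETFLAGS, &flags) < 0)
		return -errno;

	return 0;
}
```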
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls · e4fed17a
      Christian Brauner authored
      The SET_RECEIVED_SUBVOL ioctls are used to set information about
      a received subvolume. Make it possible to set information about a
      received subvolume on idmapped mounts. This is a fairly straightforward
      operation since all the permission checking helpers are already capable
      of handling idmapped mounts. So we just need to pass down the mount's
      userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids · aabb34e7
      Christian Brauner authored
      So far we prevented the deletion of subvolumes and snapshots by
      subvolume id, which is requested with the BTRFS_SUBVOL_SPEC_BY_ID flag.
      
      This restriction is necessary on idmapped mounts because deletion by id
      is a filesystem wide operation and can thus escape the scope of what's
      exposed under the mount identified by the fd passed with the ioctl.
      
      Deletion by subvolume id works by looking for an alias of the parent of
      the subvolume or snapshot to be deleted. The parent alias can be
      anywhere in the filesystem. However, as long as the alias of the parent
      that is found is the same as the one identified by the file descriptor
      passed through the ioctl we can allow the deletion.
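The relaxed rule can be modeled in a few lines of userspace C. This is only a sketch of the decision described above, not the kernel code: `struct dir_ref`, `same_dir()` and `may_delete_by_id()` are hypothetical names, and the error value is chosen for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a directory inode; not a kernel type. */
struct dir_ref {
	uint64_t root_id;	/* subvolume tree the directory lives in */
	uint64_t ino;		/* inode number within that tree */
};

static bool same_dir(const struct dir_ref *a, const struct dir_ref *b)
{
	return a->root_id == b->root_id && a->ino == b->ino;
}

/*
 * fd_dir:       directory the ioctl's file descriptor refers to
 * found_parent: parent directory found for the given subvolume id
 * idmapped:     whether the fd's mount carries an idmapping
 *
 * On an idmapped mount, deletion by id is only allowed when the parent
 * that was found is the directory identified by the fd.
 */
static int may_delete_by_id(const struct dir_ref *fd_dir,
			    const struct dir_ref *found_parent,
			    bool idmapped)
{
	if (idmapped && !same_dir(fd_dir, found_parent))
		return -EOPNOTSUPP;	/* illustrative error value */

	return 0;
}
```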
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped SNAP_DESTROY ioctls · c4ed533b
      Christian Brauner authored
      Destroying subvolumes and snapshots is an important feature of btrfs.
      Both operations are available to unprivileged users if the filesystem
      has been mounted with the "user_subvol_rm_allowed" mount option. Allow
      subvolume and snapshot deletion on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      
      Subvolumes and snapshots can either be deleted by specifying their name
      or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or
      snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set.
      
      This feature is blocked on idmapped mounts as this allows filesystem
      wide subvolume deletions and thus can escape the scope of what's exposed
      under the mount identified by the fd passed with the ioctl.
      
      This means that even the root or CAP_SYS_ADMIN capable user can't delete
      a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional.
      
      The root user is currently already subject to permission checks in
      btrfs_may_delete() including whether the inode's i_uid/i_gid of the
      directory the subvolume is located in have a mapping in the caller's
      idmapping. For this to fail isn't currently possible since a btrfs
      filesystem can't be mounted with a non-initial idmapping but it shows
      that even the root user would fail to delete a subvolume if the relevant
      inode isn't mapped in their idmapping. The idmapped mount case is the
      same in principle.
      
      This isn't a huge problem since a root user wanting to delete arbitrary
      subvolumes can always create another (even detached) mount without an
      idmapping attached.
      
      In addition, a follow-up commit will allow BTRFS_SUBVOL_SPEC_BY_ID for
      cases where the subvolume to delete is located directly under the inode
      referenced by the fd passed for the ioctl().
      
      Here is an example where a btrfs subvolume is deleted through a
      subvolume mount that does not expose the subvolume to be deleted; it
      can still be deleted by using the subvolume id:
      
        /* Compile the following program as "delete_by_spec". */
      
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <inttypes.h>
        #include <linux/btrfs.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <sys/stat.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static int rm_subvolume_by_id(int fd, uint64_t subvolid)
        {
      	 struct btrfs_ioctl_vol_args_v2 args = {};
      	 int ret;
      
      	 args.flags = BTRFS_SUBVOL_SPEC_BY_ID;
      	 args.subvolid = subvolid;
      
      	 ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args);
      	 if (ret < 0)
      		 return -1;
      
      	 return 0;
        }
      
        int main(int argc, char *argv[])
        {
      	 uint64_t subvolid = 0;
      
      	 if (argc < 3)
      		 exit(1);
      
      	 fprintf(stderr, "Opening %s\n", argv[1]);
      	 int fd = open(argv[1], O_CLOEXEC | O_DIRECTORY);
      	 if (fd < 0)
      		 exit(2);
      
      	 /* Subvolume ids are 64-bit, so parse and print them as such. */
      	 subvolid = strtoull(argv[2], NULL, 10);
      
      	 fprintf(stderr, "Deleting subvolume with subvolid %" PRIu64 "\n", subvolid);
      	 int ret = rm_subvolume_by_id(fd, subvolid);
      	 if (ret < 0)
      		 exit(3);
      
      	 exit(0);
        }
      
        truncate -s 10G btrfs.img
        mkfs.btrfs btrfs.img
        export LOOPDEV=$(sudo losetup -f --show btrfs.img)
        sudo mount ${LOOPDEV} /mnt
        sudo chown $(id -u):$(id -g) /mnt
        btrfs subvolume create /mnt/A
        # B must exist before C can be created beneath it
        btrfs subvolume create /mnt/B
        btrfs subvolume create /mnt/B/C
        # Get subvolume id via:
        sudo btrfs subvolume show /mnt/A
        # Save subvolid
        SUBVOLID=<nr>
        sudo umount /mnt
        sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt
        ./delete_by_spec /mnt ${SUBVOLID}
      
      With idmapped mounts this can potentially be used by users to delete
      subvolumes/snapshots they would otherwise not have access to as the
      idmapping would be applied to an inode that is not exposed in the mount
      of the subvolume.
      
      The fact that this is a filesystem wide operation suggests it might be a
      good idea to expose this under a separate ioctl that clearly indicates
      this. In essence, the file descriptor passed with the ioctl is merely
      used to identify the filesystem on which to operate when
      BTRFS_SUBVOL_SPEC_BY_ID is used.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls · 4d4340c9
      Christian Brauner authored
      Creating subvolumes and snapshots is one of the core features of btrfs
      and is even available to unprivileged users. Make it possible to use
      subvolume and snapshot creation on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: check whether fsgid/fsuid are mapped during subvolume creation · 5474bf40
      Christian Brauner authored
      When a new subvolume is created btrfs currently doesn't check whether
      the fsgid/fsuid of the caller actually have a mapping in the user
      namespace attached to the filesystem. The VFS always checks this to make
      sure that the caller's fsgid/fsuid can be represented on-disk. This is
      most relevant for filesystems that can be mounted inside user namespaces
      but it is in general a good hardening measure to prevent unrepresentable
      gid/uid from being written to disk.
      
      Follow-up patches add idmapped mount support to the btrfs ioctls that
      create subvolumes, so this check becomes important: we want to make sure
      that the fsgid/fsuid of the caller, as mapped according to the idmapped
      mount, can be represented on-disk. Simply add the missing
      fsuidgid_has_mapping() line from the VFS may_create() version to
      btrfs_may_create().
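To make the idea of "having a mapping" concrete, the following is a simplified userspace model, not the kernel helper. An ID map is treated as a list of extents like the lines of `/proc/<pid>/uid_map`; `struct id_extent`, `id_has_mapping()` and `fsuidgid_mapped()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One extent of an ID map, like a line in /proc/<pid>/uid_map. */
struct id_extent {
	uint32_t first;		/* first ID inside the namespace */
	uint32_t lower_first;	/* first ID it maps to outside the namespace */
	uint32_t count;		/* number of consecutive IDs covered */
};

/* True if the outside ID `kid` is covered by one of the extents. */
static bool id_has_mapping(const struct id_extent *map, size_t n, uint32_t kid)
{
	for (size_t i = 0; i < n; i++) {
		if (kid >= map[i].lower_first &&
		    kid - map[i].lower_first < map[i].count)
			return true;
	}
	return false;
}

/* Creation is only allowed when both the fsuid and the fsgid are mapped. */
static bool fsuidgid_mapped(const struct id_extent *uid_map, size_t nu,
			    const struct id_extent *gid_map, size_t ng,
			    uint32_t fsuid, uint32_t fsgid)
{
	return id_has_mapping(uid_map, nu, fsuid) &&
	       id_has_mapping(gid_map, ng, fsgid);
}
```

An ID outside every extent has no representation in the namespace, which is the case the added check rejects before anything is written to disk.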
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped permission inode op · 3bc71ba0
      Christian Brauner authored
      Enable btrfs_permission() to handle idmapped mounts. This is just a
      matter of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped setattr inode op · d4d09464
      Christian Brauner authored
      Enable btrfs_setattr() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped tmpfile inode op · 98b6ab5f
      Christian Brauner authored
      Enable btrfs_tmpfile() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped symlink inode op · 5a052108
      Christian Brauner authored
      Enable btrfs_symlink() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped mkdir inode op · b0b3e44d
      Christian Brauner authored
      Enable btrfs_mkdir() to handle idmapped mounts. This is just a matter of
      passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped create inode op · e93ca491
      Christian Brauner authored
      Enable btrfs_create() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped mknod inode op · 72105277
      Christian Brauner authored
      Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of
      passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped getattr inode op · c020d2ea
      Christian Brauner authored
      Enable btrfs_getattr() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped rename inode op · ca07274c
      Christian Brauner authored
      Enable btrfs_rename() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle idmaps in btrfs_new_inode() · b3b6f5b9
      Christian Brauner authored
      Extend btrfs_new_inode() to take the idmapped mount into account when
      initializing a new inode. This is just a matter of passing down the
      mount's userns. The rest is taken care of in inode_init_owner(). This is
      a preliminary patch to make the individual btrfs inode operations
      idmapped mount aware.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • namei: add mapping aware lookup helper · c2fd68b6
      Christian Brauner authored
      Various filesystems rely on the lookup_one_len() helper to lookup a
      single path component relative to a well-known starting point. Allow
      such filesystems to support idmapped mounts by adding a version of this
      helper to take the idmap into account when calling inode_permission().
      This change is required to let btrfs (and other filesystems) support
      idmapped mounts.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: document structures and their associated files · e7849e33
      Anand Jain authored
      The sysfs file has grown big. It takes some time to locate the correct
      struct attribute when adding new files. Create a table that maps each
      struct attribute to its sysfs path.
      
      Also, fix the comment about the debug sysfs path, and add the comments
      to the attributes, where the sysfs file names are defined, instead of
      to the attribute group.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix NULL pointer dereference when deleting device by invalid id · e4571b8c
      Qu Wenruo authored
      [BUG]
      It's easy to trigger a NULL pointer dereference, just by removing a
      non-existing device id:
      
       # mkfs.btrfs -f -m single -d single /dev/test/scratch1 \
      				     /dev/test/scratch2
       # mount /dev/test/scratch1 /mnt/btrfs
       # btrfs device remove 3 /mnt/btrfs
      
      Then we have the following kernel NULL pointer dereference:
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
       RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs]
        btrfs_ioctl+0x18bb/0x3190 [btrfs]
        ? lock_is_held_type+0xa5/0x120
        ? find_held_lock.constprop.0+0x2b/0x80
        ? do_user_addr_fault+0x201/0x6a0
        ? lock_release+0xd2/0x2d0
        ? __x64_sys_ioctl+0x83/0xb0
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      [CAUSE]
      Commit a27a94c2 ("btrfs: Make btrfs_find_device_by_devspec return
      btrfs_device directly") moves the "missing" device path check into
      btrfs_rm_device().
      
      But btrfs_rm_device() itself can hit the case where it only receives
      @devid, with NULL as @device_path.
      
      In that case, calling strcmp() on NULL will trigger the NULL pointer
      dereference.
      
      Before that commit, we handle the "missing" case inside
      btrfs_find_device_by_devspec(), which will not check @device_path at all
      if @devid is provided, thus no way to trigger the bug.
      
      [FIX]
      Before calling strcmp(), also make sure @device_path is not NULL.
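The shape of the fix can be shown with a small standalone sketch. The helper name `is_missing_device_path` is hypothetical; the point is only that the NULL test short-circuits before strcmp() is reached.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: true only when a path was supplied and it is the
 * literal string "missing". A NULL path (the devid-only case) never
 * reaches strcmp().
 */
static bool is_missing_device_path(const char *device_path)
{
	return device_path != NULL && strcmp(device_path, "missing") == 0;
}
```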
      
      Fixes: a27a94c2 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly")
      CC: stable@vger.kernel.org # 5.4+
      Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: add asserts on splitting extent_map · 63fb5879
      Naohiro Aota authored
      We call split_zoned_em() on an extent_map when submitting a bio for it.
      Thus, we can assume the extent_map is PINNED, not LOGGING, and in the
      modified list. Add ASSERT()s to ensure the extent_maps after the split
      also have the proper flags set and are in the modified list.
      Suggested-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: fix block group alloc_offset calculation · 0ae79c6f
      Naohiro Aota authored
      alloc_offset is offset from the start of a block group and @offset is
      actually an address in logical space. Thus, we need to consider
      block_group->start when calculating them.
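A small worked model of the arithmetic, under the assumption that the allocation pointer is advanced so it covers a range given by a logical address and a length. `struct block_group_model` and `advance_alloc_offset()` are hypothetical names, not kernel definitions.

```c
#include <stdint.h>

/* Hypothetical model of the fields involved. */
struct block_group_model {
	uint64_t start;		/* logical address where the block group begins */
	uint64_t alloc_offset;	/* allocation pointer, relative to `start` */
};

/*
 * Make sure the allocation pointer covers [offset, offset + bytes), where
 * `offset` is a logical address. Both sides of the comparison are put in
 * logical space, and the result is converted back to a relative offset.
 */
static void advance_alloc_offset(struct block_group_model *bg,
				 uint64_t offset, uint64_t bytes)
{
	uint64_t end = offset + bytes;

	if (bg->start + bg->alloc_offset < end)
		bg->alloc_offset = end - bg->start;
}
```

Comparing `alloc_offset` directly against `offset + bytes` would mix a relative value with an absolute one, which is the mismatch the commit describes.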
      
      Fixes: 011b41bf ("btrfs: zoned: advance allocation pointer after tree log node")
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: suppress reclaim error message on EAGAIN · ba86dd9f
      Naohiro Aota authored
      btrfs_relocate_chunk() can fail with -EAGAIN when e.g. send operations are
      running. The resulting error message can make btrfs/187 fail, and it is
      unnecessary because we add the block group back to the reclaim list anyway.
      
      btrfs_reclaim_bgs_work()
      `-> btrfs_relocate_chunk()
          `-> btrfs_relocate_block_group()
              `-> reloc_chunk_start()
                  `-> if (fs_info->send_in_progress)
                      `-> return -EAGAIN
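The resulting policy can be sketched as a predicate. This is a simplified illustration, not the kernel code, and `should_report_reclaim_error` is a hypothetical name: only failures other than -EAGAIN are worth logging, since an -EAGAIN failure is retried later.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical predicate: log a relocation failure unless it is -EAGAIN,
 * which only means the work has to be retried later.
 */
static bool should_report_reclaim_error(int ret)
{
	return ret != 0 && ret != -EAGAIN;
}
```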
      
      CC: stable@vger.kernel.org # 5.13+
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: allow disabling of zone auto reclaim · 77233c2d
      Johannes Thumshirn authored
      Automatically reclaiming dirty zones might not always be desired for all
      workloads, especially as there are currently still some rough edges with
      the relocation code on zoned filesystems.
      
      Allow disabling zone auto reclaim on a per filesystem basis by writing 0
      as the threshold value.
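A minimal model of how a zero threshold can act as an off switch. The comparison used here (unusable bytes against a percentage of the block group length) is an assumption made for illustration, and `should_auto_reclaim` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: a threshold of 0 disables auto reclaim entirely.
 * Otherwise the block group is queued once its unusable bytes reach
 * `thresh_percent` of its length (assumed comparison).
 */
static bool should_auto_reclaim(uint64_t zone_unusable, uint64_t length,
				unsigned int thresh_percent)
{
	if (thresh_percent == 0)
		return false;

	return zone_unusable >= (length * thresh_percent) / 100;
}
```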
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update comment at log_conflicting_inodes() · 1f295373
      Filipe Manana authored
      A comment at log_conflicting_inodes() mentions that we check the inode's
      logged_trans field instead of using btrfs_inode_in_log() because the field
      last_log_commit is not updated when we log that an inode exists and the
      inode has the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) set. The part
      about the full sync flag is not true anymore since commit 9acc8103
      ("btrfs: fix unpersisted i_size on fsync after expanding truncate"), so
      update the comment to not mention that part anymore.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove no longer needed full sync flag check at inode_logged() · d135a533
      Filipe Manana authored
      Now that we are checking if the inode's logged_trans is 0 to detect the
      possibility of the inode having been evicted and reloaded, the test for
      the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) is no longer needed at
      tree-log.c:inode_logged(). Its purpose was to detect the possibility
      of a previous eviction as well, since when an inode is loaded the full
      sync flag is always set on it (and only cleared after the inode is
      logged).
      
      So just remove the check and update the comment. The check for the inode's
      logged_trans being 0 was added recently by the patch with the subject
      "btrfs: eliminate some false positives when checking if inode was logged".
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary NULL check for the new inode during rename exchange · 1c167b87
      Filipe Manana authored
      At the very end of btrfs_rename_exchange(), in case an error happened, we
      are checking if 'new_inode' is NULL, but that is not needed since during a
      rename exchange, unlike regular renames, 'new_inode' can never be NULL,
      and if it were, we would have crashed much earlier when we dereference it
      multiple times.
      
      So remove the check because it is not necessary and because it is causing
      static checkers to emit a warning. I probably introduced the check by
      copy-pasting similar code from btrfs_rename(), where 'new_inode' can be
      NULL, in commit 86e8aa0e ("Btrfs: unpin logs if rename exchange
      operation fails").
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate backref_ctx on stack in find_extent_clone · dce28150
      Goldwyn Rodrigues authored
      Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx
      on stack. The size is reasonably small.
      
      sizeof(backref_ctx) = 48
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate btrfs_ioctl_defrag_range_args on stack · c853a578
      Goldwyn Rodrigues authored
      Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args,
      allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably
      small and ioctls are called in process context.
      
      sizeof(btrfs_ioctl_defrag_range_args) = 48
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate btrfs_ioctl_quota_rescan_args on stack · 0afb603a
      Goldwyn Rodrigues authored
      Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args,
      allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably
      small and ioctls are called in process context.
      
      sizeof(btrfs_ioctl_quota_rescan_args) = 64
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate file_ra_state on stack in readahead_cache · 98caf953
      Goldwyn Rodrigues authored
      Instead of allocating file_ra_state using kmalloc, allocate on stack.
      sizeof(struct readahead) = 32 bytes.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce btrfs_search_backwards function · 0ff40a91
      Marcos Paulo de Souza authored
      It's a common practice to start a search using offset (u64)-1, which is
      the u64 maximum value, meaning that we want the search_slot function to
      be positioned at the last item with the same objectid and type.
      
      Once we are in this position, it's a matter of searching backwards by
      calling btrfs_previous_item, which checks whether we need to go to a
      previous leaf and performs the other necessary checks, to make sure that
      we are at the last offset of the same objectid and type.
      
      The new btrfs_search_backwards function does all these steps when
      necessary, and can be used to avoid code duplication.
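
      The search pattern described above can be sketched over a flat sorted
      array instead of a b-tree. This is a simplified userspace illustration
      under that assumption, not the kernel implementation; struct key,
      search_slot() and search_backwards() are hypothetical stand-ins for the
      btrfs key, btrfs_search_slot and the new helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified key: items are ordered by (objectid, type, offset). */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

static int key_cmp(const struct key *a, const struct key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Returns the index of the first item that is >= k (may be n). */
static size_t search_slot(const struct key *items, size_t n,
			  const struct key *k)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (key_cmp(&items[mid], k) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Find the last item with the given objectid and type.
 * Returns 0 and sets *slot on success, 1 if no such item exists.
 */
static int search_backwards(const struct key *items, size_t n,
			    uint64_t objectid, uint8_t type, size_t *slot)
{
	/* Start from the largest possible offset, i.e. (u64)-1. */
	struct key k = { objectid, type, UINT64_MAX };
	size_t pos = search_slot(items, n, &k);

	/* An exact hit on offset (u64)-1 needs no step back. */
	if (pos < n && key_cmp(&items[pos], &k) == 0) {
		*slot = pos;
		return 0;
	}
	/* Otherwise step to the previous item and check it still matches. */
	if (pos == 0)
		return 1;
	pos--;
	if (items[pos].objectid != objectid || items[pos].type != type)
		return 1;
	*slot = pos;
	return 0;
}
```

      In the flat-array model the "previous leaf" handling disappears; in the
      real tree that part is what btrfs_previous_item takes care of.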
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0ff40a91
    • David Sterba's avatar
      btrfs: print if fsverity support is built in when loading module · ea3dc7d2
      David Sterba authored
      As fsverity support depends on a config option, print that at module
      load time like we do for similar features.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ea3dc7d2
    • Boris Burkov's avatar
      btrfs: verity metadata orphan items · 70524253
      Boris Burkov authored
      Writing out the verity data is too large of an operation to do in a
      single transaction. If we are interrupted before we finish creating
      fsverity metadata for a file, or fail to clean up already created
      metadata after a failure, we could leak the verity items that we already
      committed.
      
      To address this issue, we use the orphan mechanism. When we start
      enabling verity on a file, we also add an orphan item for that inode.
      When we are finished, we delete the orphan. However, if we are
      interrupted midway, the orphan will be present at mount and we can
      clean up the half-formed verity state.
      
      There is a possible race with a normal unlink operation: if unlink and
      verity run on the same file in parallel, it is possible for verity to
      succeed and delete the still legitimate orphan added by unlink. Then, if
      we are interrupted and mount in that state, we will never clean up the
      inode properly. This is also possible for a file created with O_TMPFILE.
      Check nlink==0 before deleting to avoid this race.
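
      The nlink check can be illustrated with a toy model. This is only a
      sketch of the rule stated above, assuming a single in-memory flag stands
      in for the on-disk orphan item; struct toy_inode and both functions are
      hypothetical and do not mirror the kernel's data structures.

```c
#include <stdbool.h>

/* Toy inode: one flag models the presence of an orphan item. */
struct toy_inode {
	unsigned int nlink;
	bool has_orphan_item;
	bool verity_enabled;
};

/* Starting to enable verity adds an orphan item for the inode. */
static void begin_enable_verity(struct toy_inode *inode)
{
	inode->has_orphan_item = true;
}

/*
 * Finishing verity deletes the orphan item, unless nlink is 0. In that
 * case the orphan is still needed by unlink (or O_TMPFILE) so that the
 * inode is cleaned up at the next mount.
 */
static void end_enable_verity(struct toy_inode *inode)
{
	inode->verity_enabled = true;
	if (inode->nlink != 0)
		inode->has_orphan_item = false;
}
```

      If an unlink drops nlink to 0 between the two calls, the orphan item
      survives and the inode can still be cleaned up after an interruption.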
      
      A final thing to note is that this is a resurrection of using orphans to
      signal an operation besides "delete this inode". The old case was to
      signal the need to do a truncate. That case still technically applies
      for mounting very old file systems, so we need to take some care to not
      clobber it. To that end, we just have to be careful that verity orphan
      cleanup is a no-op for non-verity files.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      70524253
    • Boris Burkov's avatar
      btrfs: initial fsverity support · 14605409
      Boris Burkov authored
      Add support for fsverity in btrfs. To support the generic interface in
      fs/verity, we add two new item types in the fs tree for inodes with
      verity enabled. One stores the per-file verity descriptor and btrfs
      verity item and the other stores the Merkle tree data itself.
      
      Verity checking is done in end_page_read just before a page is marked
      uptodate. This naturally handles a variety of edge cases like holes,
      preallocated extents, and inline extents. Some care needs to be taken to
      not try to verify pages past the end of the file, which are accessed by
      the generic buffered file reading code under some circumstances like
      reading to the end of the last page and trying to read again. Direct IO
      on a verity file falls back to buffered reads.
      
      Verity relies on PageChecked for the Merkle tree data itself to avoid
      re-walking up shared paths in the tree. For this reason, we need to
      cache the Merkle tree data. Since the file is immutable after verity is
      turned on, we can cache it at an index past EOF.
      
      Use the new inode ro_flags to store verity on the inode item, so that we
      can enable verity on a file, then roll back to an older kernel and still
      mount the file system and read the file. Since we can't safely write the
      file anymore without ruining the invariants of the Merkle tree, we mark
      a ro_compat flag on the file system when a file has verity enabled.
      Acked-by: Eric Biggers <ebiggers@google.com>
      Co-developed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      14605409
    • Boris Burkov's avatar
      btrfs: add ro compat flags to inodes · 77eea05e
      Boris Burkov authored
      Currently, inode flags are fully backwards incompatible in btrfs. If we
      introduce a new inode flag, then tree-checker will detect it and fail.
      This can even cause us to fail to mount entirely. To make it possible to
      introduce new flags which can be read-only compatible, like VERITY, we
      add new ro flags to btrfs without treating them quite so harshly in
      tree-checker. A read-only file system can survive an unexpected flag,
      and can be mounted.
      
      As for the implementation, it unfortunately gets a little complicated.
      
      The on-disk representation of the inode, btrfs_inode_item, has an __le64
      for flags but the in-memory representation, btrfs_inode, uses a u32.
      David Sterba had the nice idea that we could reclaim those wasted 32 bits
      on disk and use them for the new ro_compat flags.
      
      It turns out that the tree-checker code which checks for unknown flags
      is broken, and ignores the upper 32 bits we are hoping to use. The issue
      is that the flags use the literal 1 rather than 1ULL, so the flags are
      signed ints, and one of them is specifically (1 << 31). As a result, the
      mask which ORs the flags is a negative integer on machines where int is
      32 bit twos complement. When tree-checker evaluates the expression:
      
        btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK
      
      The mask is something like 0x80000abc, which gets promoted to u64 with
      sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
      all the upper bits zeroed, and we can't detect unexpected flags.
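
      The effect can be reproduced in a few lines of userspace C. This is a
      minimal sketch, assuming a 32-bit twos complement int: the macro names
      are placeholders, and the mask value is the illustrative 0x80000abc from
      the text rather than the kernel's real flag mask. INT32_MIN supplies
      bit 31 so the sketch does not depend on shifting into the sign bit.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder for the old mask: a signed int with bit 31 set. */
#define OLD_SIGNED_MASK   ((int32_t)(INT32_MIN | 0xabc))
/* Placeholder for a fixed mask: the same bits, but as an unsigned 64-bit value. */
#define NEW_UNSIGNED_MASK UINT64_C(0x80000abc)

/*
 * Old-style check. Whether one thinks of it as "sign-extend, then invert"
 * or "invert, then widen", the upper 32 bits of the final 64-bit mask are
 * zero, so unknown flags in the upper half are silently dropped.
 */
static uint64_t unknown_bits_old(uint64_t flags)
{
	return flags & ~OLD_SIGNED_MASK;
}

/* Fixed check: inverting an unsigned 64-bit mask keeps the upper bits set. */
static uint64_t unknown_bits_new(uint64_t flags)
{
	return flags & ~NEW_UNSIGNED_MASK;
}

/* Widening the signed mask to u64 sign-extends it, as described above. */
static uint64_t widened_old_mask(void)
{
	return (uint64_t)OLD_SIGNED_MASK;
}
```

      For a flags value with only bit 40 set, unknown_bits_old() returns 0
      while unknown_bits_new() returns the offending bit. Both variants still
      catch unknown bits in the lower 32 bits.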
      
      This suggests that we can't use those bits after all. Luckily, we have
      good reason to believe that they are zero anyway. Inode flags are
      metadata, which is always checksummed, so any bit flips that would
      introduce 1s would cause a checksum failure anyway (excluding the
      improbable case of the checksum getting corrupted in exactly the
      matching way).
      
      Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
      inode flag should preserve its value and not add leading ones
      (at least for twos complement). The only place that flag
      (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
      the root item, and indeed for that inode we see 0xffffffff80000000 as
      the flags on disk. However, that inode is never seen by tree checker,
      nor is it used in a context where verity might be meaningful.
      Theoretically, a future ro flag might cause trouble on that inode, so we
      should proactively clean up that mess before it does.
      
      With the introduction of the new ro flags, keep two separate unsigned
      masks and check them against the appropriate u32. Since we no longer run
      afoul of sign extension, this also stops writing out 0xffffffff80000000
      in root_item inodes going forward.
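
      The split into two unsigned 32-bit halves can be sketched as follows.
      This is an illustration under stated assumptions, not the kernel code:
      the helper names are hypothetical, the lower-half mask reuses the
      example value 0x80000abc from above, and the ro mask assumes a single
      ro flag at bit 0 of the upper half.

```c
#include <stdint.h>

/* Placeholder masks; the real sets of known flags are defined by btrfs. */
#define KNOWN_FLAGS_MASK    UINT32_C(0x80000abc)
#define KNOWN_RO_FLAGS_MASK UINT32_C(0x1)

/* Lower 32 bits hold the regular flags, upper 32 bits the ro flags. */
static void split_flags(uint64_t on_disk, uint32_t *flags, uint32_t *ro_flags)
{
	*flags = (uint32_t)on_disk;
	*ro_flags = (uint32_t)(on_disk >> 32);
}

/* Both halves are unsigned, so widening never sign-extends. */
static uint64_t combine_flags(uint32_t flags, uint32_t ro_flags)
{
	return (uint64_t)flags | ((uint64_t)ro_flags << 32);
}

/* Each half is checked against its own unsigned 32-bit mask. */
static int has_unknown_flags(uint64_t on_disk)
{
	uint32_t flags, ro_flags;

	split_flags(on_disk, &flags, &ro_flags);
	return (flags & ~KNOWN_FLAGS_MASK) != 0;
}

static int has_unknown_ro_flags(uint64_t on_disk)
{
	uint32_t flags, ro_flags;

	split_flags(on_disk, &flags, &ro_flags);
	return (ro_flags & ~KNOWN_RO_FLAGS_MASK) != 0;
}
```

      With this layout, combining a flags value that has bit 31 set with empty
      ro flags yields 0x0000000080000000 rather than 0xffffffff80000000.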
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      77eea05e
    • Anand Jain's avatar
      btrfs: simplify return values in btrfs_check_raid_min_devices · efc222f8
      Anand Jain authored
      Function btrfs_check_raid_min_devices() returns error code from the enum
      btrfs_err_code and it starts from 1. So there is no need to check if ret
      is > 0. So drop this check and also drop the local variable ret.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      efc222f8
    • Qu Wenruo's avatar
      btrfs: remove the dead comment in writepage_delalloc() · 7361b4ae
      Qu Wenruo authored
      When btrfs_run_delalloc_range() fails, we error out.
      
      But there is a strange comment mentioning that
      btrfs_run_delalloc_range() could have returned value >0 to indicate the
      IO has already started.
      
      Commit 40f76580 ("Btrfs: split up __extent_writepage to lower stack
      usage") introduced the comment, but unfortunately at that time, we were
      already using @page_started to indicate that case, and still return 0.
      
      Furthermore, even if that comment were right (which it is not), we would
      return -EIO if the IO had already started.
      
      Either way the comment is incorrect, so just remove it along with the
      dead check.
      
      Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to
      make sure we either return 0 or error, no positive return value.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7361b4ae
    • David Sterba's avatar
      btrfs: allow degenerate raid0/raid10 · b2f78e88
      David Sterba authored
      The data on raid0 and raid10 are supposed to be spread over multiple
      devices, so the minimum constraints are set to 2 and 4 respectively.
      This is an artificial limit and there's some interest in removing it.
      
      Change this to allow raid0 on one device and raid10 on two devices. This
      works as expected, e.g. when converting or removing devices.
      
      The only difference is when raid0 on two devices gets one device
      removed. An unpatched kernel would silently create a single profile,
      while a patched kernel keeps raid0.
      
      The motivation is to preserve the profile type for as long as possible
      through some intermediate state (device removal, conversion), or when
      there are disks of different sizes: with raid0, the otherwise unusable
      space of the last device will be used too. Similarly for raid10, though
      the two largest devices would need to be the same size.
      
      An unpatched kernel will mount and use the degenerate profiles just fine
      but won't allow any operation that would not satisfy the stricter device
      number constraints, e.g. going from 3 to 2 devices for raid10, or
      various profile conversions.
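
      The change in constraints can be summarised as a small lookup table.
      This is a sketch built only from the numbers given above (2 to 1 for
      raid0, 4 to 2 for raid10); the enum, struct and profile_allowed() are
      hypothetical and do not reflect how the kernel stores these limits.

```c
#include <stdbool.h>

enum raid_profile { PROFILE_RAID0, PROFILE_RAID10 };

struct min_devices {
	unsigned int before; /* minimum device count before this change */
	unsigned int after;  /* minimum device count after this change */
};

static const struct min_devices limits[] = {
	[PROFILE_RAID0]  = { .before = 2, .after = 1 },
	[PROFILE_RAID10] = { .before = 4, .after = 2 },
};

/* Would an operation leaving num_devices for this profile be accepted? */
static bool profile_allowed(enum raid_profile profile,
			    unsigned int num_devices, bool patched)
{
	unsigned int min = patched ? limits[profile].after
				   : limits[profile].before;

	return num_devices >= min;
}
```

      For example, shrinking raid10 from 3 to 2 devices is rejected under the
      old limits and accepted under the new ones.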
      
      Example output:
      
        # btrfs fi us -T .
        Overall:
            Device size:                  10.00GiB
            Device allocated:              1.01GiB
            Device unallocated:            8.99GiB
            Device missing:                  0.00B
            Used:                        200.61MiB
            Free (estimated):              9.79GiB      (min: 9.79GiB)
            Free (statfs, df):             9.79GiB
            Data ratio:                       1.00
            Metadata ratio:                   1.00
            Global reserve:                3.25MiB      (used: 0.00B)
            Multiple profiles:                  no
      
                      Data      Metadata  System
        Id Path       RAID0     single    single   Unallocated
        -- ---------- --------- --------- -------- -----------
         1 /dev/sda10   1.00GiB   8.00MiB  1.00MiB     8.99GiB
        -- ---------- --------- --------- -------- -----------
           Total        1.00GiB   8.00MiB  1.00MiB     8.99GiB
           Used       200.25MiB 352.00KiB 16.00KiB
      
        # btrfs dev us .
        /dev/sda10, ID: 1
           Device size:            10.00GiB
           Device slack:              0.00B
           Data,RAID0/1:            1.00GiB
           Metadata,single:         8.00MiB
           System,single:           1.00MiB
           Unallocated:             8.99GiB
      
      Note "Data,RAID0/1": with btrfs-progs 5.13+, the number of devices per
      profile is printed.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b2f78e88
    • Filipe Manana's avatar
      btrfs: do not pin logs too early during renames · bd54f381
      Filipe Manana authored
      During renames we pin the logs of the roots a bit too early, before the
      calls to btrfs_insert_inode_ref(). We can pin the logs after those calls,
      since those will not change anything in a log tree.
      
      In a scenario where we have multiple and diverse filesystem operations
      running in parallel, those calls can take a significant amount of time,
      due to lock contention on extent buffers, and delay log commits from other
      tasks for longer than necessary.
      
      So just pin logs after calls to btrfs_insert_inode_ref() and right before
      the first operation that can update a log tree.
      
      The following script that uses dbench was used for testing:
      
        $ cat dbench-test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-m single -d single"
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 120 16
      
        umount $MNT
      
      The tests were run on a machine with 12 cores, 64G of RAM, an NVMe device
      and using a non-debug kernel config (Debian's default config).
      
      The results compare a branch without this patch and without the previous
      patch in the series, that has the subject:
      
       "btrfs: eliminate some false positives when checking if inode was logged"
      
      Versus the same branch with these two patches applied.
      
      dbench with 8 clients, results before:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4391359     0.009   249.745
       Close        3225882     0.001     3.243
       Rename        185953     0.065   240.643
       Unlink        886669     0.049   249.906
       Deltree          112     2.455   217.433
       Mkdir             56     0.002     0.004
       Qpathinfo    3980281     0.004     3.109
       Qfileinfo     697579     0.001     0.187
       Qfsinfo       729780     0.002     2.424
       Sfileinfo     357764     0.004     1.415
       Find         1538861     0.016     4.863
       WriteX       2189666     0.010     3.327
       ReadX        6883443     0.002     0.729
       LockX          14298     0.002     0.073
       UnlockX        14298     0.001     0.042
       Flush         307777     2.447   303.663
      
      Throughput 1149.6 MB/sec  8 clients  8 procs  max_latency=303.666 ms
      
      dbench with 8 clients, results after:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4269920     0.009   213.532
       Close        3136653     0.001     0.690
       Rename        180805     0.082   213.858
       Unlink        862189     0.050   172.893
       Deltree          112     2.998   218.328
       Mkdir             56     0.002     0.003
       Qpathinfo    3870158     0.004     5.072
       Qfileinfo     678375     0.001     0.194
       Qfsinfo       709604     0.002     0.485
       Sfileinfo     347850     0.004     1.304
       Find         1496310     0.017     5.504
       WriteX       2129613     0.010     2.882
       ReadX        6693066     0.002     1.517
       LockX          13902     0.002     0.075
       UnlockX        13902     0.001     0.055
       Flush         299276     2.511   220.189
      
      Throughput 1187.33 MB/sec  8 clients  8 procs  max_latency=220.194 ms
      
      +3.2% throughput, -31.8% max latency
      
      dbench with 16 clients, results before:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5978334     0.028   156.507
       Close        4391598     0.001     1.345
       Rename        253136     0.241   155.057
       Unlink       1207220     0.182   257.344
       Deltree          160     6.123    36.277
       Mkdir             80     0.003     0.005
       Qpathinfo    5418817     0.012     6.867
       Qfileinfo     949929     0.001     0.941
       Qfsinfo       993560     0.002     1.386
       Sfileinfo     486904     0.004     2.829
       Find         2095088     0.059     8.164
       WriteX       2982319     0.017     9.029
       ReadX        9371484     0.002     4.052
       LockX          19470     0.002     0.461
       UnlockX        19470     0.001     0.990
       Flush         418936     2.740   347.902
      
      Throughput 1495.31 MB/sec  16 clients  16 procs  max_latency=347.909 ms
      
      dbench with 16 clients, results after:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5711833     0.029   131.240
       Close        4195897     0.001     1.732
       Rename        241849     0.204   147.831
       Unlink       1153341     0.184   231.322
       Deltree          160     6.086    30.198
       Mkdir             80     0.003     0.021
       Qpathinfo    5177011     0.012     7.150
       Qfileinfo     907768     0.001     0.793
       Qfsinfo       949205     0.002     1.431
       Sfileinfo     465317     0.004     2.454
       Find         2001541     0.058     7.819
       WriteX       2850661     0.017     9.110
       ReadX        8952289     0.002     3.991
       LockX          18596     0.002     0.655
       UnlockX        18596     0.001     0.179
       Flush         400342     2.879   293.607
      
      Throughput 1565.73 MB/sec  16 clients  16 procs  max_latency=293.611 ms
      
      +4.6% throughput, -16.9% max latency
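
      The text does not say how the summary percentages were computed. They
      appear to be consistent with a symmetric percent difference (the change
      divided by the mean of the two values); this is an inference from the
      numbers, not something stated in the commit. A minimal sketch, with a
      hypothetical helper name:

```c
/*
 * Symmetric percent difference between two measurements: the change
 * relative to the mean of both values. Assumed formula, see above.
 */
static double sym_pct_diff(double before, double after)
{
	return 100.0 * (after - before) / ((before + after) / 2.0);
}
```

      With the 8-client figures, sym_pct_diff(1149.6, 1187.33) gives roughly
      3.2 and sym_pct_diff(303.666, 220.194) roughly -31.9, close to the
      quoted values.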
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bd54f381