1. 23 Aug, 2021 40 commits
    • btrfs: allow idmapped mknod inode op · 72105277
      Christian Brauner authored
      Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of
      passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped getattr inode op · c020d2ea
      Christian Brauner authored
      Enable btrfs_getattr() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped rename inode op · ca07274c
      Christian Brauner authored
      Enable btrfs_rename() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle idmaps in btrfs_new_inode() · b3b6f5b9
      Christian Brauner authored
      Extend btrfs_new_inode() to take the idmapped mount into account when
      initializing a new inode. This is just a matter of passing down the
      mount's userns. The rest is taken care of in inode_init_owner(). This is
      a preliminary patch to make the individual btrfs inode operations
      idmapped mount aware.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • namei: add mapping aware lookup helper · c2fd68b6
      Christian Brauner authored
      Various filesystems rely on the lookup_one_len() helper to look up a
      single path component relative to a well-known starting point. Allow
      such filesystems to support idmapped mounts by adding a version of this
      helper that takes the idmap into account when calling inode_permission().
      This change is required to let btrfs (and other filesystems) support
      idmapped mounts.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: document structures and their associated files · e7849e33
      Anand Jain authored
      The sysfs file has grown big. It takes some time to locate the correct
      struct attribute to add new files. Create a table and map the struct
      attribute to its sysfs path.
      
      Also, fix the comment about the debug sysfs path.  And add the comments
      to the attributes instead of attribute group, where sysfs file names are
      defined.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix NULL pointer dereference when deleting device by invalid id · e4571b8c
      Qu Wenruo authored
      [BUG]
      It's easy to trigger NULL pointer dereference, just by removing a
      non-existing device id:
      
       # mkfs.btrfs -f -m single -d single /dev/test/scratch1 \
      				     /dev/test/scratch2
       # mount /dev/test/scratch1 /mnt/btrfs
       # btrfs device remove 3 /mnt/btrfs
      
      Then we have the following kernel NULL pointer dereference:
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
       RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs]
        btrfs_ioctl+0x18bb/0x3190 [btrfs]
        ? lock_is_held_type+0xa5/0x120
        ? find_held_lock.constprop.0+0x2b/0x80
        ? do_user_addr_fault+0x201/0x6a0
        ? lock_release+0xd2/0x2d0
        ? __x64_sys_ioctl+0x83/0xb0
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      [CAUSE]
      Commit a27a94c2 ("btrfs: Make btrfs_find_device_by_devspec return
      btrfs_device directly") moves the "missing" device path check into
      btrfs_rm_device().
      
      But btrfs_rm_device() itself can have a case where it only receives
      @devid, with NULL as @device_path.
      
      In that case, calling strcmp() on NULL will trigger the NULL pointer
      dereference.
      
      Before that commit, we handled the "missing" case inside
      btrfs_find_device_by_devspec(), which did not check @device_path at all
      if @devid was provided, thus there was no way to trigger the bug.
      
      [FIX]
      Before calling strcmp(), also make sure @device_path is not NULL.
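
      The guard described in [FIX] can be sketched in userspace as follows. This
      is only a model of the check, not the actual btrfs_rm_device() code; the
      helper name is_missing_devspec is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns true only when a device path was given
 * and it is the literal string "missing". A NULL path (removal by devid
 * only) is rejected before strcmp() can dereference it.
 */
bool is_missing_devspec(const char *device_path)
{
	return device_path != NULL && strcmp(device_path, "missing") == 0;
}
```

      Because && short-circuits, strcmp() is never reached when the caller
      passes only a devid and a NULL path.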
      
      Fixes: a27a94c2 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly")
      CC: stable@vger.kernel.org # 5.4+
      Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: add asserts on splitting extent_map · 63fb5879
      Naohiro Aota authored
      We call split_zoned_em() on an extent_map when submitting a bio for it. Thus,
      we can assume the extent_map is PINNED, not LOGGING, and in the modified
      list. Add ASSERT()s to ensure the extent_maps after the split also have the
      proper flags set and are in the modified list.
      Suggested-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: fix block group alloc_offset calculation · 0ae79c6f
      Naohiro Aota authored
      alloc_offset is an offset from the start of a block group and @offset is
      actually an address in logical space. Thus, we need to consider
      block_group->start when calculating them.
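
      A minimal sketch of the address arithmetic, assuming a simplified block
      group. The struct and the helper logical_to_bg_offset are hypothetical
      stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a block group. */
struct block_group {
	uint64_t start;        /* logical address where the block group begins */
	uint64_t length;       /* size of the block group in bytes */
	uint64_t alloc_offset; /* allocation pointer, relative to start */
};

/*
 * Convert a logical address inside the block group into an offset
 * relative to the start of that block group.
 */
uint64_t logical_to_bg_offset(const struct block_group *bg, uint64_t logical)
{
	assert(logical >= bg->start && logical - bg->start <= bg->length);
	return logical - bg->start;
}
```

      Comparing a logical address directly with alloc_offset mixes two
      coordinate systems; subtracting the block group start first puts both
      values in the same one.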
      
      Fixes: 011b41bf ("btrfs: zoned: advance allocation pointer after tree log node")
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: suppress reclaim error message on EAGAIN · ba86dd9f
      Naohiro Aota authored
      btrfs_relocate_chunk() can fail with -EAGAIN when e.g. send operations are
      running. The error message can make btrfs/187 fail, and it is unnecessary
      because the block group is added back to the reclaim list anyway.
      
      btrfs_reclaim_bgs_work()
      `-> btrfs_relocate_chunk()
          `-> btrfs_relocate_block_group()
              `-> reloc_chunk_start()
                  `-> if (fs_info->send_in_progress)
                      `-> return -EAGAIN
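
      The decision of whether to print the message can be sketched as below. The
      helper name should_report_reclaim_error is hypothetical; it only models
      the "stay quiet on -EAGAIN" rule described above.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a relocation result deserves an
 * error message. ret follows the kernel convention of 0 on success and
 * a negative errno value on failure.
 */
bool should_report_reclaim_error(int ret)
{
	return ret != 0 && ret != -EAGAIN;
}
```

      With this rule, a transient -EAGAIN is retried silently while any other
      failure is still reported.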
      
      CC: stable@vger.kernel.org # 5.13+
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: allow disabling of zone auto reclaim · 77233c2d
      Johannes Thumshirn authored
      Automatically reclaiming dirty zones might not always be desired for all
      workloads, especially as there are currently still some rough edges with
      the relocation code on zoned filesystems.
      
      Allow disabling zone auto reclaim on a per filesystem basis by writing 0
      as the threshold value.
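
      A rough sketch of why a threshold of 0 needs explicit handling. It assumes
      the reclaim trigger compares the unusable bytes of a block group against a
      percentage of its length; the function name and parameters are
      hypothetical and simplified.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: a block group becomes a reclaim candidate once
 * its unusable bytes reach threshold_percent of its length. A threshold
 * of 0 disables automatic reclaim entirely.
 */
bool should_auto_reclaim(uint64_t unusable, uint64_t length,
			 unsigned int threshold_percent)
{
	if (threshold_percent == 0)
		return false;
	return unusable >= (length / 100) * threshold_percent;
}
```

      Without the early return, a threshold of 0 would make the comparison true
      for every block group, which is the opposite of disabling reclaim.
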
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update comment at log_conflicting_inodes() · 1f295373
      Filipe Manana authored
      A comment at log_conflicting_inodes() mentions that we check the inode's
      logged_trans field instead of using btrfs_inode_in_log() because the field
      last_log_commit is not updated when we log that an inode exists and the
      inode has the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) set. The part
      about the full sync flag is not true anymore since commit 9acc8103
      ("btrfs: fix unpersisted i_size on fsync after expanding truncate"), so
      update the comment to not mention that part anymore.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove no longer needed full sync flag check at inode_logged() · d135a533
      Filipe Manana authored
      Now that we are checking if the inode's logged_trans is 0 to detect the
      possibility of the inode having been evicted and reloaded, the test for
      the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) is no longer needed at
      tree-log.c:inode_logged(). Its purpose was to detect the possibility
      of a previous eviction as well, since when an inode is loaded the full
      sync flag is always set on it (and only cleared after the inode is
      logged).
      
      So just remove the check and update the comment. The check for the inode's
      logged_trans being 0 was added recently by the patch with the subject
      "btrfs: eliminate some false positives when checking if inode was logged".
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary NULL check for the new inode during rename exchange · 1c167b87
      Filipe Manana authored
      At the very end of btrfs_rename_exchange(), in case an error happened, we
      are checking if 'new_inode' is NULL, but that is not needed since during a
      rename exchange, unlike regular renames, 'new_inode' can never be NULL,
      and if it were, we would have crashed much earlier when we dereference it
      multiple times.
      
      So remove the check because it is not necessary and because it is causing
      static checkers to emit a warning. I probably introduced the check by
      copy-pasting similar code from btrfs_rename(), where 'new_inode' can be
      NULL, in commit 86e8aa0e ("Btrfs: unpin logs if rename exchange
      operation fails").
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate backref_ctx on stack in find_extent_clone · dce28150
      Goldwyn Rodrigues authored
      Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx
      on stack. The size is reasonably small.
      
      sizeof(backref_ctx) = 48
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate btrfs_ioctl_defrag_range_args on stack · c853a578
      Goldwyn Rodrigues authored
      Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args,
      allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably
      small and ioctls are called in process context.
      
      sizeof(btrfs_ioctl_defrag_range_args) = 48
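
      The quoted size can be checked with a userspace mirror of the structure.
      The field list below is written from memory of the UAPI header and should
      be treated as an assumption; the struct name defrag_range_args is a
      stand-in, not the kernel type.

```c
#include <stdint.h>

/*
 * Userspace mirror of the ioctl argument layout (assumed, see above):
 * three 64 bit fields, two 32 bit fields and four reserved 32 bit words.
 */
struct defrag_range_args {
	uint64_t start;
	uint64_t len;
	uint64_t flags;
	uint32_t extent_thresh;
	uint32_t compress_type;
	uint32_t unused[4];
};

/* 3 * 8 + 2 * 4 + 4 * 4 = 48 bytes, matching the figure quoted above. */
_Static_assert(sizeof(struct defrag_range_args) == 48, "unexpected size");
```

      At 48 bytes the structure is small enough that a local variable is a
      reasonable replacement for a heap allocation.
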
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate btrfs_ioctl_quota_rescan_args on stack · 0afb603a
      Goldwyn Rodrigues authored
      Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args,
      allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably
      small and ioctls are called in process context.
      
      sizeof(btrfs_ioctl_quota_rescan_args) = 64
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate file_ra_state on stack in readahead_cache · 98caf953
      Goldwyn Rodrigues authored
      Instead of allocating file_ra_state using kmalloc, allocate on stack.
      sizeof(struct readahead) = 32 bytes.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce btrfs_search_backwards function · 0ff40a91
      Marcos Paulo de Souza authored
      It's a common practice to start a search using offset (u64)-1, which is
      the u64 maximum value, meaning that we want the search_slot function to
      land on the last item with the same objectid and type.
      
      Once we are in this position, it's a matter of searching backwards
      by calling btrfs_previous_item, which will check if we need to go to
      a previous leaf and perform other necessary checks, only to be sure that
      we are at the last offset of the same objectid and type.
      
      The new btrfs_search_backwards function does all these steps when
      necessary, and can be used to avoid code duplication.
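
      The pattern can be illustrated on a plain sorted array instead of a
      b-tree. This is a userspace analogue under that assumption, not the kernel
      implementation; struct key, key_cmp and search_backwards are hypothetical
      names.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified key, ordered by (objectid, type, offset). */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

static int key_cmp(const struct key *a, const struct key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Search for (objectid, type, UINT64_MAX), then step back one slot.
 * Returns the index of the last item with the same objectid and type,
 * or -1 if there is no such item.
 */
long search_backwards(const struct key *items, size_t n,
		      uint64_t objectid, uint8_t type)
{
	const struct key target = { objectid, type, UINT64_MAX };
	size_t lo = 0, hi = n;

	/* Lower bound: first slot whose key is >= target. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (key_cmp(&items[mid], &target) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	/* An exact match at offset UINT64_MAX is already the last item. */
	if (lo < n && key_cmp(&items[lo], &target) == 0)
		return (long)lo;
	/* Otherwise the candidate is the previous slot, if it matches. */
	if (lo == 0)
		return -1;
	if (items[lo - 1].objectid != objectid || items[lo - 1].type != type)
		return -1;
	return (long)(lo - 1);
}
```

      The "previous slot" step plays the role that btrfs_previous_item has in
      the description above; in the real tree it may also have to move to the
      previous leaf.
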
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: print if fsverity support is built in when loading module · ea3dc7d2
      David Sterba authored
      As fsverity support depends on a config option, print that at module
      load time like we do for similar features.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: verity metadata orphan items · 70524253
      Boris Burkov authored
      Writing out the verity data is too large an operation to do in a
      single transaction. If we are interrupted before we finish creating
      fsverity metadata for a file, or fail to clean up already created
      metadata after a failure, we could leak the verity items that we already
      committed.
      
      To address this issue, we use the orphan mechanism. When we start
      enabling verity on a file, we also add an orphan item for that inode.
      When we are finished, we delete the orphan. However, if we are
      interrupted midway, the orphan will be present at mount and we can
      clean up the half-formed verity state.
      
      There is a possible race with a normal unlink operation: if unlink and
      verity run on the same file in parallel, it is possible for verity to
      succeed and delete the still legitimate orphan added by unlink. Then, if
      we are interrupted and mount in that state, we will never clean up the
      inode properly. This is also possible for a file created with O_TMPFILE.
      Check nlink==0 before deleting to avoid this race.
      
      A final thing to note is that this is a resurrection of using orphans to
      signal an operation besides "delete this inode". The old case was to
      signal the need to do a truncate. That case still technically applies
      for mounting very old file systems, so we need to take some care to not
      clobber it. To that end, we just have to be careful that verity orphan
      cleanup is a no-op for non-verity files.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: initial fsverity support · 14605409
      Boris Burkov authored
      Add support for fsverity in btrfs. To support the generic interface in
      fs/verity, we add two new item types in the fs tree for inodes with
      verity enabled. One stores the per-file verity descriptor and btrfs
      verity item and the other stores the Merkle tree data itself.
      
      Verity checking is done in end_page_read just before a page is marked
      uptodate. This naturally handles a variety of edge cases like holes,
      preallocated extents, and inline extents. Some care needs to be taken to
      not try to verify pages past the end of the file, which are accessed by
      the generic buffered file reading code under some circumstances like
      reading to the end of the last page and trying to read again. Direct IO
      on a verity file falls back to buffered reads.
      
      Verity relies on PageChecked for the Merkle tree data itself to avoid
      re-walking up shared paths in the tree. For this reason, we need to
      cache the Merkle tree data. Since the file is immutable after verity is
      turned on, we can cache it at an index past EOF.
      
      Use the new inode ro_flags to store verity on the inode item, so that we
      can enable verity on a file, then roll back to an older kernel and still
      mount the file system and read the file. Since we can't safely write the
      file anymore without ruining the invariants of the Merkle tree, we mark
      a ro_compat flag on the file system when a file has verity enabled.
      Acked-by: Eric Biggers <ebiggers@google.com>
      Co-developed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add ro compat flags to inodes · 77eea05e
      Boris Burkov authored
      Currently, inode flags are fully backwards incompatible in btrfs. If we
      introduce a new inode flag, then tree-checker will detect it and fail.
      This can even cause us to fail to mount entirely. To make it possible to
      introduce new flags which can be read-only compatible, like VERITY, we
      add new ro flags to btrfs without treating them quite so harshly in
      tree-checker. A read-only file system can survive an unexpected flag,
      and can be mounted.
      
      As for the implementation, it unfortunately gets a little complicated.
      
      The on-disk representation of the inode, btrfs_inode_item, has an __le64
      for flags but the in-memory representation, btrfs_inode, uses a u32.
      David Sterba had the nice idea that we could reclaim those wasted 32 bits
      on disk and use them for the new ro_compat flags.
      
      It turns out that the tree-checker code which checks for unknown flags
      is broken, and ignores the upper 32 bits we are hoping to use. The issue
      is that the flags use the literal 1 rather than 1ULL, so the flags are
      signed ints, and one of them is specifically (1 << 31). As a result, the
      mask which ORs the flags is a negative integer on machines where int is
      32 bit twos complement. When tree-checker evaluates the expression:
      
        btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)
      
      The mask is something like 0x80000abc, which gets promoted to u64 with
      sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
      all the upper bits zeroed, and we can't detect unexpected flags.
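
      The sign extension effect can be reproduced in a few lines of userspace C.
      The function names are hypothetical, and the mask value 0x80000abc is the
      illustrative one from the text, not the real flag mask.

```c
#include <stdint.h>

/*
 * Returns the bits of on_disk_flags that a checker would report as
 * unknown when the mask of known flags is held in a signed 32 bit
 * integer. Converting a negative mask to 64 bits sign-extends it, so
 * every upper bit looks "known" and is cleared by the negation.
 */
uint64_t unknown_bits_signed_mask(uint64_t on_disk_flags, int32_t known_mask)
{
	return on_disk_flags & ~(uint64_t)known_mask;
}

/* Same check with an unsigned mask: the upper 32 bits stay zero. */
uint64_t unknown_bits_unsigned_mask(uint64_t on_disk_flags, uint32_t known_mask)
{
	return on_disk_flags & ~(uint64_t)known_mask;
}
```

      With a flag at bit 32 and the mask 0x80000abc, the signed variant reports
      nothing, while the unsigned variant reports bit 32 as unknown.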
      
      This suggests that we can't use those bits after all. Luckily, we have
      good reason to believe that they are zero anyway. Inode flags are
      metadata, which is always checksummed, so any bit flips that would
      introduce 1s would cause a checksum failure anyway (excluding the
      improbable case of the checksum getting corrupted exactly badly).
      
      Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
      inode flag should preserve its value and not set any of the upper bits
      (at least for twos complement). The only place that flag
      (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
      the root item, and indeed for that inode we see 0xffffffff80000000 as
      the flags on disk. However, that inode is never seen by tree checker,
      nor is it used in a context where verity might be meaningful.
      Theoretically, a future ro flag might cause trouble on that inode, so we
      should proactively clean up that mess before it does.
      
      With the introduction of the new ro flags, keep two separate unsigned
      masks and check them against the appropriate u32. Since we no longer run
      afoul of sign extension, this also stops writing out 0xffffffff80000000
      in root_item inodes going forward.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify return values in btrfs_check_raid_min_devices · efc222f8
      Anand Jain authored
      Function btrfs_check_raid_min_devices() returns an error code from the enum
      btrfs_err_code, which starts from 1, so there is no need to check if ret
      is > 0. Drop this check and also drop the local variable ret.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the dead comment in writepage_delalloc() · 7361b4ae
      Qu Wenruo authored
      When btrfs_run_delalloc_range() fails, we error out.
      
      But there is a strange comment mentioning that
      btrfs_run_delalloc_range() could have returned value >0 to indicate the
      IO has already started.
      
      Commit 40f76580 ("Btrfs: split up __extent_writepage to lower stack
      usage") introduced the comment, but unfortunately at that time, we were
      already using @page_started to indicate that case, and still return 0.
      
      Furthermore, even if that comment was right (which it is not), we would
      return -EIO if the IO had already started.
      
      Either way the comment is incorrect, so just remove it along
      with the dead check.
      
      Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to
      make sure we either return 0 or error, no positive return value.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow degenerate raid0/raid10 · b2f78e88
      David Sterba authored
      The data on raid0 and raid10 are supposed to be spread over multiple
      devices, so the minimum constraints are set to 2 and 4 respectively.
      This is an artificial limit and there's some interest in removing it.
      
      Change this to allow raid0 on one device and raid10 on two devices. This
      works as expected eg. when converting or removing devices.
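
      The change in minimum device counts can be summarized in a small table
      lookup. The arrays and the helper profile_allowed are illustrative only
      and do not mirror the kernel's data structures.

```c
#include <stdbool.h>

enum raid_profile { RAID0, RAID10 };

/* Minimum device counts before and after the change described above. */
const int devs_min_old[] = { [RAID0] = 2, [RAID10] = 4 };
const int devs_min_new[] = { [RAID0] = 1, [RAID10] = 2 };

/* True when num_devices satisfies the minimum for the given profile. */
bool profile_allowed(const int *devs_min, enum raid_profile p, int num_devices)
{
	return num_devices >= devs_min[p];
}
```

      Under the new table, raid0 on one device and raid10 on two devices pass
      the check, while raid10 on a single device is still rejected.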
      
      The only difference is when raid0 on two devices gets one device
      removed. Unpatched would silently create a single profile, while newly
      it would be raid0.
      
      The motivation is to preserve the profile type for as long as
      possible during some intermediate state (device removal, conversion), or
      when there are disks of different size, with raid0 the otherwise
      unusable space of the last device will be used too. Similarly for
      raid10, though the two largest devices would need to be the same.
      
      Unpatched kernel will mount and use the degenerate profiles just fine
      but won't allow any operation that would not satisfy the stricter device
      number constraints, eg. not allowing to go from 3 to 2 devices for
      raid10 or various profile conversions.
      
      Example output:
      
        # btrfs fi us -T .
        Overall:
            Device size:                  10.00GiB
            Device allocated:              1.01GiB
            Device unallocated:            8.99GiB
            Device missing:                  0.00B
            Used:                        200.61MiB
            Free (estimated):              9.79GiB      (min: 9.79GiB)
            Free (statfs, df):             9.79GiB
            Data ratio:                       1.00
            Metadata ratio:                   1.00
            Global reserve:                3.25MiB      (used: 0.00B)
            Multiple profiles:                  no
      
      		Data      Metadata  System
        Id Path       RAID0     single    single   Unallocated
        -- ---------- --------- --------- -------- -----------
         1 /dev/sda10   1.00GiB   8.00MiB  1.00MiB     8.99GiB
        -- ---------- --------- --------- -------- -----------
           Total        1.00GiB   8.00MiB  1.00MiB     8.99GiB
           Used       200.25MiB 352.00KiB 16.00KiB
      
        # btrfs dev us .
        /dev/sda10, ID: 1
           Device size:            10.00GiB
           Device slack:              0.00B
           Data,RAID0/1:            1.00GiB
           Metadata,single:         8.00MiB
           System,single:           1.00MiB
           Unallocated:             8.99GiB
      
      Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
      profile is printed.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not pin logs too early during renames · bd54f381
      Filipe Manana authored
      During renames we pin the logs of the roots a bit too early, before the
      calls to btrfs_insert_inode_ref(). We can pin the logs after those calls,
      since those will not change anything in a log tree.
      
      In a scenario where we have multiple and diverse filesystem operations
      running in parallel, those calls can take a significant amount of time,
      due to lock contention on extent buffers, and delay log commits from other
      tasks for longer than necessary.
      
      So just pin logs after calls to btrfs_insert_inode_ref() and right before
      the first operation that can update a log tree.
      
      The following script that uses dbench was used for testing:
      
        $ cat dbench-test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-m single -d single"
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 120 16
      
        umount $MNT
      
      The tests were run on a machine with 12 cores, 64G of RAM, an NVMe device
      and using a non-debug kernel config (Debian's default config).
      
      The results compare a branch without this patch and without the previous
      patch in the series, that has the subject:
      
       "btrfs: eliminate some false positives when checking if inode was logged"
      
      Versus the same branch with these two patches applied.
      
      dbench with 8 clients, results before:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4391359     0.009   249.745
       Close        3225882     0.001     3.243
       Rename        185953     0.065   240.643
       Unlink        886669     0.049   249.906
       Deltree          112     2.455   217.433
       Mkdir             56     0.002     0.004
       Qpathinfo    3980281     0.004     3.109
       Qfileinfo     697579     0.001     0.187
       Qfsinfo       729780     0.002     2.424
       Sfileinfo     357764     0.004     1.415
       Find         1538861     0.016     4.863
       WriteX       2189666     0.010     3.327
       ReadX        6883443     0.002     0.729
       LockX          14298     0.002     0.073
       UnlockX        14298     0.001     0.042
       Flush         307777     2.447   303.663
      
      Throughput 1149.6 MB/sec  8 clients  8 procs  max_latency=303.666 ms
      
      dbench with 8 clients, results after:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4269920     0.009   213.532
       Close        3136653     0.001     0.690
       Rename        180805     0.082   213.858
       Unlink        862189     0.050   172.893
       Deltree          112     2.998   218.328
       Mkdir             56     0.002     0.003
       Qpathinfo    3870158     0.004     5.072
       Qfileinfo     678375     0.001     0.194
       Qfsinfo       709604     0.002     0.485
       Sfileinfo     347850     0.004     1.304
       Find         1496310     0.017     5.504
       WriteX       2129613     0.010     2.882
       ReadX        6693066     0.002     1.517
       LockX          13902     0.002     0.075
       UnlockX        13902     0.001     0.055
       Flush         299276     2.511   220.189
      
      Throughput 1187.33 MB/sec  8 clients  8 procs  max_latency=220.194 ms
      
      +3.2% throughput, -31.8% max latency
      
      dbench with 16 clients, results before:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5978334     0.028   156.507
       Close        4391598     0.001     1.345
       Rename        253136     0.241   155.057
       Unlink       1207220     0.182   257.344
       Deltree          160     6.123    36.277
       Mkdir             80     0.003     0.005
       Qpathinfo    5418817     0.012     6.867
       Qfileinfo     949929     0.001     0.941
       Qfsinfo       993560     0.002     1.386
       Sfileinfo     486904     0.004     2.829
       Find         2095088     0.059     8.164
       WriteX       2982319     0.017     9.029
       ReadX        9371484     0.002     4.052
       LockX          19470     0.002     0.461
       UnlockX        19470     0.001     0.990
       Flush         418936     2.740   347.902
      
      Throughput 1495.31 MB/sec  16 clients  16 procs  max_latency=347.909 ms
      
      dbench with 16 clients, results after:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5711833     0.029   131.240
       Close        4195897     0.001     1.732
       Rename        241849     0.204   147.831
       Unlink       1153341     0.184   231.322
       Deltree          160     6.086    30.198
       Mkdir             80     0.003     0.021
       Qpathinfo    5177011     0.012     7.150
       Qfileinfo     907768     0.001     0.793
       Qfsinfo       949205     0.002     1.431
       Sfileinfo     465317     0.004     2.454
       Find         2001541     0.058     7.819
       WriteX       2850661     0.017     9.110
       ReadX        8952289     0.002     3.991
       LockX          18596     0.002     0.655
       UnlockX        18596     0.001     0.179
       Flush         400342     2.879   293.607
      
      Throughput 1565.73 MB/sec  16 clients  16 procs  max_latency=293.611 ms
      
      +4.6% throughput, -16.9% max latency
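
The two summary lines appear to be consistent with a symmetric percentage difference, that is, the change divided by the mean of the before and after values, rather than the change relative to the "before" run. A minimal Python sketch of that arithmetic, using only the Throughput and max_latency figures reported above (the helper names are made up for illustration, and the choice of formula is an inference from the numbers, not something the commit states):

```python
def symmetric_pct_change(before: float, after: float) -> float:
    """Change expressed as a percentage of the mean of the two values."""
    return (after - before) / ((before + after) / 2) * 100


def baseline_pct_change(before: float, after: float) -> float:
    """Change expressed as a percentage of the 'before' value, for comparison."""
    return (after - before) / before * 100


# (before, after) pairs copied from the dbench summaries above.
runs = {
    "8 clients": {"throughput": (1149.6, 1187.33),
                  "max_latency": (303.666, 220.194)},
    "16 clients": {"throughput": (1495.31, 1565.73),
                   "max_latency": (347.909, 293.611)},
}

for name, metrics in runs.items():
    for metric, (before, after) in metrics.items():
        print(f"{name} {metric}: "
              f"symmetric {symmetric_pct_change(before, after):+.2f}%, "
              f"relative to before {baseline_pct_change(before, after):+.2f}%")
```

Under this reading the reported figures match to within about 0.1 percentage point; measured against the "before" run alone, the latency reductions come out smaller.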
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bd54f381
    • btrfs: eliminate some false positives when checking if inode was logged · 6e8e777d
      Filipe Manana authored
      When checking if an inode was previously logged in the current transaction
      through the helper inode_logged(), we can return some false positives that
      can be easily eliminated. These correspond to the cases where an inode has
      a ->logged_trans value that is not zero and its value is smaller than the
      ID of the current transaction. This means we know exactly that the inode
      was never logged before in the current transaction, so we can return false
      and spare the callers the following extra work:
      
      1) Having btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log()
         unnecessarily join a log transaction and do deletion searches in a log
         tree that will not find anything. This just adds unnecessary contention
         on extent buffer locks;
      
      2) Having btrfs_log_new_name() unnecessarily log an inode when it is not
         needed. If the inode was not logged before, we don't need to log it in
         LOG_INODE_EXISTS mode.
      
      So just make sure that any false positive only happens when ->logged_trans
      has a value of 0.
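
The decision described above can be modelled in a few lines. This is a Python sketch of the logic only, not the kernel helper: the parameter names are invented, and `conservative_guess` is a hypothetical stand-in for whatever the helper falls back to when ->logged_trans is 0 and it cannot tell.

```python
def inode_logged(logged_trans: int, current_transid: int,
                 conservative_guess: bool = True) -> bool:
    """Model of the check described in the commit message.

    logged_trans: ID of the transaction in which the inode was last logged,
        or 0 when that is unknown.
    current_transid: ID of the current transaction.
    conservative_guess: hypothetical fallback for the unknown case; this is
        the only place a false positive can still come from.
    """
    if logged_trans == current_transid:
        # Logged in the current transaction.
        return True
    if 0 < logged_trans < current_transid:
        # Logged only in an older transaction, so definitely not in this one.
        return False
    # logged_trans == 0: nothing is known, so stay conservative.
    return conservative_guess


for logged in (0, 5, 7):
    print(logged, inode_logged(logged, current_transid=7))
```

With the second branch in place, callers such as the log deletion helpers would only do their extra work when the inode was logged in this transaction or when nothing is known about it.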
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6e8e777d
    • btrfs: drop unnecessary ASSERT from btrfs_submit_direct() · 42b5d73b
      Naohiro Aota authored
      When on SINGLE block group, btrfs_get_io_geometry() will return "the
      size of the block group - the offset of the logical address within the
      block group" as geom.len. Since we allow up to 8 GiB zone size on zoned
      filesystem, we can have up to 8 GiB block group, so can have up to 8 GiB
      geom.len as well. With this setup, we easily hit the "ASSERT(geom.len <=
      INT_MAX);".
      
      The ASSERT appears to guard btrfs_bio_clone_partial() and bio_trim(),
      which both took "int" (now u64 due to the previous patch). So to be
      precise the ASSERT should check if clone_len <= UINT_MAX. But actually,
      clone_len is already capped by bio.bi_iter.bi_size which is unsigned
      int. So the ASSERT is not necessary.
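
The size argument can be checked with plain integer arithmetic. The Python sketch below assumes that clone_len is the minimum of submit_len and geom.len, which is how the comparison is described here; the function names are made up for illustration and this is not the kernel code.

```python
INT_MAX = 2**31 - 1    # largest value of a 32-bit C int
UINT_MAX = 2**32 - 1   # largest value of a 32-bit C unsigned int
GIB = 1024**3


def old_assert_holds(geom_len: int) -> bool:
    """The condition the dropped ASSERT checked: geom.len <= INT_MAX."""
    return geom_len <= INT_MAX


def clone_len(submit_len: int, geom_len: int) -> int:
    """Assumed definition: the smaller of the remaining bytes and geom.len."""
    return min(submit_len, geom_len)


# A SINGLE block group as large as an 8 GiB zone.
zone_sized_geom_len = 8 * GIB
print(old_assert_holds(zone_sized_geom_len))

# submit_len starts from bio.bi_iter.bi_size, an unsigned int, so even in
# the worst case the clone length stays within unsigned int range.
print(clone_len(UINT_MAX, zone_sized_geom_len) <= UINT_MAX)
```

The first check fails for an 8 GiB geom.len while the second always holds, which is the reason the ASSERT can go.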
      
      Drop the ASSERT and properly compare submit_len and geom.len in u64.
      Then, let the implicit cast convert it to u64.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      42b5d73b
    • btrfs: fix argument type of btrfs_bio_clone_partial() · 21dda654
      Chaitanya Kulkarni authored
      The offset and size can never be negative, so use unsigned int instead
      of int for them.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      21dda654
    • block: fix argument type of bio_trim() · e83502ca
      Chaitanya Kulkarni authored
      The function bio_trim has offset and size arguments that are declared
      as int.
      
      The callers of this function use sector_t type when passing the offset
      and size, e.g. drivers/md/raid1.c:narrow_write_error().
      
      Change offset and size arguments to sector_t type for bio_trim(). Also,
      add WARN_ON_ONCE() to catch their overflow.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e83502ca
    • fs: kill sync_inode · 5662c967
      Josef Bacik authored
      Now that all users of sync_inode() have been deleted, remove
      sync_inode().
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5662c967
    • 9p: migrate from sync_inode to filemap_fdatawrite_wbc · 25d23cd0
      Josef Bacik authored
      We're going to remove sync_inode, so migrate to filemap_fdatawrite_wbc
      instead.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      25d23cd0
    • btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking · b3776305
      Josef Bacik authored
      sync_inode() has some holes that can cause problems if we're under heavy
      ENOSPC pressure.  If there's writeback running on a separate thread
      sync_inode() will skip writing the inode altogether.  What we really
      want is to make sure writeback has been started on all the pages to make
      sure we can see the ordered extents and wait on them if appropriate.
      Switch to this new helper which will allow us to accomplish this and
      avoid ENOSPC'ing early.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b3776305
    • fs: add a filemap_fdatawrite_wbc helper · 5a798493
      Josef Bacik authored
      Btrfs sometimes needs to flush dirty pages on a bunch of dirty inodes in
      order to reclaim metadata reservations.  Unfortunately most helpers in
      this area are too smart for us:
      
      1) The normal filemap_fdata* helpers only take range and sync modes, and
         don't give any indication of how much was written, so we can only
         flush full inodes, which isn't what we want in most cases.
      2) The normal writeback path requires us to have the s_umount sem held,
         but we can't unconditionally take it in this path because we could
         deadlock.
      3) The normal writeback path also skips inodes with I_SYNC set if we
         write with WB_SYNC_NONE.  This isn't the behavior we want under heavy
         ENOSPC pressure; we want to actually make sure the pages are under
         writeback before returning, and if another thread is in the middle of
         writing the file we may return before they're under writeback and
         miss our ordered extents and not properly wait for completion.
      4) sync_inode() uses the normal writeback path and has the same problem
         as #3.
      
      What we really want is to call do_writepages() with our wbc.  This way
      we can make sure that writeback is actually started on the pages, and we
      can control how many pages are written as a whole as we write many
      inodes using the same wbc.  Accomplish this with a new helper that does
      just that so we can use it for our ENOSPC flushing infrastructure.
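
The idea of one wbc shared across many inodes can be illustrated with a toy model. The Python sketch below is only an analogy under stated assumptions: `Wbc`, `Inode`, `fdatawrite_wbc` and `flush_inodes` are invented names, the budget is a bare page count, and none of this reflects the actual kernel data structures.

```python
from dataclasses import dataclass


@dataclass
class Wbc:
    """Toy stand-in for a writeback control: just a remaining page budget."""
    nr_to_write: int


@dataclass
class Inode:
    name: str
    dirty_pages: int


def fdatawrite_wbc(inode: Inode, wbc: Wbc) -> int:
    """Start writeback on one inode, charging pages to the shared budget.

    Returns how many pages were written for this inode.
    """
    written = min(inode.dirty_pages, max(wbc.nr_to_write, 0))
    inode.dirty_pages -= written
    wbc.nr_to_write -= written
    return written


def flush_inodes(inodes: list, nr_pages: int) -> int:
    """Flush inodes in order until the shared page budget is used up."""
    wbc = Wbc(nr_to_write=nr_pages)
    total = 0
    for inode in inodes:
        if wbc.nr_to_write <= 0:
            break
        total += fdatawrite_wbc(inode, wbc)
    return total


inodes = [Inode("a", 10), Inode("b", 10), Inode("c", 10)]
print(flush_inodes(inodes, nr_pages=15), [i.dirty_pages for i in inodes])
```

Because the budget lives in the shared object rather than in each call, the caller can stop as soon as enough pages have been written in total, instead of flushing every inode in full.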
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5a798493
    • btrfs: wait on async extents when flushing delalloc · e1646070
      Josef Bacik authored
      I've been debugging an early ENOSPC problem in production and finally
      root caused it to this problem.  When we switched to per-inode flushing
      in 38d715f4 ("btrfs: use btrfs_start_delalloc_roots in
      shrink_delalloc") I pulled out the async extent handling, because we
      were doing the correct thing by calling filemap_flush() if we had async
      extents set.  This would properly wait on any async extents by locking
      the page in the second flush, thus making sure our ordered extents were
      properly set up.
      
      However when I switched us back to page based flushing, I used
      sync_inode(), which allows us to pass in our own wbc.  The problem here
      is that sync_inode() is smarter than the filemap_* helpers: it tries to
      avoid calling writepages at all.  This means that our second call could
      skip calling do_writepages altogether, and thus not wait on the pagelock
      for the async helpers.  This means we could come back before any ordered
      extents were created and then simply continue on in our flushing
      mechanisms and ENOSPC out when we have plenty of space to use.
      
      Fix this by putting back the async pages logic in shrink_delalloc.  This
      allows us to bulk write out everything that we need to, and then we can
      wait in one place for the async helpers to catch up, and then wait on
      any ordered extents that are created.
      
      Fixes: e076ab2a ("btrfs: shrink delalloc pages instead of full inodes")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e1646070
    • btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc · 03fe78cc
      Josef Bacik authored
      We have been hitting some early ENOSPC issues in production with more
      recent kernels, and I tracked it down to us simply not flushing delalloc
      as aggressively as we should be.  With tracing I was seeing us failing
      all tickets with all of the block rsvs at or around 0, with very little
      pinned space, but still around 120MiB of outstanding bytes_may_use.
      Upon further investigation I saw that we were flushing around 14 pages
      per shrink call for delalloc, despite having around 2GiB of delalloc
      outstanding.
      
      Consider the example of an 8-way machine, all CPUs trying to create a
      file in parallel, which at the time of this commit requires 5 items to
      do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
      size waiting on reservations.  Now assume we have 128MiB of delalloc
      outstanding.  With our current math we would set items to 20, and then
      set to_reclaim to 20 * 256k, or 5MiB.
      
      Assuming that we went through this loop all 3 times, for both
      FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
      twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
      leave a fair bit of delalloc reservations still hanging around by the
      time we go to ENOSPC out all the remaining tickets.
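
The arithmetic in this example can be replayed directly. The Python sketch below takes the item count of 20, the 256k multiplier and the loop counts straight from the text; the per-item reservation (leaf size times 8 tree levels times 2) is an assumption chosen because it reproduces the stated 10MiB, not something the message spells out.

```python
KIB = 1024
MIB = 1024 * KIB

LEAF_SIZE = 16 * KIB
# Assumption: one item reserves leaf size x 8 tree levels x 2 = 256KiB.
PER_ITEM_RESERVATION = LEAF_SIZE * 8 * 2

cpus = 8
items_per_create = 5
metadata_waiting = cpus * items_per_create * PER_ITEM_RESERVATION

# Figures taken directly from the text.
items = 20
to_reclaim = items * 256 * KIB

loops_per_state = 3
flush_states = 2   # FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT
full_passes = 2
total_flushed = to_reclaim * loops_per_state * flush_states * full_passes

delalloc_outstanding = 128 * MIB
left_unflushed = delalloc_outstanding - total_flushed

print(f"metadata waiting: {metadata_waiting // MIB} MiB")
print(f"to_reclaim per call: {to_reclaim // MIB} MiB")
print(f"flushed in total: {total_flushed // MIB} MiB")
print(f"delalloc left: {left_unflushed // MIB} MiB")
```

Even in this best case more than half of the outstanding delalloc is never flushed before tickets start failing.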
      
      Fix this in two ways.  First, change the calculations to be a fraction of
      the total delalloc bytes on the system.  Prior to this change we were
      calculating based on dirty inodes so our math made more sense; now it's
      just completely unrelated to what we're actually doing.
      
      Second, add a FLUSH_DELALLOC_FULL state that we hold off on until we've
      gone through the flush states at least once.  This will empty the system
      of all delalloc so we're sure to be truly out of space when we start
      failing tickets.
      
      I'm tagging stable 5.10 and forward, because this is where we started
      using the page stuff heavily again.  This affects earlier kernel
      versions as well, but would be a pain to backport to them as the
      flushing mechanisms aren't the same.
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      03fe78cc
    • btrfs: enable a tracepoint when we fail tickets · fcdef39c
      Josef Bacik authored
      When debugging early ENOSPC problems it was useful to have a tracepoint
      where we failed all tickets so I could check the state of the ENOSPC
      counters at failure time to validate my fixes.  This adds the tracepoint
      so you can easily get that information.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fcdef39c
    • btrfs: include delalloc related info in dump space info tracepoint · 8197766d
      Josef Bacik authored
      In order to debug delalloc flushing issues I added delalloc_bytes and
      ordered_bytes to this tracepoint to see if they were non-zero when we
      were going ENOSPC. This was valuable for me and showed me cases where we
      weren't waiting on ordered extents properly. In order to add this to the
      tracepoint we need to take away the const modifier for fs_info, as
      percpu_counter_sum_positive() will change the counter when it adds up
      the percpu buckets.  This is needed to make sure we're getting accurate
      information at these tracepoints, as the wrong information could send us
      down the wrong path when debugging problems.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8197766d
    • btrfs: wake up async_delalloc_pages waiters after submit · ac98141d
      Josef Bacik authored
      We use the async_delalloc_pages mechanism to make sure that we've
      completed our async work before trying to continue our delalloc
      flushing.  The reason for this is we need to see any ordered extents
      that were created by our delalloc flushing.  However we're waking up
      before we do the submit work, which is before we create the ordered
      extents.  This is a pretty wide race window where we could potentially
      think there are no ordered extents and thus exit shrink_delalloc
      prematurely.  Fix this by waking us up after we've done the work to
      create ordered extents.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ac98141d