1. 22 Feb, 2024 4 commits
    • Filipe Manana's avatar
      btrfs: fix data races when accessing the reserved amount of block reserves · e06cc894
      Filipe Manana authored
      At space_info.c we have several places where we access the ->reserved
      field of a block reserve without taking the block reserve's spinlock
      first, which makes KCSAN warn about a data race since that field is
      always updated while holding the spinlock.
      
      The reports from KCSAN are like the following:
      
        [117.193526] BUG: KCSAN: data-race in btrfs_block_rsv_release [btrfs] / need_preemptive_reclaim [btrfs]
      
        [117.195148] read to 0x000000017f587190 of 8 bytes by task 6303 on cpu 3:
        [117.195172]  need_preemptive_reclaim+0x222/0x2f0 [btrfs]
        [117.195992]  __reserve_bytes+0xbb0/0xdc8 [btrfs]
        [117.196807]  btrfs_reserve_metadata_bytes+0x4c/0x120 [btrfs]
        [117.197620]  btrfs_block_rsv_add+0x78/0xa8 [btrfs]
        [117.198434]  btrfs_delayed_update_inode+0x154/0x368 [btrfs]
        [117.199300]  btrfs_update_inode+0x108/0x1c8 [btrfs]
        [117.200122]  btrfs_dirty_inode+0xb4/0x140 [btrfs]
        [117.200937]  btrfs_update_time+0x8c/0xb0 [btrfs]
        [117.201754]  touch_atime+0x16c/0x1e0
        [117.201789]  filemap_read+0x674/0x728
        [117.201823]  btrfs_file_read_iter+0xf8/0x410 [btrfs]
        [117.202653]  vfs_read+0x2b6/0x498
        [117.203454]  ksys_read+0xa2/0x150
        [117.203473]  __s390x_sys_read+0x68/0x88
        [117.203495]  do_syscall+0x1c6/0x210
        [117.203517]  __do_syscall+0xc8/0xf0
        [117.203539]  system_call+0x70/0x98
      
        [117.203579] write to 0x000000017f587190 of 8 bytes by task 11 on cpu 0:
        [117.203604]  btrfs_block_rsv_release+0x2e8/0x578 [btrfs]
        [117.204432]  btrfs_delayed_inode_release_metadata+0x7c/0x1d0 [btrfs]
        [117.205259]  __btrfs_update_delayed_inode+0x37c/0x5e0 [btrfs]
        [117.206093]  btrfs_async_run_delayed_root+0x356/0x498 [btrfs]
        [117.206917]  btrfs_work_helper+0x160/0x7a0 [btrfs]
        [117.207738]  process_one_work+0x3b6/0x838
        [117.207768]  worker_thread+0x75e/0xb10
        [117.207797]  kthread+0x21a/0x230
        [117.207830]  __ret_from_fork+0x6c/0xb8
        [117.207861]  ret_from_fork+0xa/0x30
      
      So add a helper to get the reserved amount of a block reserve while
      holding the lock. The value may be not be up to date anymore when used by
      need_preemptive_reclaim() and btrfs_preempt_reclaim_metadata_space(), but
      that's ok since the worst it can do is cause more reclaim work do be done
      sooner rather than later. Reading the field while holding the lock instead
      of using the data_race() annotation is used in order to prevent load
      tearing.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e06cc894
    • Filipe Manana's avatar
      btrfs: send: don't issue unnecessary zero writes for trailing hole · 5897710b
      Filipe Manana authored
      If we have a sparse file with a trailing hole (from the last extent's end
      to i_size) and then create an extent in the file that ends before the
      file's i_size, then when doing an incremental send we will issue a write
      full of zeroes for the range that starts immediately after the new extent
      ends up to i_size. While this isn't incorrect because the file ends up
      with exactly the same data, it unnecessarily results in using extra space
      at the destination with one or more extents full of zeroes instead of
      having a hole. In same cases this results in using megabytes or even
      gigabytes of unnecessary space.
      
      Example, reproducer:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdh
         MNT=/mnt/sdh
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         # Create 1G sparse file.
         xfs_io -f -c "truncate 1G" $MNT/foobar
      
         # Create base snapshot.
         btrfs subvolume snapshot -r $MNT $MNT/mysnap1
      
         # Create send stream (full send) for the base snapshot.
         btrfs send -f /tmp/1.snap $MNT/mysnap1
      
         # Now write one extent at the beginning of the file and one somewhere
         # in the middle, leaving a gap between the end of this second extent
         # and the file's size.
         xfs_io -c "pwrite -S 0xab 0 128K" \
                -c "pwrite -S 0xcd 512M 128K" \
                $MNT/foobar
      
         # Now create a second snapshot which is going to be used for an
         # incremental send operation.
         btrfs subvolume snapshot -r $MNT $MNT/mysnap2
      
         # Create send stream (incremental send) for the second snapshot.
         btrfs send -p $MNT/mysnap1 -f /tmp/2.snap $MNT/mysnap2
      
         # Now recreate the filesystem by receiving both send streams and
         # verify we get the same content that the original filesystem had
         # and file foobar has only two extents with a size of 128K each.
         umount $MNT
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         btrfs receive -f /tmp/1.snap $MNT
         btrfs receive -f /tmp/2.snap $MNT
      
         echo -e "\nFile fiemap in the second snapshot:"
         # Should have:
         #
         # 128K extent at file range [0, 128K[
         # hole at file range [128K, 512M[
         # 128K extent file range [512M, 512M + 128K[
         # hole at file range [512M + 128K, 1G[
         xfs_io -r -c "fiemap -v" $MNT/mysnap2/foobar
      
         # File should be using 256K of data (two 128K extents).
         echo -e "\nSpace used by the file: $(du -h $MNT/mysnap2/foobar | cut -f 1)"
      
         umount $MNT
      
      Running the test, we can see with fiemap that we get an extent for the
      range [512M, 1G[, while in the source filesystem we have an extent for
      the range [512M, 512M + 128K[ and a hole for the rest of the file (the
      range [512M + 128K, 1G[):
      
         $ ./test.sh
         (...)
         File fiemap in the second snapshot:
         /mnt/sdh/mysnap2/foobar:
          EXT: FILE-OFFSET        BLOCK-RANGE        TOTAL FLAGS
            0: [0..255]:          26624..26879         256   0x0
            1: [256..1048575]:    hole             1048320
            2: [1048576..2097151]: 2156544..3205119 1048576   0x1
      
         Space used by the file: 513M
      
      This happens because once we finish processing an inode, at
      finish_inode_if_needed(), we always issue a hole (write operations full
      of zeros) if there's a gap between the end of the last processed extent
      and the file's size, even if that range is already a hole in the parent
      snapshot. Fix this by issuing the hole only if the range is not already
      a hole.
      
      After this change, running the test above, we get the expected layout:
      
         $ ./test.sh
         (...)
         File fiemap in the second snapshot:
         /mnt/sdh/mysnap2/foobar:
          EXT: FILE-OFFSET        BLOCK-RANGE      TOTAL FLAGS
            0: [0..255]:          26624..26879       256   0x0
            1: [256..1048575]:    hole             1048320
            2: [1048576..1048831]: 26880..27135       256   0x1
            3: [1048832..2097151]: hole             1048320
      
         Space used by the file: 256K
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 6.1+
      Reported-by: default avatarDorai Ashok S A <dash.btrfs@inix.me>
      Link: https://lore.kernel.org/linux-btrfs/c0bf7818-9c45-46a8-b3d3-513230d0c86e@inix.me/Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5897710b
    • David Sterba's avatar
      btrfs: dev-replace: properly validate device names · 9845664b
      David Sterba authored
      There's a syzbot report that device name buffers passed to device
      replace are not properly checked for string termination which could lead
      to a read out of bounds in getname_kernel().
      
      Add a helper that validates both source and target device name buffers.
      For devid as the source initialize the buffer to empty string in case
      something tries to read it later.
      
      This was originally analyzed and fixed in a different way by Edward Adam
      Davis (see links).
      
      Link: https://lore.kernel.org/linux-btrfs/000000000000d1a1d1060cc9c5e7@google.com/
      Link: https://lore.kernel.org/linux-btrfs/tencent_44CA0665C9836EF9EEC80CB9E7E206DF5206@qq.com/
      CC: stable@vger.kernel.org # 4.19+
      CC: Edward Adam Davis <eadavis@qq.com>
      Reported-and-tested-by: syzbot+33f23b49ac24f986c9e8@syzkaller.appspotmail.com
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9845664b
    • Johannes Thumshirn's avatar
      btrfs: zoned: don't skip block group profile checks on conventional zones · 5906333c
      Johannes Thumshirn authored
      On a zoned filesystem with conventional zones, we're skipping the block
      group profile checks for the conventional zones.
      
      This allows converting a zoned filesystem's data block groups to RAID when
      all of the zones backing the chunk are on conventional zones.  But this
      will lead to problems, once we're trying to allocate chunks backed by
      sequential zones.
      
      So also check for conventional zones when loading a block group's profile
      on them.
      Reported-by: default avatarHAN Yuwei <hrx@bupt.moe>
      Link: https://lore.kernel.org/all/1ACD2E3643008A17+da260584-2c7f-432a-9e22-9d390aae84cc@bupt.moe/#tReviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5906333c
  2. 19 Feb, 2024 2 commits
    • Josef Bacik's avatar
      btrfs: fix deadlock with fiemap and extent locking · b0ad381f
      Josef Bacik authored
      While working on the patchset to remove extent locking I got a lockdep
      splat with fiemap and pagefaulting with my new extent lock replacement
      lock.
      
      This deadlock exists with our normal code, we just don't have lockdep
      annotations with the extent locking so we've never noticed it.
      
      Since we're copying the fiemap extent to user space on every iteration
      we have the chance of pagefaulting.  Because we hold the extent lock for
      the entire range we could mkwrite into a range in the file that we have
      mmap'ed.  This would deadlock with the following stack trace
      
      [<0>] lock_extent+0x28d/0x2f0
      [<0>] btrfs_page_mkwrite+0x273/0x8a0
      [<0>] do_page_mkwrite+0x50/0xb0
      [<0>] do_fault+0xc1/0x7b0
      [<0>] __handle_mm_fault+0x2fa/0x460
      [<0>] handle_mm_fault+0xa4/0x330
      [<0>] do_user_addr_fault+0x1f4/0x800
      [<0>] exc_page_fault+0x7c/0x1e0
      [<0>] asm_exc_page_fault+0x26/0x30
      [<0>] rep_movs_alternative+0x33/0x70
      [<0>] _copy_to_user+0x49/0x70
      [<0>] fiemap_fill_next_extent+0xc8/0x120
      [<0>] emit_fiemap_extent+0x4d/0xa0
      [<0>] extent_fiemap+0x7f8/0xad0
      [<0>] btrfs_fiemap+0x49/0x80
      [<0>] __x64_sys_ioctl+0x3e1/0xb50
      [<0>] do_syscall_64+0x94/0x1a0
      [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
      I wrote an fstest to reproduce this deadlock without my replacement lock
      and verified that the deadlock exists with our existing locking.
      
      To fix this simply don't take the extent lock for the entire duration of
      the fiemap.  This is safe in general because we keep track of where we
      are when we're searching the tree, so if an ordered extent updates in
      the middle of our fiemap call we'll still emit the correct extents
      because we know what offset we were on before.
      
      The only place we maintain the lock is searching delalloc.  Since the
      delalloc stuff can change during writeback we want to lock the extent
      range so we have a consistent view of delalloc at the time we're
      checking to see if we need to set the delalloc flag.
      
      With this patch applied we no longer deadlock with my testcase.
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b0ad381f
    • Qu Wenruo's avatar
      btrfs: defrag: avoid unnecessary defrag caused by incorrect extent size · e42b9d8b
      Qu Wenruo authored
      [BUG]
      With the following file extent layout, defrag would do unnecessary IO
      and result more on-disk space usage.
      
        # mkfs.btrfs -f $dev
        # mount $dev $mnt
        # xfs_io -f -c "pwrite 0 40m" $mnt/foobar
        # sync
        # xfs_io -f -c "pwrite 40m 16k" $mnt/foobar
        # sync
      
      Above command would lead to the following file extent layout:
      
              item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
                      generation 7 type 1 (regular)
                      extent data disk byte 298844160 nr 41943040
                      extent data offset 0 nr 41943040 ram 41943040
                      extent compression 0 (none)
              item 7 key (257 EXTENT_DATA 41943040) itemoff 15763 itemsize 53
                      generation 8 type 1 (regular)
                      extent data disk byte 13631488 nr 16384
                      extent data offset 0 nr 16384 ram 16384
                      extent compression 0 (none)
      
      Which is mostly fine. We can allow the final 16K to be merged with the
      previous 40M, but it's upon the end users' preference.
      
      But if we defrag the file using the default parameters, it would result
      worse file layout:
      
       # btrfs filesystem defrag $mnt/foobar
       # sync
      
              item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
                      generation 7 type 1 (regular)
                      extent data disk byte 298844160 nr 41943040
                      extent data offset 0 nr 8650752 ram 41943040
                      extent compression 0 (none)
              item 7 key (257 EXTENT_DATA 8650752) itemoff 15763 itemsize 53
                      generation 9 type 1 (regular)
                      extent data disk byte 340787200 nr 33292288
                      extent data offset 0 nr 33292288 ram 33292288
                      extent compression 0 (none)
              item 8 key (257 EXTENT_DATA 41943040) itemoff 15710 itemsize 53
                      generation 8 type 1 (regular)
                      extent data disk byte 13631488 nr 16384
                      extent data offset 0 nr 16384 ram 16384
                      extent compression 0 (none)
      
      Note the original 40M extent is still there, but a new 32M extent is
      created for no benefit at all.
      
      [CAUSE]
      There is an existing check to make sure we won't defrag a large enough
      extent (the threshold is by default 32M).
      
      But the check is using the length to the end of the extent:
      
      	range_len = em->len - (cur - em->start);
      
      	/* Skip too large extent */
      	if (range_len >= extent_thresh)
      		goto next;
      
      This means, for the first 8MiB of the extent, the range_len is always
      smaller than the default threshold, and would not be defragged.
      But after the first 8MiB, the remaining part would fit the requirement,
      and be defragged.
      
      Such different behavior inside the same extent caused the above problem,
      and we should avoid different defrag decision inside the same extent.
      
      [FIX]
      Instead of using @range_len, just use @em->len, so that we have a
      consistent decision among the same file extent.
      
      Now with this fix, we won't touch the extent, thus not making it any
      worse.
      Reported-by: default avatarFilipe Manana <fdmanana@suse.com>
      Fixes: 0cb5950f ("btrfs: fix deadlock when reserving space during defrag")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e42b9d8b
  3. 13 Feb, 2024 4 commits
    • Filipe Manana's avatar
      btrfs: don't refill whole delayed refs block reserve when starting transaction · 2f6397e4
      Filipe Manana authored
      Since commit 28270e25 ("btrfs: always reserve space for delayed refs
      when starting transaction") we started not only to reserve metadata space
      for the delayed refs a caller of btrfs_start_transaction() might generate
      but also to try to fully refill the delayed refs block reserve, because
      there are several case where we generate delayed refs and haven't reserved
      space for them, relying on the global block reserve. Relying too much on
      the global block reserve is not always safe, and can result in hitting
      -ENOSPC during transaction commits or worst, in rare cases, being unable
      to mount a filesystem that needs to do orphan cleanup or anything that
      requires modifying the filesystem during mount, and has no more
      unallocated space and the metadata space is nearly full. This was
      explained in detail in that commit's change log.
      
      However the gap between the reserved amount and the size of the delayed
      refs block reserve can be huge, so attempting to reserve space for such
      a gap can result in allocating many metadata block groups that end up
      not being used. After a recent patch, with the subject:
      
        "btrfs: add new unused block groups to the list of unused block groups"
      
      We started to add new block groups that are unused to the list of unused
      block groups, to avoid having them around for a very long time in case
      they are never used, because a block group is only added to the list of
      unused block groups when we deallocate the last extent or when mounting
      the filesystem and the block group has 0 bytes used. This is not a problem
      introduced by the commit mentioned earlier, it always existed as our
      metadata space reservations are, most of the time, pessimistic and end up
      not using all the space they reserved, so we can occasionally end up with
      one or two unused metadata block groups for a long period. However after
      that commit mentioned earlier, we are just more pessimistic in the
      metadata space reservations when starting a transaction and therefore the
      issue is more likely to happen.
      
      This however is not always enough because we might create unused metadata
      block groups when reserving metadata space at a high rate if there's
      always a gap in the delayed refs block reserve and the cleaner kthread
      isn't triggered often enough or is busy with other work (running delayed
      iputs, cleaning deleted roots, etc), not to mention the block group's
      allocated space is only usable for a new block group after the transaction
      used to remove it is committed.
      
      A user reported that he's getting a lot of allocated metadata block groups
      but the usage percentage of metadata space was very low compared to the
      total allocated space, specially after running a series of block group
      relocations.
      
      So for now stop trying to refill the gap in the delayed refs block reserve
      and reserve space only for the delayed refs we are expected to generate
      when starting a transaction.
      
      CC: stable@vger.kernel.org # 6.7+
      Reported-by: default avatarIvan Shapovalov <intelfx@intelfx.name>
      Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H6802ayLHUJFztzZAVzBLJAGdFx=6FHNNy87+obZXXZpQ@mail.gmail.com/Tested-by: default avatarIvan Shapovalov <intelfx@intelfx.name>
      Reported-by: default avatarHeddxh <g311571057@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAE93xANEby6RezOD=zcofENYZOT-wpYygJyauyUAZkLv6XVFOA@mail.gmail.com/Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2f6397e4
    • Filipe Manana's avatar
      btrfs: zoned: fix chunk map leak when loading block group zone info · 88e81a67
      Filipe Manana authored
      At btrfs_load_block_group_zone_info() we never drop a reference on the
      chunk map we have looked up, therefore leaking a reference on it. So
      add the missing btrfs_free_chunk_map() at the end of the function.
      
      Fixes: 7dc66abb ("btrfs: use a dedicated data structure for chunk maps")
      Reported-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Tested-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88e81a67
    • Filipe Manana's avatar
      btrfs: reject encoded write if inode has nodatasum flag set · 1bd96c92
      Filipe Manana authored
      Currently we allow an encoded write against inodes that have the NODATASUM
      flag set, either because they are NOCOW files or they were created while
      the filesystem was mounted with "-o nodatasum". This results in having
      compressed extents without corresponding checksums, which is a filesystem
      inconsistency reported by 'btrfs check'.
      
      For example, running btrfs/281 with MOUNT_OPTIONS="-o nodatacow" triggers
      this and 'btrfs check' errors out with:
      
         [1/7] checking root items
         [2/7] checking extents
         [3/7] checking free space tree
         [4/7] checking fs roots
         root 256 inode 257 errors 1040, bad file extent, some csum missing
         root 256 inode 258 errors 1040, bad file extent, some csum missing
         ERROR: errors found in fs roots
         (...)
      
      So reject encoded writes if the target inode has NODATASUM set.
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1bd96c92
    • Filipe Manana's avatar
      btrfs: don't reserve space for checksums when writing to nocow files · feefe1f4
      Filipe Manana authored
      Currently when doing a write to a file we always reserve metadata space
      for inserting data checksums. However we don't need to do it if we have
      a nodatacow file (-o nodatacow mount option or chattr +C) or if checksums
      are disabled (-o nodatasum mount option), as in that case we are only
      adding unnecessary pressure to metadata reservations.
      
      For example on x86_64, with the default node size of 16K, a 4K buffered
      write into a nodatacow file is reserving 655360 bytes of metadata space,
      as it's accounting for checksums. After this change, which stops reserving
      space for checksums if we have a nodatacow file or checksums are disabled,
      we only need to reserve 393216 bytes of metadata.
      
      CC: stable@vger.kernel.org # 6.1+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      feefe1f4
  4. 09 Feb, 2024 4 commits
    • Filipe Manana's avatar
      btrfs: add new unused block groups to the list of unused block groups · 12c5128f
      Filipe Manana authored
      Space reservations for metadata are, most of the time, pessimistic as we
      reserve space for worst possible cases - where tree heights are at the
      maximum possible height (8), we need to COW every extent buffer in a tree
      path, need to split extent buffers, etc.
      
      For data, we generally reserve the exact amount of space we are going to
      allocate. The exception here is when using compression, in which case we
      reserve space matching the uncompressed size, as the compression only
      happens at writeback time and in the worst possible case we need that
      amount of space in case the data is not compressible.
      
      This means that when there's not available space in the corresponding
      space_info object, we may need to allocate a new block group, and then
      that block group might not be used after all. In this case the block
      group is never added to the list of unused block groups and ends up
      never being deleted - except if we unmount and mount again the fs, as
      when reading block groups from disk we add unused ones to the list of
      unused block groups (fs_info->unused_bgs). Otherwise a block group is
      only added to the list of unused block groups when we deallocate the
      last extent from it, so if no extent is ever allocated, the block group
      is kept around forever.
      
      This also means that if we have a bunch of tasks reserving space in
      parallel we can end up allocating many block groups that end up never
      being used or kept around for too long without being used, which has
      the potential to result in ENOSPC failures in case for example we over
      allocate too many metadata block groups and then end up in a state
      without enough unallocated space to allocate a new data block group.
      
      This is more likely to happen with metadata reservations as of kernel
      6.7, namely since commit 28270e25 ("btrfs: always reserve space for
      delayed refs when starting transaction"), because we started to always
      reserve space for delayed references when starting a transaction handle
      for a non-zero number of items, and also to try to reserve space to fill
      the gap between the delayed block reserve's reserved space and its size.
      
      So to avoid this, when finishing the creation a new block group, add the
      block group to the list of unused block groups if it's still unused at
      that time. This way the next time the cleaner kthread runs, it will delete
      the block group if it's still unused and not needed to satisfy existing
      space reservations.
      Reported-by: default avatarIvan Shapovalov <intelfx@intelfx.name>
      Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/
      CC: stable@vger.kernel.org # 6.7+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      12c5128f
    • Filipe Manana's avatar
      btrfs: do not delete unused block group if it may be used soon · f4a9f219
      Filipe Manana authored
      Before deleting a block group that is in the list of unused block groups
      (fs_info->unused_bgs), we check if the block group became used before
      deleting it, as extents from it may have been allocated after it was added
      to the list.
      
      However even if the block group was not yet used, there may be tasks that
      have only reserved space and have not yet allocated extents, and they
      might be relying on the availability of the unused block group in order
      to allocate extents. The reservation works first by increasing the
      "bytes_may_use" field of the corresponding space_info object (which may
      first require flushing delayed items, allocating a new block group, etc),
      and only later a task does the actual allocation of extents.
      
      For metadata we usually don't end up using all reserved space, as we are
      pessimistic and typically account for the worst cases (need to COW every
      single node in a path of a tree at maximum possible height, etc). For
      data we usually reserve the exact amount of space we're going to allocate
      later, except when using compression where we always reserve space based
      on the uncompressed size, as compression is only triggered when writeback
      starts so we don't know in advance how much space we'll actually need, or
      if the data is compressible.
      
      So don't delete an unused block group if the total size of its space_info
      object minus the block group's size is less then the sum of used space and
      space that may be used (space_info->bytes_may_use), as that means we have
      tasks that reserved space and may need to allocate extents from the block
      group. In this case, besides skipping the deletion, re-add the block group
      to the list of unused block groups so that it may be reconsidered later,
      in case the tasks that reserved space end up not needing to allocate
      extents from it.
      
      Allowing the deletion of the block group while we have reserved space, can
      result in tasks failing to allocate metadata extents (-ENOSPC) while under
      a transaction handle, resulting in a transaction abort, or failure during
      writeback for the case of data extents.
      
      CC: stable@vger.kernel.org # 6.0+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f4a9f219
    • Filipe Manana's avatar
      btrfs: add and use helper to check if block group is used · 1693d544
      Filipe Manana authored
      Add a helper function to determine if a block group is being used and make
      use of it at btrfs_delete_unused_bgs(). This helper will also be used in
      future code changes.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1693d544
    • Josef Bacik's avatar
      btrfs: don't drop extent_map for free space inode on write error · 5571e41e
      Josef Bacik authored
      While running the CI for an unrelated change I hit the following panic
      with generic/648 on btrfs_holes_spacecache.
      
      assertion failed: block_start != EXTENT_MAP_HOLE, in fs/btrfs/extent_io.c:1385
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/extent_io.c:1385!
      invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 1 PID: 2695096 Comm: fsstress Kdump: loaded Tainted: G        W          6.8.0-rc2+ #1
      RIP: 0010:__extent_writepage_io.constprop.0+0x4c1/0x5c0
      Call Trace:
       <TASK>
       extent_write_cache_pages+0x2ac/0x8f0
       extent_writepages+0x87/0x110
       do_writepages+0xd5/0x1f0
       filemap_fdatawrite_wbc+0x63/0x90
       __filemap_fdatawrite_range+0x5c/0x80
       btrfs_fdatawrite_range+0x1f/0x50
       btrfs_write_out_cache+0x507/0x560
       btrfs_write_dirty_block_groups+0x32a/0x420
       commit_cowonly_roots+0x21b/0x290
       btrfs_commit_transaction+0x813/0x1360
       btrfs_sync_file+0x51a/0x640
       __x64_sys_fdatasync+0x52/0x90
       do_syscall_64+0x9c/0x190
       entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
      This happens because we fail to write out the free space cache in one
      instance, come back around and attempt to write it again.  However on
      the second pass through we go to call btrfs_get_extent() on the inode to
      get the extent mapping.  Because this is a new block group, and with the
      free space inode we always search the commit root to avoid deadlocking
      with the tree, we find nothing and return a EXTENT_MAP_HOLE for the
      requested range.
      
      This happens because the first time we try to write the space cache out
      we hit an error, and on an error we drop the extent mapping.  This is
      normal for normal files, but the free space cache inode is special.  We
      always expect the extent map to be correct.  Thus the second time
      through we end up with a bogus extent map.
      
      Since we're deprecating this feature, the most straightforward way to
      fix this is to simply skip dropping the extent map range for this failed
      range.
      
      I shortened the test by using error injection to stress the area to make
      it easier to reproduce.  With this patch in place we no longer panic
      with my error injection test.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5571e41e
  5. 31 Jan, 2024 4 commits
    • Qu Wenruo's avatar
      btrfs: do not ASSERT() if the newly created subvolume already got read · e03ee2fe
      Qu Wenruo authored
      [BUG]
      There is a syzbot crash, triggered by the ASSERT() during subvolume
      creation:
      
       assertion failed: !anon_dev, in fs/btrfs/disk-io.c:1319
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/disk-io.c:1319!
       invalid opcode: 0000 [#1] PREEMPT SMP KASAN
       RIP: 0010:btrfs_get_root_ref.part.0+0x9aa/0xa60
        <TASK>
        btrfs_get_new_fs_root+0xd3/0xf0
        create_subvol+0xd02/0x1650
        btrfs_mksubvol+0xe95/0x12b0
        __btrfs_ioctl_snap_create+0x2f9/0x4f0
        btrfs_ioctl_snap_create+0x16b/0x200
        btrfs_ioctl+0x35f0/0x5cf0
        __x64_sys_ioctl+0x19d/0x210
        do_syscall_64+0x3f/0xe0
        entry_SYSCALL_64_after_hwframe+0x63/0x6b
       ---[ end trace 0000000000000000 ]---
      
      [CAUSE]
      During create_subvol(), after inserting root item for the newly created
      subvolume, we would trigger btrfs_get_new_fs_root() to get the
      btrfs_root of that subvolume.
      
      The idea here is, we have preallocated an anonymous device number for
      the subvolume, thus we can assign it to the new subvolume.
      
      But there is really nothing preventing things like backref walk to read
      the new subvolume.
      If that happens before we call btrfs_get_new_fs_root(), the subvolume
      would be read out, with a new anonymous device number assigned already.
      
      In that case, we would trigger ASSERT(), as we really expect no one to
      read out that subvolume (which is not yet accessible from the fs).
      But things like backref walk is still possible to trigger the read on
      the subvolume.
      
      Thus our assumption on the ASSERT() is not correct in the first place.
      
      [FIX]
      Fix it by removing the ASSERT(), and just free the @anon_dev, reset it
      to 0, and continue.
      
      If the subvolume tree is read out by something else, it should have
      already get a new anon_dev assigned thus we only need to free the
      preallocated one.
      Reported-by: default avatarChenyuan Yang <chenyuan0y@gmail.com>
      Fixes: 2dfb1e43 ("btrfs: preallocate anon block device at first phase of snapshot creation")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e03ee2fe
    • Boris Burkov's avatar
      btrfs: forbid deleting live subvol qgroup · a8df3561
      Boris Burkov authored
      If a subvolume still exists, forbid deleting its qgroup 0/subvolid.
      This behavior generally leads to incorrect behavior in squotas and
      doesn't have a legitimate purpose.
      
      Fixes: cecbb533 ("btrfs: record simple quota deltas in delayed refs")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a8df3561
    • Boris Burkov's avatar
      btrfs: forbid creating subvol qgroups · 0c309d66
      Boris Burkov authored
      Creating a qgroup 0/subvolid leads to various races and it isn't
      helpful, because you can't specify a subvol id when creating a subvol,
      so you can't be sure it will be the right one. Any requirements on the
      automatic subvol can be gratified by using a higher level qgroup and the
      inheritance parameters of subvol creation.
      
      Fixes: cecbb533 ("btrfs: record simple quota deltas in delayed refs")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0c309d66
    • David Sterba's avatar
      btrfs: send: return EOPNOTSUPP on unknown flags · f884a9f9
      David Sterba authored
      When some ioctl flags are checked we return EOPNOTSUPP, like for
      BTRFS_SCRUB_SUPPORTED_FLAGS, BTRFS_SUBVOL_CREATE_ARGS_MASK or fallocate
      modes. The EINVAL is supposed to be for a supported but invalid
      values or combination of options. Fix that when checking send flags so
      it's consistent with the rest.
      
      CC: stable@vger.kernel.org # 4.14+
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5rryOLzp3EKq8RTbjMHMHeaJubfpsVLF6H4qJnKCUR1w@mail.gmail.com/Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f884a9f9
  6. 18 Jan, 2024 9 commits
    • Qu Wenruo's avatar
      btrfs: scrub: limit RST scrub to chunk boundary · 7f2d219e
      Qu Wenruo authored
      [BUG]
      If there is an extent beyond chunk boundary, currently RST scrub would
      error out.
      
      [CAUSE]
      In scrub_submit_extent_sector_read(), we completely rely on
      extent_sector_bitmap, which is populated using extent tree.
      
      The extent tree can be corrupted that there is an extent item beyond a
      chunk.
      
      In that case, RST scrub would fail and error out.
      
      [FIX]
      Despite the extent_sector_bitmap usage, also limit the read to chunk
      boundary.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f2d219e
    • Qu Wenruo's avatar
      btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned · f546c428
      Qu Wenruo authored
      [BUG]
      There is a bug report that, on a ext4-converted btrfs, scrub leads to
      various problems, including:
      
      - "unable to find chunk map" errors
        BTRFS info (device vdb): scrub: started on devid 1
        BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
        BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056
      
        This would lead to unrepariable errors.
      
      - Use-after-free KASAN reports:
        ==================================================================
        BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
        Read of size 8 at addr ffff8881013c9040 by task btrfs/909
        CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
        Call Trace:
         <TASK>
         dump_stack_lvl+0x43/0x60
         print_report+0xcf/0x640
         kasan_report+0xa6/0xd0
         __blk_rq_map_sg+0x18f/0x7c0
         virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
         virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
         blk_mq_flush_plug_list.part.0+0x780/0x860
         __blk_flush_plug+0x1ba/0x220
         blk_finish_plug+0x3b/0x60
         submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         __x64_sys_ioctl+0xbd/0x100
         do_syscall_64+0x5d/0xe0
         entry_SYSCALL_64_after_hwframe+0x63/0x6b
        RIP: 0033:0x7f47e5e0952b
      
      - Crash, mostly due to above use-after-free
      
      [CAUSE]
      The converted fs has the following data chunk layout:
      
          item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
              length 86016 owner 2 stripe_len 65536 type DATA|single
      
      For above logical bytenr 2214744064, it's at the chunk end
      (2214658048 + 86016 = 2214744064).
      
      This means btrfs_submit_bio() would split the bio, and trigger endio
      function for both of the two halves.
      
      However scrub_submit_initial_read() would only expect the endio function
      to be called once, not any more.
      This means the first endio function would already free the bbio::bio,
      leaving the bvec freed, thus the 2nd endio call would lead to
      use-after-free.
      
      [FIX]
      - Make sure scrub_read_endio() only updates bits in its range
        Since we may read less than 64K at the end of the chunk, we should not
        touch the bits beyond chunk boundary.
      
      - Make sure scrub_submit_initial_read() only to read the chunk range
        This is done by calculating the real number of sectors we need to
        read, and add sector-by-sector to the bio.
      
      Thankfully the scrub read repair path won't need extra fixes:
      
      - scrub_stripe_submit_repair_read()
        With above fixes, we won't update error bit for range beyond chunk,
        thus scrub_stripe_submit_repair_read() should never submit any read
        beyond the chunk.
      Reported-by: default avatarRongrong <i@rong.moe>
      Fixes: e02ee89b ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
      Tested-by: default avatarRongrong <i@rong.moe>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f546c428
    • Josef Bacik's avatar
      btrfs: don't unconditionally call folio_start_writeback in subpage · 1e61b8c6
      Josef Bacik authored
      In the normal case we check if a page is under writeback and skip it
      before we attempt to begin writeback.
      
      The exception is subpage metadata writes, where we know we don't have an
      eb under writeback and we're doing it one eb at a time.  Since
      b5612c36 ("mm: return void from folio_start_writeback() and related
      functions") we now will BUG_ON() if we call folio_start_writeback()
      on a folio that's already under writeback.  Previously
      folio_start_writeback() would bail if writeback was already started.
      
      Fix this in the subpage code by checking if we have writeback set and
      skipping it if we do.  This fixes the panic we were seeing on subpage.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1e61b8c6
    • Josef Bacik's avatar
      btrfs: use the original mount's mount options for the legacy reconfigure · 2018ef1d
      Josef Bacik authored
      btrfs/330, which tests our old trick to allow
      
      mount -o ro,subvol=/x /dev/sda1 /foo
      mount -o rw,subvol=/y /dev/sda1 /bar
      
      fails on the block group tree.  This is because we aren't preserving the
      mount options for what is essentially a remount, and thus we're ending
      up without the FREE_SPACE_TREE mount option, which triggers our free
      space tree delete codepath.  This isn't possible with the block group
      tree and thus it falls over.
      
      Fix this by making sure we copy the existing mount options for the
      existing fs mount over in this case.
      
      Fixes: f044b318 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
      Reviewed-by: default avatarNeal Gompa <neal@gompa.dev>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2018ef1d
    • David Sterba's avatar
      btrfs: don't warn if discard range is not aligned to sector · a208b3f1
      David Sterba authored
      There's a warning in btrfs_issue_discard() when the range is not aligned
      to 512 bytes, originally added in 4d89d377 ("btrfs:
      btrfs_issue_discard ensure offset/length are aligned to sector
      boundaries"). We can't do sub-sector writes anyway so the adjustment is
      the only thing that we can do and the warning is unnecessary.
      
      CC: stable@vger.kernel.org # 4.19+
      Reported-by: syzbot+4a4f1eba14eb5c3417d1@syzkaller.appspotmail.com
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a208b3f1
    • Chung-Chiang Cheng's avatar
      btrfs: tree-checker: fix inline ref size in error messages · f398e70d
      Chung-Chiang Cheng authored
      The error message should accurately reflect the size rather than the
      type.
      
      Fixes: f82d1c7c ("btrfs: tree-checker: Add EXTENT_ITEM and METADATA_ITEM check")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChung-Chiang Cheng <cccheng@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f398e70d
    • Qu Wenruo's avatar
      btrfs: zstd: fix and simplify the inline extent decompression · 1e7f6def
      Qu Wenruo authored
      [BUG]
      If we have a filesystem with 4k sectorsize, and an inlined compressed
      extent created like this:
      
      	item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
      		generation 8 transid 8 size 4096 nbytes 4096
      		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
      		sequence 1 flags 0x0(none)
      	item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
      		index 2 namelen 14 name: source_inlined
      	item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
      		generation 8 type 0 (inline)
      		inline extent data size 48 ram_bytes 4096 compression 3 (zstd)
      
      Then trying to reflink that extent in an aarch64 system with 64K page
      size, the reflink would just fail:
      
        # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
        XFS_IOC_CLONE_RANGE: Input/output error
      
      [CAUSE]
      In zstd_decompress(), we didn't treat @start_byte as just a page offset,
      but also use it as an indicator on whether we should error out, without
      any proper explanation (this is copied from other decompression code).
      
      In reality, for subpage cases, although @start_byte can be non-zero,
      we should never switch input/output buffer nor error out, since the whole
      input/output buffer should never exceed one sector, thus we should not
      need to do any buffer switch.
      
      Thus the current code using @start_byte as a condition to switch
      input/output buffer or finish the decompression is completely incorrect.
      
      [FIX]
      The fix involves several modification:
      
      - Rename @start_byte to @dest_pgoff to properly express its meaning
      
      - Use @sectorsize other than PAGE_SIZE to properly initialize the
        output buffer size
      
      - Use correct destination offset inside the destination page
      
      - Simplify the main loop
        Since the input/output buffer should never switch, we only need one
        zstd_decompress_stream() call.
      
      - Consider early end as an error
      
      After the fix, even on 64K page sized aarch64, above reflink now
      works as expected:
      
        # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
        linked 4096/4096 bytes at offset 61440
      
      And results the correct file layout:
      
      	item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
      		generation 10 transid 10 size 65536 nbytes 4096
      		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
      		sequence 1 flags 0x0(none)
      	item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
      		index 3 namelen 4 name: dest
      	item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
      		location key (0 UNKNOWN.0 0) type XATTR
      		transid 10 data_len 37 name_len 16
      		name: security.selinux
      		data unconfined_u:object_r:unlabeled_t:s0
      	item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
      		generation 10 type 1 (regular)
      		extent data disk byte 13631488 nr 4096
      		extent data offset 0 nr 4096 ram 4096
      		extent compression 0 (none)
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1e7f6def
    • Qu Wenruo's avatar
      btrfs: lzo: fix and simplify the inline extent decompression · 6a69631e
      Qu Wenruo authored
      [BUG]
      If we have a filesystem with 4k sectorsize, and an inlined compressed
      extent created like this:
      
      	item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
      		generation 8 transid 8 size 4096 nbytes 4096
      		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
      		sequence 1 flags 0x0(none)
      	item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
      		index 2 namelen 14 name: source_inlined
      	item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
      		generation 8 type 0 (inline)
      		inline extent data size 48 ram_bytes 4096 compression 2 (lzo)
      
      Then trying to reflink that extent in an aarch64 system with 64K page
      size, the reflink would just fail:
      
        # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
        XFS_IOC_CLONE_RANGE: Input/output error
      
      [CAUSE]
      In zlib_decompress(), we didn't treat @start_byte as just a page offset,
      but also use it as an indicator on whether we should error out, without
      any proper explanation (this is from the very beginning of btrfs).
      
      In reality, for subpage cases, although @start_byte can be non-zero,
      we should never switch input/output buffer nor error out, since the whole
      input/output buffer should never exceed one sector.
      
      Note: The above assumption is only not true if we're going to support
      multi-page sectorsize.
      
      Thus the current code using @start_byte as a condition to switch
      input/output buffer or finish the decompression is completely incorrect.
      
      [FIX]
      The fix involves several modifications:
      
      - Rename @start_byte to @dest_pgoff to properly express its meaning
      
      - Use @sectorsize other than PAGE_SIZE to properly initialize the
        output buffer size
      
      - Use correct destination offset inside the destination page
      
      - Use memcpy_to_page() to copy the contents to the destination page
      
      - Use memzero_page() to zero out the tailing part
      
      - Consider early end as an error
      
      After the fix, even on 64K page sized aarch64, above reflink now
      works as expected:
      
        # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
        linked 4096/4096 bytes at offset 61440
      
      And results the correct file layout:
      
      	item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
      		generation 10 transid 10 size 65536 nbytes 4096
      		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
      		sequence 1 flags 0x0(none)
      	item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
      		index 3 namelen 4 name: dest
      	item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
      		location key (0 UNKNOWN.0 0) type XATTR
      		transid 10 data_len 37 name_len 16
      		name: security.selinux
      		data unconfined_u:object_r:unlabeled_t:s0
      	item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
      		generation 10 type 1 (regular)
      		extent data disk byte 13631488 nr 4096
      		extent data offset 0 nr 4096 ram 4096
      		extent compression 0 (none)
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6a69631e
    • Qu Wenruo's avatar
      btrfs: zlib: fix and simplify the inline extent decompression · 2c25716d
      Qu Wenruo authored
      [BUG]
      
      If we have a filesystem with 4k sectorsize, and an inlined compressed
      extent created like this:
      
      	item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
      		generation 8 transid 8 size 4096 nbytes 4096
      		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
      		sequence 1 flags 0x0(none)
      	item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
      		index 2 namelen 14 name: source_inlined
      	item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
      		generation 8 type 0 (inline)
      		inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
      
      Which has an inline compressed extent at file offset 0, and its
      decompressed size is 4K, allowing us to reflink that 4K range to another
      location (which will not be compressed).
      
      If we do such reflink on a subpage system, it would fail like this:
      
        # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
        XFS_IOC_CLONE_RANGE: Input/output error
      
      [CAUSE]
      In zlib_decompress(), we didn't treat @start_byte as just a page offset,
      but also use it as an indicator on whether we should switch our output
      buffer.
      
      In reality, for subpage cases, although @start_byte can be non-zero,
      we should never switch input/output buffer, since the whole input/output
      buffer should never exceed one sector.
      
      Note: The above assumption is only not true if we're going to support
      multi-page sectorsize.
      
      Thus the current code using @start_byte as a condition to switch
      input/output buffer or finish the decompression is completely incorrect.
      
      [FIX]
      The fix involves several modifications:
      
      - Rename @start_byte to @dest_pgoff to properly express its meaning
      
      - Add an extra ASSERT() inside btrfs_decompress() to make sure the
        input/output size never exceeds one sector.
      
      - Use Z_FINISH flag to make sure the decompression happens in one go
      
      - Remove the loop needed to switch input/output buffers
      
      - Use correct destination offset inside the destination page
      
      - Consider early end as an error
      
      After the fix, even on 64K page sized aarch64, above reflink now
      works as expected:
      
        # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
        linked 4096/4096 bytes at offset 61440
      
      And resulted a correct file layout:
      
      	item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
      		generation 10 transid 10 size 65536 nbytes 4096
      		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
      		sequence 1 flags 0x0(none)
      	item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
      		index 3 namelen 4 name: dest
      	item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
      		location key (0 UNKNOWN.0 0) type XATTR
      		transid 10 data_len 37 name_len 16
      		name: security.selinux
      		data unconfined_u:object_r:unlabeled_t:s0
      	item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
      		generation 10 type 1 (regular)
      		extent data disk byte 13631488 nr 4096
      		extent data offset 0 nr 4096 ram 4096
      		extent compression 0 (none)
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2c25716d
  7. 12 Jan, 2024 9 commits
    • Qu Wenruo's avatar
      btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args · 173431b2
      Qu Wenruo authored
      Add extra sanity check for btrfs_ioctl_defrag_range_args::flags.
      
      This is not really to enhance fuzzing tests, but as a preparation for
      future expansion on btrfs_ioctl_defrag_range_args.
      
      In the future we're going to add new members, allowing more fine tuning
      for btrfs defrag.  Without the -ENONOTSUPP error, there would be no way
      to detect if the kernel supports those new defrag features.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      173431b2
    • Omar Sandoval's avatar
      btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted · 3324d054
      Omar Sandoval authored
      Sweet Tea spotted a race between subvolume deletion and snapshotting
      that can result in the root item for the snapshot having the
      BTRFS_ROOT_SUBVOL_DEAD flag set. The race is:
      
      Thread 1                                      | Thread 2
      ----------------------------------------------|----------
      btrfs_delete_subvolume                        |
        btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)|
                                                    |btrfs_mksubvol
                                                    |  down_read(subvol_sem)
                                                    |  create_snapshot
                                                    |    ...
                                                    |    create_pending_snapshot
                                                    |      copy root item from source
        down_write(subvol_sem)                      |
      
      This flag is only checked in send and swap activate, which this would
      cause to fail mysteriously.
      
      create_snapshot() now checks the root refs to reject a deleted
      subvolume, so we can fix this by locking subvol_sem earlier so that the
      BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically.
      
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3324d054
    • Omar Sandoval's avatar
      btrfs: don't abort filesystem when attempting to snapshot deleted subvolume · 7081929a
      Omar Sandoval authored
      If the source file descriptor to the snapshot ioctl refers to a deleted
      subvolume, we get the following abort:
      
        BTRFS: Transaction aborted (error -2)
        WARNING: CPU: 0 PID: 833 at fs/btrfs/transaction.c:1875 create_pending_snapshot+0x1040/0x1190 [btrfs]
        Modules linked in: pata_acpi btrfs ata_piix libata scsi_mod virtio_net blake2b_generic xor net_failover virtio_rng failover scsi_common rng_core raid6_pq libcrc32c
        CPU: 0 PID: 833 Comm: t_snapshot_dele Not tainted 6.7.0-rc6 #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
        RIP: 0010:create_pending_snapshot+0x1040/0x1190 [btrfs]
        RSP: 0018:ffffa09c01337af8 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: ffff9982053e7c78 RCX: 0000000000000027
        RDX: ffff99827dc20848 RSI: 0000000000000001 RDI: ffff99827dc20840
        RBP: ffffa09c01337c00 R08: 0000000000000000 R09: ffffa09c01337998
        R10: 0000000000000003 R11: ffffffffb96da248 R12: fffffffffffffffe
        R13: ffff99820535bb28 R14: ffff99820b7bd000 R15: ffff99820381ea80
        FS:  00007fe20aadabc0(0000) GS:ffff99827dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000559a120b502f CR3: 00000000055b6000 CR4: 00000000000006f0
        Call Trace:
         <TASK>
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         ? __warn+0x81/0x130
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         ? report_bug+0x171/0x1a0
         ? handle_bug+0x3a/0x70
         ? exc_invalid_op+0x17/0x70
         ? asm_exc_invalid_op+0x1a/0x20
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         create_pending_snapshots+0x92/0xc0 [btrfs]
         btrfs_commit_transaction+0x66b/0xf40 [btrfs]
         btrfs_mksubvol+0x301/0x4d0 [btrfs]
         btrfs_mksnapshot+0x80/0xb0 [btrfs]
         __btrfs_ioctl_snap_create+0x1c2/0x1d0 [btrfs]
         btrfs_ioctl_snap_create_v2+0xc4/0x150 [btrfs]
         btrfs_ioctl+0x8a6/0x2650 [btrfs]
         ? kmem_cache_free+0x22/0x340
         ? do_sys_openat2+0x97/0xe0
         __x64_sys_ioctl+0x97/0xd0
         do_syscall_64+0x46/0xf0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
        RIP: 0033:0x7fe20abe83af
        RSP: 002b:00007ffe6eff1360 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe20abe83af
        RDX: 00007ffe6eff23c0 RSI: 0000000050009417 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000000 R09: 00007fe20ad16cd0
        R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        R13: 00007ffe6eff13c0 R14: 00007fe20ad45000 R15: 0000559a120b6d58
         </TASK>
        ---[ end trace 0000000000000000 ]---
        BTRFS: error (device vdc: state A) in create_pending_snapshot:1875: errno=-2 No such entry
        BTRFS info (device vdc: state EA): forced readonly
        BTRFS warning (device vdc: state EA): Skipping commit of aborted transaction.
        BTRFS: error (device vdc: state EA) in cleanup_transaction:2055: errno=-2 No such entry
      
      This happens because create_pending_snapshot() initializes the new root
      item as a copy of the source root item. This includes the refs field,
      which is 0 for a deleted subvolume. The call to btrfs_insert_root()
      therefore inserts a root with refs == 0. btrfs_get_new_fs_root() then
      finds the root and returns -ENOENT if refs == 0, which causes
      create_pending_snapshot() to abort.
      
      Fix it by checking the source root's refs before attempting the
      snapshot, but after locking subvol_sem to avoid racing with deletion.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7081929a
    • Naohiro Aota's avatar
      btrfs: zoned: fix lock ordering in btrfs_zone_activate() · b18f3b60
      Naohiro Aota authored
      The btrfs CI reported a lockdep warning as follows by running generic
      generic/129.
      
         WARNING: possible circular locking dependency detected
         6.7.0-rc5+ #1 Not tainted
         ------------------------------------------------------
         kworker/u5:5/793427 is trying to acquire lock:
         ffff88813256d028 (&cache->lock){+.+.}-{2:2}, at: btrfs_zone_finish_one_bg+0x5e/0x130
         but task is already holding lock:
         ffff88810a23a318 (&fs_info->zone_active_bgs_lock){+.+.}-{2:2}, at: btrfs_zone_finish_one_bg+0x34/0x130
         which lock already depends on the new lock.
      
         the existing dependency chain (in reverse order) is:
         -> #1 (&fs_info->zone_active_bgs_lock){+.+.}-{2:2}:
         ...
         -> #0 (&cache->lock){+.+.}-{2:2}:
         ...
      
      This is because we take fs_info->zone_active_bgs_lock after a block_group's
      lock in btrfs_zone_activate() while doing the opposite in other places.
      
      Fix the issue by expanding the fs_info->zone_active_bgs_lock's critical
      section and taking it before a block_group's lock.
      
      Fixes: a7e1ac7b ("btrfs: zoned: reserve zones for an active metadata/system block group")
      CC: stable@vger.kernel.org # 6.6
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b18f3b60
    • Naohiro Aota's avatar
      btrfs: fix unbalanced unlock of mapping_tree_lock · d967c914
      Naohiro Aota authored
      The error path of btrfs_get_chunk_map() releases
      fs_info->mapping_tree_lock. But, it is taken and released in
      btrfs_find_chunk_map(). So, there is no need to do so.
      
      Fixes: 7dc66abb ("btrfs: use a dedicated data structure for chunk maps")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d967c914
    • Fedor Pchelkin's avatar
      btrfs: ref-verify: free ref cache before clearing mount opt · f03e274a
      Fedor Pchelkin authored
      As clearing REF_VERIFY mount option indicates there were some errors in a
      ref-verify process, a ref cache is not relevant anymore and should be
      freed.
      
      btrfs_free_ref_cache() requires REF_VERIFY option being set so call
      it just before clearing the mount option.
      
      Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
      
      Reported-by: syzbot+be14ed7728594dc8bd42@syzkaller.appspotmail.com
      Fixes: fd708b81 ("Btrfs: add a extent ref verify tool")
      CC: stable@vger.kernel.org # 5.4+
      Closes: https://lore.kernel.org/lkml/000000000000e5a65c05ee832054@google.com/
      Reported-by: syzbot+c563a3c79927971f950f@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/lkml/0000000000007fe09705fdc6086c@google.com/Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFedor Pchelkin <pchelkin@ispras.ru>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f03e274a
    • Dmitry Antipov's avatar
      btrfs: fix kvcalloc() arguments order in btrfs_ioctl_send() · 6ff09b6b
      Dmitry Antipov authored
      When compiling with gcc version 14.0.0 20231220 (experimental)
      and W=1, I've noticed the following warning:
      
      fs/btrfs/send.c: In function 'btrfs_ioctl_send':
      fs/btrfs/send.c:8208:44: warning: 'kvcalloc' sizes specified with 'sizeof'
      in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
       8208 |         sctx->clone_roots = kvcalloc(sizeof(*sctx->clone_roots),
            |                                            ^
      
      Since 'n' and 'size' arguments of 'kvcalloc()' are multiplied to
      calculate the final size, their actual order doesn't affect the result
      and so this is not a bug. But it's still worth to fix it.
      Signed-off-by: default avatarDmitry Antipov <dmantipov@yandex.ru>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6ff09b6b
    • Naohiro Aota's avatar
      btrfs: zoned: optimize hint byte for zoned allocator · 02444f2a
      Naohiro Aota authored
      Writing sequentially to a huge file on btrfs on a SMR HDD revealed a
      decline of the performance (220 MiB/s to 30 MiB/s after 500 minutes).
      
      The performance goes down because of increased latency of the extent
      allocation, which is induced by a traversing of a lot of full block groups.
      
      So, this patch optimizes the ffe_ctl->hint_byte by choosing a block group
      with sufficient size from the active block group list, which does not
      contain full block groups.
      
      After applying the patch, the performance is maintained well.
      
      Fixes: 2eda5708 ("btrfs: zoned: implement sequential extent allocation")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      02444f2a
    • Naohiro Aota's avatar
      btrfs: zoned: factor out prepare_allocation_zoned() · b271fee9
      Naohiro Aota authored
      Factor out prepare_allocation_zoned() for further extension. While at
      it, optimize the if-branch a bit.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b271fee9
  8. 15 Dec, 2023 4 commits