1. 31 Jan, 2024 2 commits
  2. 18 Jan, 2024 9 commits
    • btrfs: scrub: limit RST scrub to chunk boundary · 7f2d219e
      Qu Wenruo authored
      [BUG]
      If there is an extent beyond chunk boundary, currently RST scrub would
      error out.
      
      [CAUSE]
      In scrub_submit_extent_sector_read(), we completely rely on
      extent_sector_bitmap, which is populated using extent tree.
      
      The extent tree can be corrupted such that there is an extent item
      beyond a chunk.
      
      In that case, RST scrub would fail and error out.
      
      [FIX]
      Despite the extent_sector_bitmap usage, also limit the read to chunk
      boundary.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
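The idea in the [FIX] section above (trust the extent bitmap, but never past the chunk end) can be sketched in userspace C as follows. This is a minimal sketch and not the kernel code: the function name, the 16-sector stripe and the 4K sector size are assumptions made for illustration.

```c
#include <stdint.h>

#define SECTORSIZE         4096u /* assumed sector size */
#define SECTORS_PER_STRIPE 16u   /* assumed: 64K stripe / 4K sectors */

/*
 * Count the sectors of one stripe that should be read. A sector
 * qualifies only if its bit is set in the extent bitmap AND it starts
 * before the end of the chunk. Hypothetical helper, not a kernel API.
 */
static unsigned int sectors_to_read(uint32_t extent_bitmap,
				    uint64_t stripe_logical,
				    uint64_t chunk_end)
{
	unsigned int count = 0;

	for (unsigned int i = 0; i < SECTORS_PER_STRIPE; i++) {
		uint64_t sector_start = stripe_logical + (uint64_t)i * SECTORSIZE;

		/* Stop at the chunk boundary even if the bitmap has more bits set. */
		if (sector_start >= chunk_end)
			break;
		if (extent_bitmap & (1u << i))
			count++;
	}
	return count;
}
```

With a fully set bitmap and a chunk that ends five sectors into the stripe, only those five sectors are counted.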
    • btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned · f546c428
      Qu Wenruo authored
      [BUG]
      There is a bug report that, on an ext4-converted btrfs, scrub leads to
      various problems, including:
      
      - "unable to find chunk map" errors
        BTRFS info (device vdb): scrub: started on devid 1
        BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
        BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056
      
        This would lead to unrepairable errors.
      
      - Use-after-free KASAN reports:
        ==================================================================
        BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
        Read of size 8 at addr ffff8881013c9040 by task btrfs/909
        CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
        Call Trace:
         <TASK>
         dump_stack_lvl+0x43/0x60
         print_report+0xcf/0x640
         kasan_report+0xa6/0xd0
         __blk_rq_map_sg+0x18f/0x7c0
         virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
         virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
         blk_mq_flush_plug_list.part.0+0x780/0x860
         __blk_flush_plug+0x1ba/0x220
         blk_finish_plug+0x3b/0x60
         submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
         __x64_sys_ioctl+0xbd/0x100
         do_syscall_64+0x5d/0xe0
         entry_SYSCALL_64_after_hwframe+0x63/0x6b
        RIP: 0033:0x7f47e5e0952b
      
      - Crash, mostly due to above use-after-free
      
      [CAUSE]
      The converted fs has the following data chunk layout:
      
          item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
              length 86016 owner 2 stripe_len 65536 type DATA|single
      
      For above logical bytenr 2214744064, it's at the chunk end
      (2214658048 + 86016 = 2214744064).
      
      This means btrfs_submit_bio() would split the bio, and trigger endio
      function for both of the two halves.
      
      However scrub_submit_initial_read() would only expect the endio function
      to be called once, not any more.
      This means the first endio function would already free the bbio::bio,
      leaving the bvec freed, thus the 2nd endio call would lead to
      use-after-free.
      
      [FIX]
      - Make sure scrub_read_endio() only updates bits in its range
        Since we may read less than 64K at the end of the chunk, we should not
        touch the bits beyond chunk boundary.
      
      - Make sure scrub_submit_initial_read() only reads the chunk range
        This is done by calculating the real number of sectors we need to
        read, and adding them sector-by-sector to the bio.
      
      Thankfully the scrub read repair path won't need extra fixes:
      
      - scrub_stripe_submit_repair_read()
        With above fixes, we won't update error bit for range beyond chunk,
        thus scrub_stripe_submit_repair_read() should never submit any read
        beyond the chunk.
      Reported-by: Rongrong <i@rong.moe>
      Fixes: e02ee89b ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
      Tested-by: Rongrong <i@rong.moe>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
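The arithmetic behind "calculating the real number of sectors we need to read" can be worked through with the chunk from the report (start 2214658048, length 86016). The helper below is a hypothetical userspace sketch assuming a 64K stripe and 4K sectors; it is not the function used in the kernel.

```c
#include <stdint.h>

#define STRIPE_LEN 65536u /* matches stripe_len in the chunk item */
#define SECTORSIZE 4096u  /* assumed sector size */

/*
 * Number of sectors of the stripe starting at stripe_logical that lie
 * inside the chunk [chunk_start, chunk_start + chunk_len).
 * Hypothetical helper for illustration only.
 */
static unsigned int stripe_nr_sectors(uint64_t stripe_logical,
				      uint64_t chunk_start,
				      uint64_t chunk_len)
{
	uint64_t chunk_end = chunk_start + chunk_len;
	uint64_t stripe_end = stripe_logical + STRIPE_LEN;
	uint64_t read_end = stripe_end < chunk_end ? stripe_end : chunk_end;

	if (stripe_logical >= chunk_end)
		return 0;
	return (unsigned int)((read_end - stripe_logical) / SECTORSIZE);
}
```

For this chunk the first stripe holds the full 16 sectors, while the second stripe (starting at 2214723584) holds only 5, since 86016 - 65536 = 20480 bytes remain before the chunk end at 2214744064.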
    • btrfs: don't unconditionally call folio_start_writeback in subpage · 1e61b8c6
      Josef Bacik authored
      In the normal case we check if a page is under writeback and skip it
      before we attempt to begin writeback.
      
      The exception is subpage metadata writes, where we know we don't have an
      eb under writeback and we're doing it one eb at a time.  Since
      b5612c36 ("mm: return void from folio_start_writeback() and related
      functions") we now will BUG_ON() if we call folio_start_writeback()
      on a folio that's already under writeback.  Previously
      folio_start_writeback() would bail if writeback was already started.
      
      Fix this in the subpage code by checking if we have writeback set and
      skipping it if we do.  This fixes the panic we were seeing on subpage.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use the original mount's mount options for the legacy reconfigure · 2018ef1d
      Josef Bacik authored
      btrfs/330, which tests our old trick to allow
      
      mount -o ro,subvol=/x /dev/sda1 /foo
      mount -o rw,subvol=/y /dev/sda1 /bar
      
      fails on the block group tree.  This is because we aren't preserving the
      mount options for what is essentially a remount, and thus we're ending
      up without the FREE_SPACE_TREE mount option, which triggers our free
      space tree delete codepath.  This isn't possible with the block group
      tree and thus it falls over.
      
      Fix this by making sure we copy the existing mount options for the
      existing fs mount over in this case.
      
      Fixes: f044b318 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
      Reviewed-by: Neal Gompa <neal@gompa.dev>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't warn if discard range is not aligned to sector · a208b3f1
      David Sterba authored
      There's a warning in btrfs_issue_discard() when the range is not aligned
      to 512 bytes, originally added in 4d89d377 ("btrfs:
      btrfs_issue_discard ensure offset/length are aligned to sector
      boundaries"). We can't do sub-sector writes anyway so the adjustment is
      the only thing that we can do and the warning is unnecessary.
      
      CC: stable@vger.kernel.org # 4.19+
      Reported-by: syzbot+4a4f1eba14eb5c3417d1@syzkaller.appspotmail.com
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
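One way the "adjustment" mentioned above could look is to shrink an unaligned byte range inward to 512-byte boundaries. The sketch below is an assumption about the general technique, written as standalone C; the struct and function names are hypothetical and this is not the body of btrfs_issue_discard().

```c
#include <stdint.h>

#define SECTOR_SIZE 512u /* the 512-byte unit named in the commit message */

struct byte_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Shrink [start, start + len) inward so both ends are sector aligned:
 * the start is rounded up, the end is rounded down. If nothing is left
 * the returned length is 0. Hypothetical helper for illustration.
 */
static struct byte_range shrink_to_sectors(uint64_t start, uint64_t len)
{
	uint64_t end = start + len;
	uint64_t aligned_start = (start + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;
	uint64_t aligned_end = end / SECTOR_SIZE * SECTOR_SIZE;
	struct byte_range r = { aligned_start, 0 };

	if (aligned_end > aligned_start)
		r.len = aligned_end - aligned_start;
	return r;
}
```

Because the range only ever shrinks, no byte outside the caller's original range is touched, which is why a silent adjustment is safe and a warning adds little.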
    • btrfs: tree-checker: fix inline ref size in error messages · f398e70d
      Chung-Chiang Cheng authored
      The error message should accurately reflect the size rather than the
      type.
      
      Fixes: f82d1c7c ("btrfs: tree-checker: Add EXTENT_ITEM and METADATA_ITEM check")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zstd: fix and simplify the inline extent decompression · 1e7f6def
      Qu Wenruo authored
      [BUG]
      If we have a filesystem with 4k sectorsize, and an inlined compressed
      extent created like this:
      
      	item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
      		generation 8 transid 8 size 4096 nbytes 4096
      		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
      		sequence 1 flags 0x0(none)
      	item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
      		index 2 namelen 14 name: source_inlined
      	item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
      		generation 8 type 0 (inline)
      		inline extent data size 48 ram_bytes 4096 compression 3 (zstd)
      
      Then trying to reflink that extent in an aarch64 system with 64K page
      size, the reflink would just fail:
      
        # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
        XFS_IOC_CLONE_RANGE: Input/output error
      
      [CAUSE]
      In zstd_decompress(), we didn't treat @start_byte as just a page offset,
      but also use it as an indicator on whether we should error out, without
      any proper explanation (this is copied from other decompression code).
      
      In reality, for subpage cases, although @start_byte can be non-zero,
      we should never switch input/output buffer nor error out, since the whole
      input/output buffer should never exceed one sector, thus we should not
      need to do any buffer switch.
      
      Thus the current code using @start_byte as a condition to switch
      input/output buffer or finish the decompression is completely incorrect.
      
      [FIX]
      The fix involves several modifications:
      
      - Rename @start_byte to @dest_pgoff to properly express its meaning
      
      - Use @sectorsize rather than PAGE_SIZE to properly initialize the
        output buffer size
      
      - Use correct destination offset inside the destination page
      
      - Simplify the main loop
        Since the input/output buffer should never switch, we only need one
        zstd_decompress_stream() call.
      
      - Consider early end as an error
      
      After the fix, even on 64K page sized aarch64, above reflink now
      works as expected:
      
        # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
        linked 4096/4096 bytes at offset 61440
      
      And results in the correct file layout:
      
      	item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
      		generation 10 transid 10 size 65536 nbytes 4096
      		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
      		sequence 1 flags 0x0(none)
      	item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
      		index 3 namelen 4 name: dest
      	item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
      		location key (0 UNKNOWN.0 0) type XATTR
      		transid 10 data_len 37 name_len 16
      		name: security.selinux
      		data unconfined_u:object_r:unlabeled_t:s0
      	item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
      		generation 10 type 1 (regular)
      		extent data disk byte 13631488 nr 4096
      		extent data offset 0 nr 4096 ram 4096
      		extent compression 0 (none)
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: lzo: fix and simplify the inline extent decompression · 6a69631e
      Qu Wenruo authored
      [BUG]
      If we have a filesystem with 4k sectorsize, and an inlined compressed
      extent created like this:
      
      	item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
      		generation 8 transid 8 size 4096 nbytes 4096
      		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
      		sequence 1 flags 0x0(none)
      	item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
      		index 2 namelen 14 name: source_inlined
      	item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
      		generation 8 type 0 (inline)
      		inline extent data size 48 ram_bytes 4096 compression 2 (lzo)
      
      Then trying to reflink that extent in an aarch64 system with 64K page
      size, the reflink would just fail:
      
        # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
        XFS_IOC_CLONE_RANGE: Input/output error
      
      [CAUSE]
      In zlib_decompress(), we didn't treat @start_byte as just a page offset,
      but also use it as an indicator on whether we should error out, without
      any proper explanation (this is from the very beginning of btrfs).
      
      In reality, for subpage cases, although @start_byte can be non-zero,
      we should never switch input/output buffer nor error out, since the whole
      input/output buffer should never exceed one sector.
      
      Note: The above assumption is only not true if we're going to support
      multi-page sectorsize.
      
      Thus the current code using @start_byte as a condition to switch
      input/output buffer or finish the decompression is completely incorrect.
      
      [FIX]
      The fix involves several modifications:
      
      - Rename @start_byte to @dest_pgoff to properly express its meaning
      
      - Use @sectorsize rather than PAGE_SIZE to properly initialize the
        output buffer size
      
      - Use correct destination offset inside the destination page
      
      - Use memcpy_to_page() to copy the contents to the destination page
      
      - Use memzero_page() to zero out the trailing part
      
      - Consider early end as an error
      
      After the fix, even on 64K page sized aarch64, above reflink now
      works as expected:
      
        # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
        linked 4096/4096 bytes at offset 61440
      
      And results in the correct file layout:
      
      	item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
      		generation 10 transid 10 size 65536 nbytes 4096
      		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
      		sequence 1 flags 0x0(none)
      	item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
      		index 3 namelen 4 name: dest
      	item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
      		location key (0 UNKNOWN.0 0) type XATTR
      		transid 10 data_len 37 name_len 16
      		name: security.selinux
      		data unconfined_u:object_r:unlabeled_t:s0
      	item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
      		generation 10 type 1 (regular)
      		extent data disk byte 13631488 nr 4096
      		extent data offset 0 nr 4096 ram 4096
      		extent compression 0 (none)
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
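The copy-then-zero steps listed in the [FIX] section can be modelled in userspace with plain memcpy() and memset() on a byte buffer that stands in for the destination page. This is a sketch under stated assumptions: fill_sector() and its parameters are hypothetical names, and the kernel uses memcpy_to_page() and memzero_page() on a real page instead.

```c
#include <stddef.h>
#include <string.h>

/*
 * Place data_len decompressed bytes at dest_pgoff inside a page-sized
 * buffer, then zero the remainder of that sector. The output window is
 * one sector, not the whole page. Returns 0 on success, -1 if the
 * request does not fit. Hypothetical helper for illustration.
 */
static int fill_sector(unsigned char *page, size_t page_size,
		       size_t dest_pgoff, size_t sectorsize,
		       const unsigned char *data, size_t data_len)
{
	if (data_len > sectorsize || dest_pgoff + sectorsize > page_size)
		return -1;

	/* Copy the payload to the destination offset inside the page. */
	memcpy(page + dest_pgoff, data, data_len);
	/* Zero the trailing part of the sector only. */
	memset(page + dest_pgoff + data_len, 0, sectorsize - data_len);
	return 0;
}
```

Bytes before dest_pgoff are left untouched, which matters on a 64K page where the other 4K sectors may hold unrelated data.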
    • btrfs: zlib: fix and simplify the inline extent decompression · 2c25716d
      Qu Wenruo authored
      [BUG]
      
      If we have a filesystem with 4k sectorsize, and an inlined compressed
      extent created like this:
      
      	item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
      		generation 8 transid 8 size 4096 nbytes 4096
      		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
      		sequence 1 flags 0x0(none)
      	item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
      		index 2 namelen 14 name: source_inlined
      	item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
      		generation 8 type 0 (inline)
      		inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
      
      Which has an inline compressed extent at file offset 0, and its
      decompressed size is 4K, allowing us to reflink that 4K range to another
      location (which will not be compressed).
      
      If we do such reflink on a subpage system, it would fail like this:
      
        # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
        XFS_IOC_CLONE_RANGE: Input/output error
      
      [CAUSE]
      In zlib_decompress(), we didn't treat @start_byte as just a page offset,
      but also use it as an indicator on whether we should switch our output
      buffer.
      
      In reality, for subpage cases, although @start_byte can be non-zero,
      we should never switch input/output buffer, since the whole input/output
      buffer should never exceed one sector.
      
      Note: The above assumption is only not true if we're going to support
      multi-page sectorsize.
      
      Thus the current code using @start_byte as a condition to switch
      input/output buffer or finish the decompression is completely incorrect.
      
      [FIX]
      The fix involves several modifications:
      
      - Rename @start_byte to @dest_pgoff to properly express its meaning
      
      - Add an extra ASSERT() inside btrfs_decompress() to make sure the
        input/output size never exceeds one sector.
      
      - Use Z_FINISH flag to make sure the decompression happens in one go
      
      - Remove the loop needed to switch input/output buffers
      
      - Use correct destination offset inside the destination page
      
      - Consider early end as an error
      
      After the fix, even on 64K page sized aarch64, above reflink now
      works as expected:
      
        # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
        linked 4096/4096 bytes at offset 61440
      
      And results in the correct file layout:
      
      	item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
      		generation 10 transid 10 size 65536 nbytes 4096
      		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
      		sequence 1 flags 0x0(none)
      	item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
      		index 3 namelen 4 name: dest
      	item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
      		location key (0 UNKNOWN.0 0) type XATTR
      		transid 10 data_len 37 name_len 16
      		name: security.selinux
      		data unconfined_u:object_r:unlabeled_t:s0
      	item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
      		generation 10 type 1 (regular)
      		extent data disk byte 13631488 nr 4096
      		extent data offset 0 nr 4096 ram 4096
      		extent compression 0 (none)
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
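A small worked calculation shows why the destination offset inside the page is non-zero only when the page is larger than the sector. The reflink in the example targets file offset 60k (61440). The helper name below is hypothetical, and the page size is assumed to be a power of two.

```c
#include <stdint.h>

/*
 * Offset of a file position inside its page. page_size is assumed to
 * be a power of two, so masking is equivalent to a modulo.
 * Hypothetical helper for illustration.
 */
static uint32_t offset_in_page_of(uint64_t file_offset, uint32_t page_size)
{
	return (uint32_t)(file_offset & (page_size - 1));
}
```

For file offset 61440 this yields 61440 with 64K pages and 0 with 4K pages, so a sector-aligned destination always lands at page offset 0 when the page size equals the sector size.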
  3. 12 Jan, 2024 9 commits
    • btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args · 173431b2
      Qu Wenruo authored
      Add extra sanity check for btrfs_ioctl_defrag_range_args::flags.
      
      This is not really to enhance fuzzing tests, but as a preparation for
      future expansion on btrfs_ioctl_defrag_range_args.
      
      In the future we're going to add new members, allowing more fine tuning
      for btrfs defrag.  Without the -EOPNOTSUPP error, there would be no way
      to detect if the kernel supports those new defrag features.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
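The sanity check described above follows a common pattern: mask off every supported flag bit and fail if anything remains. The sketch below is standalone C with locally defined, illustrative flag names; the use of EOPNOTSUPP from <errno.h> as the "not supported" error is an assumption here, and this is not the kernel's ioctl handler.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative flag bits; the names are local to this sketch. */
#define DEFRAG_RANGE_COMPRESS   (1ULL << 0)
#define DEFRAG_RANGE_START_IO   (1ULL << 1)
#define DEFRAG_RANGE_FLAGS_SUPP (DEFRAG_RANGE_COMPRESS | DEFRAG_RANGE_START_IO)

/*
 * Return 0 if only known flag bits are set, or a negative errno if the
 * caller passed a bit this implementation does not understand.
 */
static int check_defrag_flags(uint64_t flags)
{
	if (flags & ~DEFRAG_RANGE_FLAGS_SUPP)
		return -EOPNOTSUPP;
	return 0;
}
```

A caller can then probe for a future feature by setting its flag bit and checking for this specific error, instead of silently getting the old behaviour.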
    • btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted · 3324d054
      Omar Sandoval authored
      Sweet Tea spotted a race between subvolume deletion and snapshotting
      that can result in the root item for the snapshot having the
      BTRFS_ROOT_SUBVOL_DEAD flag set. The race is:
      
      Thread 1                                      | Thread 2
      ----------------------------------------------|----------
      btrfs_delete_subvolume                        |
        btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)|
                                                    |btrfs_mksubvol
                                                    |  down_read(subvol_sem)
                                                    |  create_snapshot
                                                    |    ...
                                                    |    create_pending_snapshot
                                                    |      copy root item from source
        down_write(subvol_sem)                      |
      
      This flag is only checked in send and swap activate, which this would
      cause to fail mysteriously.
      
      create_snapshot() now checks the root refs to reject a deleted
      subvolume, so we can fix this by locking subvol_sem earlier so that the
      BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically.
      
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't abort filesystem when attempting to snapshot deleted subvolume · 7081929a
      Omar Sandoval authored
      If the source file descriptor to the snapshot ioctl refers to a deleted
      subvolume, we get the following abort:
      
        BTRFS: Transaction aborted (error -2)
        WARNING: CPU: 0 PID: 833 at fs/btrfs/transaction.c:1875 create_pending_snapshot+0x1040/0x1190 [btrfs]
        Modules linked in: pata_acpi btrfs ata_piix libata scsi_mod virtio_net blake2b_generic xor net_failover virtio_rng failover scsi_common rng_core raid6_pq libcrc32c
        CPU: 0 PID: 833 Comm: t_snapshot_dele Not tainted 6.7.0-rc6 #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
        RIP: 0010:create_pending_snapshot+0x1040/0x1190 [btrfs]
        RSP: 0018:ffffa09c01337af8 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: ffff9982053e7c78 RCX: 0000000000000027
        RDX: ffff99827dc20848 RSI: 0000000000000001 RDI: ffff99827dc20840
        RBP: ffffa09c01337c00 R08: 0000000000000000 R09: ffffa09c01337998
        R10: 0000000000000003 R11: ffffffffb96da248 R12: fffffffffffffffe
        R13: ffff99820535bb28 R14: ffff99820b7bd000 R15: ffff99820381ea80
        FS:  00007fe20aadabc0(0000) GS:ffff99827dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000559a120b502f CR3: 00000000055b6000 CR4: 00000000000006f0
        Call Trace:
         <TASK>
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         ? __warn+0x81/0x130
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         ? report_bug+0x171/0x1a0
         ? handle_bug+0x3a/0x70
         ? exc_invalid_op+0x17/0x70
         ? asm_exc_invalid_op+0x1a/0x20
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         create_pending_snapshots+0x92/0xc0 [btrfs]
         btrfs_commit_transaction+0x66b/0xf40 [btrfs]
         btrfs_mksubvol+0x301/0x4d0 [btrfs]
         btrfs_mksnapshot+0x80/0xb0 [btrfs]
         __btrfs_ioctl_snap_create+0x1c2/0x1d0 [btrfs]
         btrfs_ioctl_snap_create_v2+0xc4/0x150 [btrfs]
         btrfs_ioctl+0x8a6/0x2650 [btrfs]
         ? kmem_cache_free+0x22/0x340
         ? do_sys_openat2+0x97/0xe0
         __x64_sys_ioctl+0x97/0xd0
         do_syscall_64+0x46/0xf0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
        RIP: 0033:0x7fe20abe83af
        RSP: 002b:00007ffe6eff1360 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe20abe83af
        RDX: 00007ffe6eff23c0 RSI: 0000000050009417 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000000 R09: 00007fe20ad16cd0
        R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        R13: 00007ffe6eff13c0 R14: 00007fe20ad45000 R15: 0000559a120b6d58
         </TASK>
        ---[ end trace 0000000000000000 ]---
        BTRFS: error (device vdc: state A) in create_pending_snapshot:1875: errno=-2 No such entry
        BTRFS info (device vdc: state EA): forced readonly
        BTRFS warning (device vdc: state EA): Skipping commit of aborted transaction.
        BTRFS: error (device vdc: state EA) in cleanup_transaction:2055: errno=-2 No such entry
      
      This happens because create_pending_snapshot() initializes the new root
      item as a copy of the source root item. This includes the refs field,
      which is 0 for a deleted subvolume. The call to btrfs_insert_root()
      therefore inserts a root with refs == 0. btrfs_get_new_fs_root() then
      finds the root and returns -ENOENT if refs == 0, which causes
      create_pending_snapshot() to abort.
      
      Fix it by checking the source root's refs before attempting the
      snapshot, but after locking subvol_sem to avoid racing with deletion.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
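The early check described in the last paragraph can be reduced to a guard on the source root's reference count. The sketch below is a simplified userspace model: the struct, the function name and the choice of -ENOENT as the returned error are assumptions, and the locking of subvol_sem that the commit message requires around the check is deliberately left out.

```c
#include <errno.h>
#include <stdint.h>

/* Simplified stand-in for a root item; only the field used here. */
struct root_item_model {
	uint32_t refs;
};

/*
 * Refuse to snapshot a source whose refs already dropped to 0, i.e. a
 * deleted subvolume, before any transaction work is started.
 * In the real fix this check happens with subvol_sem held.
 */
static int can_snapshot(const struct root_item_model *src)
{
	if (src->refs == 0)
		return -ENOENT;
	return 0;
}
```

Failing early with an ordinary error keeps the problem local to the ioctl, whereas discovering refs == 0 inside the transaction commit forces an abort.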
    • btrfs: zoned: fix lock ordering in btrfs_zone_activate() · b18f3b60
      Naohiro Aota authored
      The btrfs CI reported the following lockdep warning when running
      generic/129.
      
         WARNING: possible circular locking dependency detected
         6.7.0-rc5+ #1 Not tainted
         ------------------------------------------------------
         kworker/u5:5/793427 is trying to acquire lock:
         ffff88813256d028 (&cache->lock){+.+.}-{2:2}, at: btrfs_zone_finish_one_bg+0x5e/0x130
         but task is already holding lock:
         ffff88810a23a318 (&fs_info->zone_active_bgs_lock){+.+.}-{2:2}, at: btrfs_zone_finish_one_bg+0x34/0x130
         which lock already depends on the new lock.
      
         the existing dependency chain (in reverse order) is:
         -> #1 (&fs_info->zone_active_bgs_lock){+.+.}-{2:2}:
         ...
         -> #0 (&cache->lock){+.+.}-{2:2}:
         ...
      
      This is because we take fs_info->zone_active_bgs_lock after a block_group's
      lock in btrfs_zone_activate() while doing the opposite in other places.
      
      Fix the issue by expanding the fs_info->zone_active_bgs_lock's critical
      section and taking it before a block_group's lock.
      
      Fixes: a7e1ac7b ("btrfs: zoned: reserve zones for an active metadata/system block group")
      CC: stable@vger.kernel.org # 6.6
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix unbalanced unlock of mapping_tree_lock · d967c914
      Naohiro Aota authored
      The error path of btrfs_get_chunk_map() releases
      fs_info->mapping_tree_lock. But, it is taken and released in
      btrfs_find_chunk_map(). So, there is no need to do so.
      
      Fixes: 7dc66abb ("btrfs: use a dedicated data structure for chunk maps")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ref-verify: free ref cache before clearing mount opt · f03e274a
      Fedor Pchelkin authored
      As clearing REF_VERIFY mount option indicates there were some errors in a
      ref-verify process, a ref cache is not relevant anymore and should be
      freed.
      
      btrfs_free_ref_cache() requires REF_VERIFY option being set so call
      it just before clearing the mount option.
      
      Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
      
      Reported-by: syzbot+be14ed7728594dc8bd42@syzkaller.appspotmail.com
      Fixes: fd708b81 ("Btrfs: add a extent ref verify tool")
      CC: stable@vger.kernel.org # 5.4+
      Closes: https://lore.kernel.org/lkml/000000000000e5a65c05ee832054@google.com/
      Reported-by: syzbot+c563a3c79927971f950f@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/lkml/0000000000007fe09705fdc6086c@google.com/
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix kvcalloc() arguments order in btrfs_ioctl_send() · 6ff09b6b
      Dmitry Antipov authored
      When compiling with gcc version 14.0.0 20231220 (experimental)
      and W=1, I've noticed the following warning:
      
      fs/btrfs/send.c: In function 'btrfs_ioctl_send':
      fs/btrfs/send.c:8208:44: warning: 'kvcalloc' sizes specified with 'sizeof'
      in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
       8208 |         sctx->clone_roots = kvcalloc(sizeof(*sctx->clone_roots),
            |                                            ^
      
      Since 'n' and 'size' arguments of 'kvcalloc()' are multiplied to
      calculate the final size, their actual order doesn't affect the result
      and so this is not a bug. But it's still worth fixing.
      Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
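The same convention applies to the standard calloc(), which takes the element count first and the element size second. The userspace sketch below uses calloc() as a stand-in for kvcalloc(); the struct and function names are hypothetical.

```c
#include <stdlib.h>

/* Simplified stand-in for an array element; not the kernel struct. */
struct clone_root_model {
	unsigned long long id;
};

/*
 * Allocate a zeroed array of `count` elements. The count goes in the
 * first argument and the sizeof expression in the second, which is the
 * order -Wcalloc-transposed-args expects.
 */
static struct clone_root_model *alloc_clone_roots(size_t count)
{
	return calloc(count, sizeof(struct clone_root_model));
}
```

Swapping the two arguments would allocate the same number of bytes, so the change is about matching the documented parameter order and keeping W=1 builds quiet.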
    • btrfs: zoned: optimize hint byte for zoned allocator · 02444f2a
      Naohiro Aota authored
      Writing sequentially to a huge file on btrfs on a SMR HDD revealed a
      decline of the performance (220 MiB/s to 30 MiB/s after 500 minutes).
      
      The performance goes down because of increased extent allocation
      latency, which is caused by traversing a large number of full block
      groups.
      
      So, this patch optimizes the ffe_ctl->hint_byte by choosing a block group
      with sufficient size from the active block group list, which does not
      contain full block groups.
      
      After applying the patch, the performance is maintained well.
      
      Fixes: 2eda5708 ("btrfs: zoned: implement sequential extent allocation")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      02444f2a
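
The hint selection described in the commit above can be sketched roughly as follows. This is a simplified userspace model, not the kernel code: struct bg_stub, pick_hint_byte() and the array-based "active list" are assumptions made for illustration, and the real allocator also has to match block group flags and take locks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of an active zoned block group. */
struct bg_stub {
	uint64_t start;         /* logical start address */
	uint64_t zone_capacity; /* usable bytes in the zone */
	uint64_t alloc_offset;  /* bytes already allocated */
};

/*
 * Return the start of the first active block group that still has at
 * least num_bytes available, or fallback_hint if none qualifies.
 * Full block groups never appear in the active list, so the scan
 * stays short no matter how many full block groups exist.
 */
static uint64_t pick_hint_byte(const struct bg_stub *active, size_t nr,
			       uint64_t num_bytes, uint64_t fallback_hint)
{
	for (size_t i = 0; i < nr; i++) {
		uint64_t avail = active[i].zone_capacity - active[i].alloc_offset;

		if (avail >= num_bytes)
			return active[i].start;
	}
	return fallback_hint;
}
```

Starting the search at a block group that is known to have room is what avoids the long walk over full block groups.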
    • Naohiro Aota's avatar
      btrfs: zoned: factor out prepare_allocation_zoned() · b271fee9
      Naohiro Aota authored
      Factor out prepare_allocation_zoned() for further extension. While at
      it, optimize the if-branch a bit.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b271fee9
  4. 15 Dec, 2023 20 commits
    • Johannes Thumshirn's avatar
      btrfs: pass btrfs_io_geometry into btrfs_max_io_len · e94dfb7a
      Johannes Thumshirn authored
      Instead of passing three individual members of 'struct btrfs_io_geometry'
      into btrfs_max_io_len(), pass a pointer to btrfs_io_geometry.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e94dfb7a
    • Johannes Thumshirn's avatar
      btrfs: pass struct btrfs_io_geometry to set_io_stripe · 6edf6822
      Johannes Thumshirn authored
      Instead of passing three members of 'struct btrfs_io_geometry' into
      set_io_stripe(), pass a pointer to the whole structure and then get the
      needed members out of btrfs_io_geometry.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6edf6822
    • Johannes Thumshirn's avatar
      btrfs: open code set_io_stripe for RAID56 · 89f547c6
      Johannes Thumshirn authored
      Open code set_io_stripe() for RAID56, as it
      
      a) uses a different method to calculate the stripe_index
      b) doesn't need to go through raid-stripe-tree mapping code.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      89f547c6
    • Johannes Thumshirn's avatar
      btrfs: change block mapping to switch/case in btrfs_map_block · b55b3077
      Johannes Thumshirn authored
      Now that all the per-profile if/else statement blocks have been
      converted to calls to helpers, the conversion to switch/case is
      straightforward.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b55b3077
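
The shape of such a per-profile switch/case can be sketched as below. This is a heavily simplified userspace model: enum profile_stub, struct io_geom_stub and map_block_stub() are hypothetical names, the per-profile helpers are inlined into the cases for brevity, and the stripe selection rules are reduced to the bare minimum (the kernel dispatches on BTRFS_BLOCK_GROUP_* flags and does considerably more work per profile).

```c
#include <assert.h>

/* Hypothetical profile identifiers used only in this sketch. */
enum profile_stub { PROFILE_SINGLE, PROFILE_RAID0, PROFILE_RAID1, PROFILE_DUP };

/* Hypothetical, cut-down stand-in for struct btrfs_io_geometry. */
struct io_geom_stub {
	unsigned int stripe_nr;    /* logical stripe number of the I/O */
	unsigned int stripe_index; /* result: first stripe to use */
	unsigned int num_stripes;  /* result: number of stripes touched */
	int is_write;              /* non-zero for write requests */
};

/*
 * Fill in the geometry for one request. The caller guarantees that
 * chunk_stripes is greater than zero. Returns 0 on success, -1 for an
 * unknown profile.
 */
static int map_block_stub(enum profile_stub profile, unsigned int chunk_stripes,
			  struct io_geom_stub *g)
{
	g->stripe_index = 0;
	g->num_stripes = 1;

	switch (profile) {
	case PROFILE_RAID0:
		/* Data is striped: pick one device by stripe number. */
		g->stripe_index = g->stripe_nr % chunk_stripes;
		break;
	case PROFILE_RAID1:
	case PROFILE_DUP:
		/* Writes go to every copy; a read needs only one. */
		if (g->is_write)
			g->num_stripes = chunk_stripes;
		break;
	case PROFILE_SINGLE:
		break;
	default:
		return -1;
	}
	return 0;
}
```

With one case per profile, adding or changing a profile touches a single branch instead of a chain of if/else conditions.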
    • Johannes Thumshirn's avatar
      btrfs: factor out block mapping for single profiles · a16fb8c6
      Johannes Thumshirn authored
      Now that we have a container for the I/O geometry that has all the needed
      information for the block mappings of SINGLE profiles, factor out a helper
      calculating this information.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a16fb8c6
    • Johannes Thumshirn's avatar
      btrfs: factor out block mapping for RAID5/6 · 089221d3
      Johannes Thumshirn authored
      Now that we have a container for the I/O geometry that has all the needed
      information for the block mappings of RAID5 and RAID6, factor out a helper
      calculating this information.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      089221d3
    • Johannes Thumshirn's avatar
      btrfs: reduce scope of data_stripes in btrfs_map_block · d9d4ce9f
      Johannes Thumshirn authored
      Reduce the scope of 'data_stripes' in btrfs_map_block(). While the
      change alone may not make too much sense, it helps us factor out a
      helper function for the block mapping of RAID56 I/O.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d9d4ce9f
    • Johannes Thumshirn's avatar
      btrfs: factor out block mapping for RAID10 · 8938f112
      Johannes Thumshirn authored
      Now that we have a container for the I/O geometry that has all the needed
      information for the block mappings of RAID10, factor out a helper calculating
      this information.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8938f112
    • Johannes Thumshirn's avatar
      btrfs: factor out block mapping for DUP profiles · 5aeb15c8
      Johannes Thumshirn authored
      Now that we have a container for the I/O geometry that has all the needed
      information for the block mappings of DUP, factor out a helper calculating
      this information.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5aeb15c8
    • Johannes Thumshirn's avatar
      btrfs: factor out RAID1 block mapping · 5e36aba8
      Johannes Thumshirn authored
      Now that we have a container for the I/O geometry that has all the needed
      information for the block mappings of RAID1, factor out a helper calculating
      this information.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5e36aba8
    • Johannes Thumshirn's avatar
      btrfs: factor out block-mapping for RAID0 · 30e8534b
      Johannes Thumshirn authored
      Now that we have a container for the I/O geometry that has all the needed
      information for the block mappings of RAID0, factor out a helper calculating
      this information.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      30e8534b
    • Johannes Thumshirn's avatar
      btrfs: re-introduce struct btrfs_io_geometry · fd747f2d
      Johannes Thumshirn authored
      Re-introduce struct btrfs_io_geometry, holding the necessary bits and
      pieces needed in btrfs_map_block() to decide the I/O geometry of a specific
      block mapping.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fd747f2d
    • Johannes Thumshirn's avatar
      btrfs: factor out helper for single device IO check · 02d05b64
      Johannes Thumshirn authored
      The check in btrfs_map_block() deciding if a particular I/O is targeting a
      single device is getting more and more convoluted.
      
      Factor out the check conditions into a helper function, with no functional
      change otherwise.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      02d05b64
    • Qu Wenruo's avatar
      btrfs: migrate btrfs_repair_io_failure() to folio interfaces · 96c36eaa
      Qu Wenruo authored
      [BUG]
      Test case btrfs/124 fails if larger metadata folios are enabled. The
      failure looks like this:
      
       BTRFS error (device dm-2): bad tree block start, mirror 2 want 31686656 have 0
       BTRFS info (device dm-2): read error corrected: ino 0 off 31686656 (dev /dev/mapper/test-scratch2 sector 20928)
       BUG: kernel NULL pointer dereference, address: 0000000000000020
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       CPU: 6 PID: 350881 Comm: btrfs Tainted: G           OE      6.7.0-rc3-custom+ #128
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
       RIP: 0010:btrfs_read_extent_buffer+0x106/0x180 [btrfs]
       PKRU: 55555554
       Call Trace:
        <TASK>
        read_tree_block+0x33/0xb0 [btrfs]
        read_block_for_search+0x23e/0x340 [btrfs]
        btrfs_search_slot+0x2f9/0xe60 [btrfs]
        btrfs_lookup_csum+0x75/0x160 [btrfs]
        btrfs_lookup_bio_sums+0x21a/0x560 [btrfs]
        btrfs_submit_chunk+0x152/0x680 [btrfs]
        btrfs_submit_bio+0x1c/0x50 [btrfs]
        submit_one_bio+0x40/0x80 [btrfs]
        submit_extent_page+0x158/0x390 [btrfs]
        btrfs_do_readpage+0x330/0x740 [btrfs]
        extent_readahead+0x38d/0x6c0 [btrfs]
        read_pages+0x94/0x2c0
        page_cache_ra_unbounded+0x12d/0x190
        relocate_file_extent_cluster+0x7c1/0x9d0 [btrfs]
        relocate_block_group+0x2d3/0x560 [btrfs]
        btrfs_relocate_block_group+0x2c7/0x4b0 [btrfs]
        btrfs_relocate_chunk+0x4c/0x1a0 [btrfs]
        btrfs_balance+0x925/0x13c0 [btrfs]
        btrfs_ioctl+0x19f1/0x25d0 [btrfs]
        __x64_sys_ioctl+0x90/0xd0
        do_syscall_64+0x3f/0xf0
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
      [CAUSE]
      The crash happens at the btrfs_repair_io_failure() call inside
      btrfs_repair_eb_io_failure().
      
      The function still relies on the extent buffer using page sized
      folios.
      When the extent buffer uses a larger folio, we go into the 2nd slot of
      folios[] and trigger the NULL pointer dereference.
      
      [FIX]
      Migrate btrfs_repair_io_failure() to folio interfaces.
      
      So that when we hit a larger folio, we just submit the whole folio in
      one go.
      
      This also affects the data repair path through btrfs_end_repair_bio().
      Thankfully data is still fully page based, so we can just add an
      ASSERT() and use page_folio() to convert the page to a folio.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      96c36eaa
    • Qu Wenruo's avatar
      btrfs: migrate eb_bitmap_offset() to folio interfaces · f4521b01
      Qu Wenruo authored
      [BUG]
      Test case btrfs/002 would fail if larger folios are enabled for
      metadata:
      
       assertion failed: folio, in fs/btrfs/extent_io.c:4358
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/extent_io.c:4358!
       invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 1 PID: 30916 Comm: fsstress Tainted: G           OE      6.7.0-rc3-custom+ #128
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
       RIP: 0010:assert_eb_folio_uptodate+0x98/0xe0 [btrfs]
       Call Trace:
        <TASK>
        extent_buffer_test_bit+0x3c/0x70 [btrfs]
        free_space_test_bit+0xcd/0x140 [btrfs]
        modify_free_space_bitmap+0x27a/0x430 [btrfs]
        add_to_free_space_tree+0x8d/0x160 [btrfs]
        __btrfs_free_extent.isra.0+0xef1/0x13c0 [btrfs]
        __btrfs_run_delayed_refs+0x786/0x13c0 [btrfs]
        btrfs_run_delayed_refs+0x33/0x120 [btrfs]
        btrfs_commit_transaction+0xa2/0x1350 [btrfs]
        iterate_supers+0x77/0xe0
        ksys_sync+0x60/0xa0
        __do_sys_sync+0xa/0x20
        do_syscall_64+0x3f/0xf0
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
        </TASK>
      
      [CAUSE]
      The function extent_buffer_test_bit() is not folio compatible.
      
      It still assumes the old fixed page size. When an extent buffer with a
      large folio is passed in, only eb->folios[0] is populated.
      
      If the target bit range then falls in the 2nd page of the folio, we
      check eb->folios[1] and trigger the ASSERT().
      
      [FIX]
      Just migrate eb_bitmap_offset() to folio interfaces, using the
      folio_size() to replace PAGE_SIZE.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f4521b01
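
The index arithmetic behind the failure in the commit above can be modelled in userspace as follows. This is only a sketch of the arithmetic: bitmap_offset_stub() and its parameters are hypothetical, and the unit size is passed in as a plain number where the kernel would use PAGE_SIZE (old code) or folio_size() (new code).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Locate the byte holding bit 'nr' of a bitmap that begins 'start'
 * bytes into an extent buffer.
 *
 * unit_size         size of each backing unit, a power of two
 * eb_offset_in_unit where the extent buffer begins inside its first unit
 * unit_index        result: which backing unit holds the byte
 * unit_offset       result: byte offset inside that unit
 */
static void bitmap_offset_stub(size_t unit_size, size_t eb_offset_in_unit,
			       size_t start, size_t nr,
			       size_t *unit_index, size_t *unit_offset)
{
	size_t byte_offset = nr / 8;	/* byte that contains the bit */
	size_t offset = eb_offset_in_unit + start + byte_offset;

	*unit_index = offset / unit_size;
	*unit_offset = offset & (unit_size - 1);
}
```

For a 16K extent buffer backed by a single 16K folio, a bit stored 5000 bytes in resolves to index 1 when the unit size is 4096, which is an unpopulated folios[] slot, but to index 0 when the unit size is the folio size.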
    • Qu Wenruo's avatar
      btrfs: migrate various end io functions to folios · a700ca5e
      Qu Wenruo authored
      If we keep using the old page based iterator functions, like
      bio_for_each_segment_all(), we can hit middle pages of a folio (compound
      page).
      
      In that case, if we set any page flag on those middle pages, we can
      easily trigger VM_BUG_ON(), as compound page flags should
      follow their flag policies (normally only set on leading or tail pages).
      
      To avoid such problems in the future full folio migration, here we do:
      
      - Change from bio_for_each_segment_all() to bio_for_each_folio_all()
        This completely removes the ability to access the middle page.
      
      - Add extra ASSERT()s for data read/write paths
        To ensure we only get single paged folio for data now.
      
      - Rename those end io functions to follow a certain schema
        * end_bbio_compressed_read()
        * end_bbio_compressed_write()
      
          These two endio functions don't set any page flags, as they use pages
          not mapped to any address space.
          They can be very good candidates for higher order folio testing.
      
          And they are shared between compression and encoded IO.
      
        * end_bbio_data_read()
        * end_bbio_data_write()
        * end_bbio_meta_read()
        * end_bbio_meta_write()
      
        The old function names are not unified:
          - end_bio_extent_writepage()
          - end_bio_extent_readpage()
          - extent_buffer_write_end_io()
          - extent_buffer_read_end_io()
      
        They share no schema on where the "end_*io" string should be, and it
        can be confusing to use just "extent_buffer" and "extent" to
        distinguish data and metadata paths.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a700ca5e
    • Qu Wenruo's avatar
      btrfs: migrate subpage code to folio interfaces · 55151ea9
      Qu Wenruo authored
      Although subpage itself conflicts with higher order folios, since subpage
      (sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never
      need higher order folios, there is a hidden pitfall:
      
      - btrfs_page_*() helpers
      
      Those helpers are an abstraction to handle both subpage and non-subpage
      cases, which means we're going to pass page pointers to those helpers.
      
      And since those helpers are shared between data and metadata paths, they
      unavoidably have to handle folios, including higher order folios.
      
      Meanwhile for the true subpage case, we should only have single page
      backed folios anyway, thus add a new ASSERT() to btrfs_subpage_assert()
      to ensure that.
      
      Also since those helpers are shared between both data and metadata, add
      some extra ASSERT()s to the data path to make sure we only get single
      page backed folios for now.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      55151ea9
    • Qu Wenruo's avatar
      btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios · 8d993618
      Qu Wenruo authored
      These two functions are still using the old page based code, which is
      not going to handle larger folios at all.
      
      The migration itself is going to involve the following changes:
      
      - PAGE_SIZE -> folio_size()
      - PAGE_SHIFT -> folio_shift()
      - get_eb_page_index() -> get_eb_folio_index()
      - get_eb_offset_in_page() -> get_eb_offset_in_folio()
      
      And since we're going to support larger folios, although the above
      straight conversion is good enough, this patch adds extra comments in
      the involved functions to explain why the same single line of code can
      now cover 3 cases:
      
      - folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
        The common, non-subpage case with per-page folio.
      
      - folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
        The incoming larger folio, non-subpage case.
      
      - folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
        The existing subpage case, where we won't use larger folios anyway.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8d993618
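
How one shift and one mask can cover all three cases listed in the commit above can be checked with a small userspace model. This is a sketch under stated assumptions: struct eb_stub and the two helpers are hypothetical stand-ins that keep only the fields the arithmetic needs, with folio_shift stored as a plain number instead of being read from a real folio.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an extent buffer: only what the math needs. */
struct eb_stub {
	uint64_t start;           /* logical bytenr of the extent buffer */
	unsigned int folio_shift; /* log2 of the backing folio size */
};

/* Index of the folio that holds byte 'offset' of the extent buffer. */
static unsigned long eb_folio_index(const struct eb_stub *eb,
				    unsigned long offset)
{
	return offset >> eb->folio_shift;
}

/*
 * Offset of that byte inside its folio. Adding eb->start changes
 * nothing when the buffer is folio aligned, and selects the right
 * slice of a shared page in the subpage case.
 */
static unsigned long eb_offset_in_folio(const struct eb_stub *eb,
					unsigned long offset)
{
	uint64_t folio_size = (uint64_t)1 << eb->folio_shift;

	return (unsigned long)((eb->start + offset) & (folio_size - 1));
}
```

With page sized folios the index advances per page, with one large folio the shift is big enough that the index is always 0, and in the subpage case the index is also 0 while eb->start positions the access inside the shared page.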
    • Josef Bacik's avatar
      btrfs: don't double put our subpage reference in alloc_extent_buffer · 4a565c80
      Josef Bacik authored
      This fixes a case in "btrfs: refactor alloc_extent_buffer() to
      allocate-then-attach method".
      
      We have been seeing panics in the CI for the subpage stuff recently. They
      happen on btrfs/187 but could potentially happen anywhere.
      
      In the subpage case, if we race with somebody else inserting the same
      extent buffer, the error case will end up calling
      detach_extent_buffer_page() on the page twice.
      
      This is done first in the bit
      
      for (int i = 0; i < attached; i++)
      	detach_extent_buffer_page(eb, eb->pages[i]);
      
      and then again in btrfs_release_extent_buffer().
      
      This works fine for !subpage because we're the only person who ever has
      ourselves on the private, and so when we do the initial
      detach_extent_buffer_page() we know we've completely removed it.
      
      However for subpage we could be using this page private elsewhere, so
      this results in a double put on the subpage, which can result in an
      early freeing.
      
      The fix here is to clear eb->pages[i] for everything we detach.  Then
      anything still attached to the eb is freed in
      btrfs_release_extent_buffer().
      
      Because of this change we must update
      btrfs_release_extent_buffer_pages() to not use num_extent_folios(),
      because it assumes eb->folios[0] is set properly.  Since this is only
      interested in freeing any pages we have on the extent buffer we can
      simply use INLINE_EXTENT_BUFFER_PAGES.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4a565c80
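
The double put and its fix in the commit above can be modelled with a plain reference counter. This is a minimal userspace sketch, not the kernel implementation: struct page_stub, struct eb_stub, NR_SLOTS and the three functions are hypothetical, and the counter stands in for the shared subpage reference.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS 4	/* hypothetical stand-in for INLINE_EXTENT_BUFFER_PAGES */

/* Hypothetical page whose private data can be shared between buffers. */
struct page_stub {
	int refs;
};

/* Hypothetical extent buffer holding up to NR_SLOTS page pointers. */
struct eb_stub {
	struct page_stub *pages[NR_SLOTS];
};

/* Drop this buffer's reference on one page. */
static void detach_page(struct page_stub *page)
{
	page->refs--;
}

/* Error path: detach the first 'attached' pages and forget them. */
static void undo_attach(struct eb_stub *eb, int attached)
{
	for (int i = 0; i < attached; i++) {
		detach_page(eb->pages[i]);
		eb->pages[i] = NULL;	/* prevents a second put on release */
	}
}

/* Final release: walk every slot and skip the ones already cleared. */
static void release_pages(struct eb_stub *eb)
{
	for (int i = 0; i < NR_SLOTS; i++) {
		if (!eb->pages[i])
			continue;
		detach_page(eb->pages[i]);
		eb->pages[i] = NULL;
	}
}
```

Because the release loop walks a fixed number of slots and checks each pointer, it does not depend on the first slot still being populated, which mirrors the reason given above for iterating up to INLINE_EXTENT_BUFFER_PAGES.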
    • Qu Wenruo's avatar
      btrfs: cleanup metadata page pointer usage · 13df3775
      Qu Wenruo authored
      Although we have migrated extent_buffer::pages[] to folios[], we're
      still mostly using the folio_page() helper to grab the page.
      
      This patch would do the following cleanups for metadata:
      
      - Introduce num_extent_folios() helper
        This is to replace most num_extent_pages() callers.
      
      - Use num_extent_folios() to iterate future large folios
        This allows us to use things like
        bio_add_folio()/bio_add_folio_nofail(), and only set the needed flags
        for the folio (aka the leading/tail page), which reduces the loop
        iteration to 1 for large folios.
      
      - Change metadata related functions to use folio pointers
        Including their function name, involving:
        * attach_extent_buffer_page()
        * detach_extent_buffer_page()
        * page_range_has_eb()
        * btrfs_release_extent_buffer_pages()
        * btree_clear_page_dirty()
        * btrfs_page_inc_eb_refs()
        * btrfs_page_dec_eb_refs()
      
      - Change btrfs_is_subpage() to accept an address_space pointer
        This is to allow both page->mapping and folio->mapping to be utilized,
        as data is still using the old per-page code and may stay that way for
        a while.
      
      - Special corner case place holder for future order mismatches between
        extent buffer and inode filemap
        For now it's just a block of comments and a dead ASSERT(), with no real
        handling yet.
      
      The subpage code still uses pages, because subpage and large
      folios are conflicting conditions, thus we don't need to bother subpage
      with higher order folios at all. Just folio_page(folio, 0) is
      enough.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor styling tweaks ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      13df3775