1. 05 Jun, 2024 1 commit
    • Omar Sandoval's avatar
      btrfs: fix crash on racing fsync and size-extending write into prealloc · 9d274c19
      Omar Sandoval authored
      We have been seeing crashes on duplicate keys in
      btrfs_set_item_key_safe():
      
        BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.c:2620!
        invalid opcode: 0000 [#1] PREEMPT SMP PTI
        CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
        RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]
      
      With the following stack trace:
      
        #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
        #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
        #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
        #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
        #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
        #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
        #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
        #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
        #8  vfs_fsync_range (fs/sync.c:188:9)
        #9  vfs_fsync (fs/sync.c:202:9)
        #10 do_fsync (fs/sync.c:212:9)
        #11 __do_sys_fdatasync (fs/sync.c:225:9)
        #12 __se_sys_fdatasync (fs/sync.c:223:1)
        #13 __x64_sys_fdatasync (fs/sync.c:223:1)
        #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
        #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
        #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)
      
      So we're logging a changed extent from fsync, which is splitting an
      extent in the log tree. But this split part already exists in the tree,
      triggering the BUG().
      
      This is the state of the log tree at the time of the crash, dumped with
      drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
      to get more details than btrfs_print_leaf() gives us:
      
        >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
        leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
        leaf 33439744 flags 0x100000000000000
        fs uuid e5bd3946-400c-4223-8923-190ef1f18677
        chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
                item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                        generation 7 transid 9 size 8192 nbytes 8473563889606862198
                        block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                        sequence 204 flags 0x10(PREALLOC)
                        atime 1716417703.220000000 (2024-05-22 15:41:43)
                        ctime 1716417704.983333333 (2024-05-22 15:41:44)
                        mtime 1716417704.983333333 (2024-05-22 15:41:44)
                        otime 17592186044416.000000000 (559444-03-08 01:40:16)
                item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                        index 195 namelen 3 name: 193
                item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                        location key (0 UNKNOWN.0 0) type XATTR
                        transid 7 data_len 1 name_len 6
                        name: user.a
                        data a
                item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                        generation 9 type 1 (regular)
                        extent data disk byte 303144960 nr 12288
                        extent data offset 0 nr 4096 ram 12288
                        extent compression 0 (none)
                item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                        generation 9 type 2 (prealloc)
                        prealloc data disk byte 303144960 nr 12288
                        prealloc data offset 4096 nr 8192
                item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                        generation 9 type 2 (prealloc)
                        prealloc data disk byte 303144960 nr 12288
                        prealloc data offset 8192 nr 4096
        ...
      
      So the real problem happened earlier: notice that items 4 (4k-12k) and 5
      (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
      item 5 starts at i_size.
      
      Here is the state of the filesystem tree at the time of the crash:
      
        >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
        >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
        >>> print_extent_buffer(nodes[0])
        leaf 30425088 level 0 items 184 generation 9 owner 5
        leaf 30425088 flags 0x100000000000000
        fs uuid e5bd3946-400c-4223-8923-190ef1f18677
        chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
        	...
                item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                        generation 7 transid 7 size 4096 nbytes 12288
                        block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                        sequence 6 flags 0x10(PREALLOC)
                        atime 1716417703.220000000 (2024-05-22 15:41:43)
                        ctime 1716417703.220000000 (2024-05-22 15:41:43)
                        mtime 1716417703.220000000 (2024-05-22 15:41:43)
                        otime 1716417703.220000000 (2024-05-22 15:41:43)
                item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                        index 195 namelen 3 name: 193
                item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                        location key (0 UNKNOWN.0 0) type XATTR
                        transid 7 data_len 1 name_len 6
                        name: user.a
                        data a
                item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                        generation 9 type 1 (regular)
                        extent data disk byte 303144960 nr 12288
                        extent data offset 0 nr 8192 ram 12288
                        extent compression 0 (none)
                item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                        generation 9 type 2 (prealloc)
                        prealloc data disk byte 303144960 nr 12288
                        prealloc data offset 8192 nr 4096
      
      Item 5 in the log tree corresponds to item 183 in the filesystem tree,
      but nothing matches item 4. Furthermore, item 183 is the last item in
      the leaf.
      
      btrfs_log_prealloc_extents() is responsible for logging prealloc extents
      beyond i_size. It first truncates any previously logged prealloc extents
      that start beyond i_size. Then, it walks the filesystem tree and copies
      the prealloc extent items to the log tree.
      
      If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
      unlocks the tree and does another search. However, while the filesystem
      tree is unlocked, an ordered extent completion may modify the tree. In
      particular, it may insert an extent item that overlaps with an extent
      item that was already copied to the log tree.
      
      This may manifest in several ways depending on the exact scenario,
      including an EEXIST error that is silently translated to a full sync,
      overlapping items in the log tree, or this crash. This particular crash
      is triggered by the following sequence of events:
      
      - Initially, the file has i_size=4k, a regular extent from 0-4k, and a
        prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
        the last item in its B-tree leaf.
      - The file is fsync'd, which copies its inode item and both extent items
        to the log tree.
      - An xattr is set on the file, which sets the
        BTRFS_INODE_COPY_EVERYTHING flag.
      - The range 4k-8k in the file is written using direct I/O. i_size is
        extended to 8k, but the ordered extent is still in flight.
      - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
        calls copy_inode_items_to_log(), which calls
        btrfs_log_prealloc_extents().
      - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
        filesystem tree. Since it starts before i_size, it skips it. Since it
        is the last item in its B-tree leaf, it calls btrfs_next_leaf().
      - btrfs_next_leaf() unlocks the path.
      - The ordered extent completion runs, which converts the 4k-8k part of
        the prealloc extent to written and inserts the remaining prealloc part
        from 8k-12k.
      - btrfs_next_leaf() does a search and finds the new prealloc extent
        8k-12k.
      - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
        the log tree. Note that it overlaps with the 4k-12k prealloc extent
        that was copied to the log tree by the first fsync.
      - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
        extent that was written.
      - This tries to drop the range 4k-8k in the log tree, which requires
        adjusting the start of the 4k-12k prealloc extent in the log tree to
        8k.
      - btrfs_set_item_key_safe() sees that there is already an extent
        starting at 8k in the log tree and calls BUG().
      
      Fix this by detecting when we're about to insert an overlapping file
      extent item in the log tree and truncating the part that would overlap.
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9d274c19
  2. 28 May, 2024 1 commit
    • Filipe Manana's avatar
      btrfs: ensure fast fsync waits for ordered extents after a write failure · f13e01b8
      Filipe Manana authored
      If a write path in COW mode fails, either before submitting a bio for the
      new extents or an actual IO error happens, we can end up allowing a fast
      fsync to log file extent items that point to unwritten extents.
      
      This is because dropping the extent maps happens when completing ordered
      extents, at btrfs_finish_one_ordered(), and the completion of an ordered
      extent is executed in a work queue.
      
      This can result in a fast fsync to start logging file extent items based
      on existing extent maps before the ordered extents complete, therefore
      resulting in a log that has file extent items that point to unwritten
      extents, resulting in a corrupt file if a crash happens after and the log
      tree is replayed the next time the fs is mounted.
      
      This can happen for both direct IO writes and buffered writes.
      
      For example consider a direct IO write, in COW mode, that fails at
      btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
      error:
      
      1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
         set to false, meaning an error happened;
      
      2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
         flag;
      
      3) btrfs_finish_ordered_extent() queues the completion of the ordered
         extent - so that btrfs_finish_one_ordered() will be executed later in
         a work queue. That function will drop extent maps in the range when
         it's executed, since the extent maps point to unwritten locations
         (signaled by the BTRFS_ORDERED_IOERR flag);
      
      4) After calling btrfs_finish_ordered_extent() we keep going down the
         write path and unlock the inode;
      
      5) After that a fast fsync starts and locks the inode;
      
      6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
         task sees the extent maps that point to the unwritten locations and
         logs file extent items based on them - it does not know they are
         unwritten, and the fast fsync path does not wait for ordered extents
         to complete, which is an intentional behaviour in order to reduce
         latency.
      
      For the buffered write case, here's one example:
      
      1) A fast fsync begins, and it starts by flushing delalloc and waiting for
         the writeback to complete by calling filemap_fdatawait_range();
      
      2) Flushing the dellaloc created a new extent map X;
      
      3) During the writeback some IO error happened, and at the end io callback
         (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which
         sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its
         completion;
      
      4) After queuing the ordered extent completion, the end io callback clears
         the writeback flag from all pages (or folios), and from that moment the
         fast fsync can proceed;
      
      5) The fast fsync proceeds sees extent map X and logs a file extent item
         based on extent map X, resulting in a log that points to an unwritten
         data extent - because the ordered extent completion hasn't run yet, it
         happens only after the logging.
      
      To fix this make btrfs_finish_ordered_extent() set the inode flag
      BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write,
      so that a fast fsync will wait for ordered extent completion.
      
      Note that this issues of using extent maps that point to unwritten
      locations can not happen for reads, because in read paths we start by
      locking the extent range and wait for any ordered extents in the range
      to complete before looking for extent maps.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f13e01b8
  3. 21 May, 2024 1 commit
  4. 15 May, 2024 5 commits
    • Filipe Manana's avatar
      btrfs: fix end of tree detection when searching for data extent ref · dddff821
      Filipe Manana authored
      At lookup_extent_data_ref() we are incorrectly checking if we are at the
      last slot of the last leaf in the extent tree. We are returning -ENOENT
      if btrfs_next_leaf() returns a value greater than 1, but btrfs_next_leaf()
      never returns anything greater than 1:
      
      1) It returns < 0 on error;
      
      2) 0 if there is a next leaf (or a new item was added to the end of the
         current leaf after releasing the path);
      
      3) 1 if there are no more leaves (and no new items were added to the last
         leaf after releasing the path).
      
      So fix this by checking if the return value is greater than zero instead
      of being greater than one.
      
      Fixes: 1618aa3c ("btrfs: simplify return variables in lookup_extent_data_ref()")
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dddff821
    • Lu Yao's avatar
      btrfs: scrub: initialize ret in scrub_simple_mirror() to fix compilation warning · b4e585ff
      Lu Yao authored
      The following error message is displayed:
        ../fs/btrfs/scrub.c:2152:9: error: ‘ret’ may be used uninitialized
        in this function [-Werror=maybe-uninitialized]"
      
      Compiler version: gcc version: (Debian 10.2.1-6) 10.2.1 20210110
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarLu Yao <yaolu@kylinos.cn>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b4e585ff
    • Filipe Manana's avatar
      btrfs: zoned: fix use-after-free due to race with dev replace · 0090d6e1
      Filipe Manana authored
      While loading a zone's info during creation of a block group, we can race
      with a device replace operation and then trigger a use-after-free on the
      device that was just replaced (source device of the replace operation).
      
      This happens because at btrfs_load_zone_info() we extract a device from
      the chunk map into a local variable and then use the device while not
      under the protection of the device replace rwsem. So if there's a device
      replace operation happening when we extract the device and that device
      is the source of the replace operation, we will trigger a use-after-free
      if before we finish using the device the replace operation finishes and
      frees the device.
      
      Fix this by enlarging the critical section under the protection of the
      device replace rwsem so that all uses of the device are done inside the
      critical section.
      
      CC: stable@vger.kernel.org # 6.1.x: 15c12fcc: btrfs: zoned: introduce a zone_info struct in btrfs_load_block_group_zone_info
      CC: stable@vger.kernel.org # 6.1.x: 09a46725: btrfs: zoned: factor out per-zone logic from btrfs_load_block_group_zone_info
      CC: stable@vger.kernel.org # 6.1.x: 9e0e3e74: btrfs: zoned: factor out single bg handling from btrfs_load_block_group_zone_info
      CC: stable@vger.kernel.org # 6.1.x: 87463f7e: btrfs: zoned: factor out DUP bg handling from btrfs_load_block_group_zone_info
      CC: stable@vger.kernel.org # 6.1.x
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0090d6e1
    • Boris Burkov's avatar
      btrfs: qgroup: fix qgroup id collision across mounts · 2b8aa78c
      Boris Burkov authored
      If we delete subvolumes whose ID is the largest in the filesystem, then
      unmount and mount again, then btrfs_init_root_free_objectid on the
      tree_root will select a subvolid smaller than that one and thus allow
      reusing it.
      
      If we are also using qgroups (and particularly squotas) it is possible
      to delete the subvol without deleting the qgroup. In that case, we will
      be able to create a new subvol whose id already has a level 0 qgroup.
      This will result in re-using that qgroup which would then lead to
      incorrect accounting.
      
      Fixes: 6ed05643 ("btrfs: create qgroup earlier in snapshot creation")
      CC: stable@vger.kernel.org # 6.7+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2b8aa78c
    • David Sterba's avatar
      btrfs: qgroup: update rescan message levels and error codes · 1fa7603d
      David Sterba authored
      On filesystems without enabled quotas there's still a warning message in
      the logs when rescan is called. In that case it's not a problem that
      should be reported, rescan can be called unconditionally.  Change the
      error code to ENOTCONN which is used for 'quotas not enabled' elsewhere.
      
      Remove message (also a warning) when rescan is called during an ongoing
      rescan, this brings no useful information and the error code is
      sufficient.
      
      Change message levels to debug for now, they can be removed eventually.
      
      CC: stable@vger.kernel.org # 6.6+
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1fa7603d
  5. 07 May, 2024 32 commits