1. 23 Oct, 2023 1 commit
    • Filipe Manana's avatar
      btrfs: fix unwritten extent buffer after snapshotting a new subvolume · eb96e221
      Filipe Manana authored
      When creating a snapshot of a subvolume that was created in the current
      transaction, we can end up not persisting a dirty extent buffer that is
      referenced by the snapshot, resulting in IO errors due to checksum failures
      when trying to read the extent buffer later from disk. A sequence of steps
      that leads to this is the following:
      
      1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical
         address 36007936, for the leaf/root of a new subvolume that has an ID
         of 291. We mark the extent buffer as dirty, and at this point the
         subvolume tree has a single node/leaf which is also its root (level 0);
      
      2) We no longer commit the transaction used to create the subvolume at
         create_subvol(). We used to, but that was recently removed in
         commit 1b53e51a ("btrfs: don't commit transaction for every subvol
         create");
      
      3) The transaction used to create the subvolume has an ID of 33, so the
         extent buffer 36007936 has a generation of 33;
      
      4) Several updates happen to subvolume 291 during transaction 33, several
         files created and its tree height changes from 0 to 1, so we end up with
         a new root at level 1 and the extent buffer 36007936 is now a leaf of
         that new root node, which is extent buffer 36048896.
      
         The commit root remains as 36007936, since we are still at transaction
         33;
      
      5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at
         ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and
         we end up at transaction.c:create_pending_snapshot(), in the critical
         section of a transaction commit.
      
         There we COW the root of subvolume 291, which is extent buffer 36048896.
         The COW operation returns extent buffer 36048896, since there's no need
         to COW because the extent buffer was created in this transaction and it
         was not written yet.
      
         The we call btrfs_copy_root() against the root node 36048896. During
         this operation we allocate a new extent buffer to turn into the root
         node of the snapshot, copy the contents of the root node 36048896 into
         this snapshot root extent buffer, set the owner to 292 (the ID of the
         snapshot), etc, and then we call btrfs_inc_ref(). This will create a
         delayed reference for each leaf pointed by the root node with a
         reference root of 292 - this includes a reference for the leaf
         36007936.
      
         After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state.
      
         Then we call btrfs_insert_dir_item(), to create the directory entry in
         in the tree of subvolume 291 that points to the snapshot. This ends up
         needing to modify leaf 36007936 to insert the respective directory
         items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state,
         we need to COW the leaf. We end up at btrfs_force_cow_block() and then
         at update_ref_for_cow().
      
         At update_ref_for_cow() we call btrfs_block_can_be_shared() which
         returns false, despite the fact the leaf 36007936 is shared - the
         subvolume's root and the snapshot's root point to that leaf. The
         reason that it incorrectly returns false is because the commit root
         of the subvolume is extent buffer 36007936 - it was the initial root
         of the subvolume when we created it. So btrfs_block_can_be_shared()
         which has the following logic:
      
         int btrfs_block_can_be_shared(struct btrfs_root *root,
                                       struct extent_buffer *buf)
         {
             if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
                 buf != root->node && buf != root->commit_root &&
                 (btrfs_header_generation(buf) <=
                  btrfs_root_last_snapshot(&root->root_item) ||
                  btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
                     return 1;
      
             return 0;
         }
      
         Returns false (0) since 'buf' (extent buffer 36007936) matches the
         root's commit root.
      
         As a result, at update_ref_for_cow(), we don't check for the number
         of references for extent buffer 36007936, we just assume it's not
         shared and therefore that it has only 1 reference, so we set the local
         variable 'refs' to 1.
      
         Later on, in the final if-else statement at update_ref_for_cow():
      
         static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
                                                struct btrfs_root *root,
                                                struct extent_buffer *buf,
                                                struct extent_buffer *cow,
                                                int *last_ref)
         {
            (...)
            if (refs > 1) {
                (...)
            } else {
                (...)
                btrfs_clear_buffer_dirty(trans, buf);
                *last_ref = 1;
            }
         }
      
         So we mark the extent buffer 36007936 as not dirty, and as a result
         we don't write it to disk later in the transaction commit, despite the
         fact that the snapshot's root points to it.
      
         Attempting to access the leaf or dumping the tree for example shows
         that the extent buffer was not written:
      
         $ btrfs inspect-internal dump-tree -t 292 /dev/sdb
         btrfs-progs v6.2.2
         file tree key (292 ROOT_ITEM 33)
         node 36110336 level 1 items 2 free space 119 generation 33 owner 292
         node 36110336 flags 0x1(WRITTEN) backref revision 1
         checksum stored a8103e3e
         checksum calced a8103e3e
         fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
         chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07
                 key (256 INODE_ITEM 0) block 36007936 gen 33
                 key (257 EXTENT_DATA 0) block 36052992 gen 33
         checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
         checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
         total bytes 107374182400
         bytes used 38572032
         uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
      
         The respective on disk region is full of zeroes as the device was
         trimmed at mkfs time.
      
         Obviously 'btrfs check' also detects and complains about this:
      
         $ btrfs check /dev/sdb
         Opening filesystem to check...
         Checking filesystem on /dev/sdb
         UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
         generation: 33 (33)
         [1/7] checking root items
         [2/7] checking extents
         checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
         checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
         checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
         bad tree block 36007936, bytenr mismatch, want=36007936, have=0
         owner ref check failed [36007936 4096]
         ERROR: errors found in extent allocation tree or chunk allocation
         [3/7] checking free space tree
         [4/7] checking fs roots
         checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
         checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
         checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
         bad tree block 36007936, bytenr mismatch, want=36007936, have=0
         The following tree block(s) is corrupted in tree 292:
              tree block bytenr: 36110336, level: 1, node key: (256, 1, 0)
         root 292 root dir 256 not found
         ERROR: errors found in fs roots
         found 38572032 bytes used, error(s) found
         total csum bytes: 16048
         total tree bytes: 1265664
         total fs tree bytes: 1118208
         total extent tree bytes: 65536
         btree space waste bytes: 562598
         file data blocks allocated: 65978368
          referenced 36569088
      
      Fix this by updating btrfs_block_can_be_shared() to consider that an
      extent buffer may be shared if it matches the commit root and if its
      generation matches the current transaction's generation.
      
      This can be reproduced with the following script:
      
         $ cat test.sh
         #!/bin/bash
      
         MNT=/mnt/sdi
         DEV=/dev/sdi
      
         # Use a filesystem with a 64K node size so that we have the same node
         # size on every machine regardless of its page size (on x86_64 default
         # node size is 16K due to the 4K page size, while on PPC it's 64K by
         # default). This way we can make sure we are able to create a btree for
         # the subvolume with a height of 2.
         mkfs.btrfs -f -n 64K $DEV
         mount $DEV $MNT
      
         btrfs subvolume create $MNT/subvol
      
         # Create a few empty files on the subvolume, this bumps its btree
         # height to 2 (root node at level 1 and 2 leaves).
         for ((i = 1; i <= 300; i++)); do
             echo -n > $MNT/subvol/file_$i
         done
      
         btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap
      
         umount $DEV
      
         btrfs check $DEV
      
      Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment):
      
         $ ./test.sh
         Create subvolume '/mnt/sdi/subvol'
         Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap'
         Opening filesystem to check...
         Checking filesystem on /dev/sdi
         UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1
         [1/7] checking root items
         [2/7] checking extents
         parent transid verify failed on 30539776 wanted 7 found 5
         parent transid verify failed on 30539776 wanted 7 found 5
         parent transid verify failed on 30539776 wanted 7 found 5
         Ignoring transid failure
         owner ref check failed [30539776 65536]
         ERROR: errors found in extent allocation tree or chunk allocation
         [3/7] checking free space tree
         [4/7] checking fs roots
         parent transid verify failed on 30539776 wanted 7 found 5
         Ignoring transid failure
         Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0)
         Wrong generation of child node/leaf, wanted: 5, have: 7
         root 257 root dir 256 not found
         ERROR: errors found in fs roots
         found 917504 bytes used, error(s) found
         total csum bytes: 0
         total tree bytes: 851968
         total fs tree bytes: 393216
         total extent tree bytes: 65536
         btree space waste bytes: 736550
         file data blocks allocated: 0
          referenced 0
      
      A test case for fstests will follow soon.
      
      Fixes: 1b53e51a ("btrfs: don't commit transaction for every subvol create")
      CC: stable@vger.kernel.org # 6.5+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eb96e221
  2. 15 Oct, 2023 1 commit
    • Zygo Blaxell's avatar
      btrfs: fix stripe length calculation for non-zoned data chunk allocation · 8a540e99
      Zygo Blaxell authored
      Commit f6fca391 "btrfs: store chunk size in space-info struct"
      broke data chunk allocations on non-zoned multi-device filesystems when
      using default chunk_size.  Commit 5da431b7 "btrfs: fix the max chunk
      size and stripe length calculation" partially fixed that, and this patch
      completes the fix for that case.
      
      After commit f6fca391 and 5da431b7, the sequence of events for
      a data chunk allocation on a non-zoned filesystem is:
      
              1.  btrfs_create_chunk calls init_alloc_chunk_ctl, which copies
              space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len
              unmodified.  Before f6fca391, ctl->max_stripe_len value was
              1 GiB for non-zoned data chunks and not configurable.
      
              2.  btrfs_create_chunk calls gather_device_info which consumes
              and produces more fields of chunk_ctl.
      
              3.  gather_device_info multiplies ctl->max_stripe_len by
              ctl->dev_stripes (which is 1 in all cases except dup)
              and calls find_free_dev_extent with that number as num_bytes.
      
              4.  find_free_dev_extent locates the first dev_extent hole on
              a device which is at least as large as num_bytes.  With default
              max_chunk_size from f6fca391, it finds the first hole which is
              longer than 10 GiB, or the largest hole if that hole is shorter
              than 10 GiB.  This is different from the pre-f6fca391
              behavior, where num_bytes is 1 GiB, and find_free_dev_extent
              may choose a different hole.
      
              5.  gather_device_info repeats step 4 with all devices to find
              the first or largest dev_extent hole that can be allocated on
              each device.
      
              6.  gather_device_info sorts the device list by the hole size
              on each device, using total unallocated space on each device to
              break ties, then returns to btrfs_create_chunk with the list.
      
              7.  btrfs_create_chunk calls decide_stripe_size_regular.
      
              8.  decide_stripe_size_regular finds the largest stripe_len that
              fits across the first nr_devs device dev_extent holes that were
              found by gather_device_info (and satisfies other constraints
              on stripe_len that are not relevant here).
      
              9.  decide_stripe_size_regular caps the length of the stripe it
              computed at 1 GiB.  This cap appeared in 5da431b7 to correct
              one of the other regressions introduced in f6fca391.
      
              10.  btrfs_create_chunk creates a new chunk with the above
              computed size and number of devices.
      
      At step 4, gather_device_info() has found a location where stripe up to
      10 GiB in length could be allocated on several devices, and selected
      which devices should have a dev_extent allocated on them, but at step
      9, only 1 GiB of the space that was found on each device can be used.
      This mismatch causes new suboptimal chunk allocation cases that did not
      occur in pre-f6fca391 kernels.
      
      Consider a filesystem using raid1 profile with 3 devices.  After some
      balances, device 1 has 10x 1 GiB unallocated space, while devices 2
      and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of
      space, but distributed across different numbers of dev_extent holes.
      For visualization, let's ignore all the chunks that were allocated before
      this point, and focus on the remaining holes:
      
              Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated)
              Device 2:  [__________] (10 GiB contig unallocated)
              Device 3:  [__________] (10 GiB contig unallocated)
      
      Before f6fca391, the allocator would fill these optimally by
      allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3
      ([13]), or 2 and 3 ([23]):
      
              [after 0 chunk allocations]
              Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
              Device 2:  [__________] (10 GiB)
              Device 3:  [__________] (10 GiB)
      
              [after 1 chunk allocation]
              Device 1:  [12] [_] [_] [_] [_] [_] [_] [_] [_] [_]
              Device 2:  [12] [_________] (9 GiB)
              Device 3:  [__________] (10 GiB)
      
              [after 2 chunk allocations]
              Device 1:  [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
              Device 2:  [12] [_________] (9 GiB)
              Device 3:  [13] [_________] (9 GiB)
      
              [after 3 chunk allocations]
              Device 1:  [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB)
              Device 2:  [12] [12] [________] (8 GiB)
              Device 3:  [13] [_________] (9 GiB)
      
              [...]
      
              [after 12 chunk allocations]
              Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB)
              Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB)
              Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
      
              [after 13 chunk allocations]
              Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB)
              Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
              Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
      
              [after 14 chunk allocations]
              Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
              Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
              Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB)
      
              [after 15 chunk allocations]
              Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
              Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full)
              Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full)
      
      This allocates all of the space with no waste.  The sorting function used
      by gather_device_info considers free space holes above 1 GiB in length
      to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently
      long hole on each device, all the holes appear equal in the sort, and the
      comparison falls back to sorting devices by total free space.  This keeps
      usable space on each device equal so they can all be filled completely.
      
      After f6fca391, the allocator prefers the devices with larger holes
      over the devices with more free space, so it makes bad allocation choices:
      
              [after 1 chunk allocation]
              Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
              Device 2:  [23] [_________] (9 GiB)
              Device 3:  [23] [_________] (9 GiB)
      
              [after 2 chunk allocations]
              Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
              Device 2:  [23] [23] [________] (8 GiB)
              Device 3:  [23] [23] [________] (8 GiB)
      
              [after 3 chunk allocations]
              Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
              Device 2:  [23] [23] [23] [_______] (7 GiB)
              Device 3:  [23] [23] [23] [_______] (7 GiB)
      
              [...]
      
              [after 9 chunk allocations]
              Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
              Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
              Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
      
              [after 10 chunk allocations]
              Device 1:  [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB)
              Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
              Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
      
              [after 11 chunk allocations]
              Device 1:  [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
              Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
              Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [13] (full)
      
      No further allocations are possible, with 8 GiB wasted (4 GiB of data
      space).  The sort in gather_device_info now considers free space in
      holes longer than 1 GiB to be distinct, so it will prefer devices 2 and
      3 over device 1 until all but 1 GiB is allocated on devices 2 and 3.
      At that point, with only 1 GiB unallocated on every device, the largest
      hole length on each device is equal at 1 GiB, so the sort finally moves
      to ordering the devices with the most free space, but by this time it
      is too late to make use of the free space on device 1.
      
      Note that it's possible to contrive a case where the pre-f6fca391
      allocator fails the same way, but these cases generally have extensive
      dev_extent fragmentation as a precondition (e.g. many holes of 768M
      in length on one device, and few holes 1 GiB in length on the others).
      With the regression in f6fca391, bad chunk allocation can occur even
      under optimal conditions, when all dev_extent holes are exact multiples
      of stripe_len in length, as in the example above.
      
      Also note that post-f6fca391 kernels do treat dev_extent holes
      larger than 10 GiB as equal, so the bad behavior won't show up on a
      freshly formatted filesystem; however, as the filesystem ages and fills
      up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem
      can show up as a failure to balance after adding or removing devices,
      or an unexpected shortfall in available space due to unequal allocation.
      
      To fix the regression and make data chunk allocation work
      again, set ctl->max_stripe_len back to the original SZ_1G, or
      space_info->chunk_size if that's smaller (the latter can happen if the
      user set space_info->chunk_size to less than 1 GiB via sysfs, or it's
      a 32 MiB system chunk with a hardcoded chunk_size and stripe_len).
      
      While researching the background of the earlier commits, I found that an
      identical fix was already proposed at:
      
        https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/
      
      The previous review missed one detail:  ctl->max_stripe_len is used
      before decide_stripe_size_regular() is called, when it is too late for
      the changes in that function to have any effect.  ctl->max_stripe_len is
      not used directly by decide_stripe_size_regular(), but the parameter
      does heavily influence the per-device free space data presented to
      the function.
      
      Fixes: f6fca391 ("btrfs: store chunk size in space-info struct")
      CC: stable@vger.kernel.org # 6.1+
      Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8a540e99
  3. 11 Oct, 2023 1 commit
  4. 10 Oct, 2023 1 commit
    • David Sterba's avatar
      Revert "btrfs: reject unknown mount options early" · 54f67dec
      David Sterba authored
      This reverts commit 5f521494.
      
      The patch breaks mounts with security mount options like
      
        $ mount -o context=system_u:object_r:root_t:s0 /dev/sdX /mn
        mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sdX, missing codepage or helper program, ...
      
      We cannot reject all unknown options in btrfs_parse_subvol_options() as
      intended, the security options can be present at this point and it's not
      possible to enumerate them in a future proof way. This means unknown
      mount options are silently accepted like before when the filesystem is
      mounted with either -o subvol=/path or as followup mounts of the same
      device.
      
      Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54f67dec
  5. 03 Oct, 2023 6 commits
    • Filipe Manana's avatar
      btrfs: error out when reallocating block for defrag using a stale transaction · e36f9491
      Filipe Manana authored
      At btrfs_realloc_node() we have these checks to verify we are not using a
      stale transaction (a past transaction with an unblocked state or higher),
      and the only thing we do is to trigger two WARN_ON(). This however is a
      critical problem, highly unexpected and if it happens it's most likely due
      to a bug, so we should error out and turn the fs into error state so that
      such issue is much more easily noticed if it's triggered.
      
      The problem is critical because in btrfs_realloc_node() we COW tree blocks,
      and using such stale transaction will lead to not persisting the extent
      buffers used for the COW operations, as allocating tree block adds the
      range of the respective extent buffers to the ->dirty_pages iotree of the
      transaction, and a stale transaction, in the unlocked state or higher,
      will not flush dirty extent buffers anymore, therefore resulting in not
      persisting the tree block and resource leaks (not cleaning the dirty_pages
      iotree for example).
      
      So do the following changes:
      
      1) Return -EUCLEAN if we find a stale transaction;
      
      2) Turn the fs into error state, with error -EUCLEAN, so that no
         transaction can be committed, and generate a stack trace;
      
      3) Combine both conditions into a single if statement, as both are related
         and have the same error message;
      
      4) Mark the check as unlikely, since this is not expected to ever happen.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e36f9491
    • Filipe Manana's avatar
      btrfs: error when COWing block from a root that is being deleted · a2caab29
      Filipe Manana authored
      At btrfs_cow_block() we check if the block being COWed belongs to a root
      that is being deleted and if so we log an error message. However this is
      an unexpected case and it indicates a bug somewhere, so we should return
      an error and abort the transaction. So change this in the following ways:
      
      1) Abort the transaction with -EUCLEAN, so that if the issue ever happens
         it can easily be noticed;
      
      2) Change the logged message level from error to critical, and change the
         message itself to print the block's logical address and the ID of the
         root;
      
      3) Return -EUCLEAN to the caller;
      
      4) As this is an unexpected scenario, that should never happen, mark the
         check as unlikely, allowing the compiler to potentially generate better
         code.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a2caab29
    • Filipe Manana's avatar
      btrfs: error out when COWing block using a stale transaction · 48774f3b
      Filipe Manana authored
      At btrfs_cow_block() we have these checks to verify we are not using a
      stale transaction (a past transaction with an unblocked state or higher),
      and the only thing we do is to trigger a WARN with a message and a stack
      trace. This however is a critical problem, highly unexpected and if it
      happens it's most likely due to a bug, so we should error out and turn the
      fs into error state so that such issue is much more easily noticed if it's
      triggered.
      
      The problem is critical because using such stale transaction will lead to
      not persisting the extent buffer used for the COW operation, as allocating
      a tree block adds the range of the respective extent buffer to the
      ->dirty_pages iotree of the transaction, and a stale transaction, in the
      unlocked state or higher, will not flush dirty extent buffers anymore,
      therefore resulting in not persisting the tree block and resource leaks
      (not cleaning the dirty_pages iotree for example).
      
      So do the following changes:
      
      1) Return -EUCLEAN if we find a stale transaction;
      
      2) Turn the fs into error state, with error -EUCLEAN, so that no
         transaction can be committed, and generate a stack trace;
      
      3) Combine both conditions into a single if statement, as both are related
         and have the same error message;
      
      4) Mark the check as unlikely, since this is not expected to ever happen.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      48774f3b
    • Filipe Manana's avatar
      btrfs: always print transaction aborted messages with an error level · f8d1b011
      Filipe Manana authored
      Commit b7af0635 ("btrfs: print transaction aborted messages with an
      error level") changed the log level of transaction aborted messages from
      a debug level to an error level, so that such messages are always visible
      even on production systems where the log level is normally above the debug
      level (and also on some syzbot reports).
      
      Later, commit fccf0c84 ("btrfs: move btrfs_abort_transaction to
      transaction.c") changed the log level back to debug level when the error
      number for a transaction abort should not have a stack trace printed.
      This happened for absolutely no reason. It's always useful to print
      transaction abort messages with an error level, regardless of whether
      the error number should cause a stack trace or not.
      
      So change back the log level to error level.
      
      Fixes: fccf0c84 ("btrfs: move btrfs_abort_transaction to transaction.c")
      CC: stable@vger.kernel.org # 6.5+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f8d1b011
    • Qu Wenruo's avatar
      btrfs: reject unknown mount options early · 5f521494
      Qu Wenruo authored
      [BUG]
      The following script would allow invalid mount options to be specified
      (although such invalid options would just be ignored):
      
        # mkfs.btrfs -f $dev
        # mount $dev $mnt1		<<< Successful mount expected
        # mount $dev $mnt2 -o junk	<<< Failed mount expected
        # echo $?
        0
      
      [CAUSE]
      For the 2nd mount, since the fs is already mounted, we won't go through
      open_ctree() thus no btrfs_parse_options(), but only through
      btrfs_parse_subvol_options().
      
      However we do not treat unrecognized options from valid but irrelevant
      options, thus those invalid options would just be ignored by
      btrfs_parse_subvol_options().
      
      [FIX]
      Add the handling for Opt_err to handle invalid options and error out,
      while still ignore other valid options inside btrfs_parse_subvol_options().
      Reported-by: default avatarAnand Jain <anand.jain@oracle.com>
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5f521494
    • Josef Bacik's avatar
      btrfs: fix some -Wmaybe-uninitialized warnings in ioctl.c · 9147b9de
      Josef Bacik authored
      Jens reported the following warnings from -Wmaybe-uninitialized recent
      Linus' branch.
      
        In file included from ./include/asm-generic/rwonce.h:26,
      		   from ./arch/arm64/include/asm/rwonce.h:71,
      		   from ./include/linux/compiler.h:246,
      		   from ./include/linux/export.h:5,
      		   from ./include/linux/linkage.h:7,
      		   from ./include/linux/kernel.h:17,
      		   from fs/btrfs/ioctl.c:6:
        In function ‘instrument_copy_from_user_before’,
            inlined from ‘_copy_from_user’ at ./include/linux/uaccess.h:148:3,
            inlined from ‘copy_from_user’ at ./include/linux/uaccess.h:183:7,
            inlined from ‘btrfs_ioctl_space_info’ at fs/btrfs/ioctl.c:2999:6,
            inlined from ‘btrfs_ioctl’ at fs/btrfs/ioctl.c:4616:10:
        ./include/linux/kasan-checks.h:38:27: warning: ‘space_args’ may be used
        uninitialized [-Wmaybe-uninitialized]
           38 | #define kasan_check_write __kasan_check_write
        ./include/linux/instrumented.h:129:9: note: in expansion of macro
        ‘kasan_check_write’
          129 |         kasan_check_write(to, n);
      	|         ^~~~~~~~~~~~~~~~~
        ./include/linux/kasan-checks.h: In function ‘btrfs_ioctl’:
        ./include/linux/kasan-checks.h:20:6: note: by argument 1 of type ‘const
        volatile void *’ to ‘__kasan_check_write’ declared here
           20 | bool __kasan_check_write(const volatile void *p, unsigned int
      	size);
      	|      ^~~~~~~~~~~~~~~~~~~
        fs/btrfs/ioctl.c:2981:39: note: ‘space_args’ declared here
         2981 |         struct btrfs_ioctl_space_args space_args;
      	|                                       ^~~~~~~~~~
        In function ‘instrument_copy_from_user_before’,
            inlined from ‘_copy_from_user’ at ./include/linux/uaccess.h:148:3,
            inlined from ‘copy_from_user’ at ./include/linux/uaccess.h:183:7,
            inlined from ‘_btrfs_ioctl_send’ at fs/btrfs/ioctl.c:4343:9,
            inlined from ‘btrfs_ioctl’ at fs/btrfs/ioctl.c:4658:10:
        ./include/linux/kasan-checks.h:38:27: warning: ‘args32’ may be used
        uninitialized [-Wmaybe-uninitialized]
           38 | #define kasan_check_write __kasan_check_write
        ./include/linux/instrumented.h:129:9: note: in expansion of macro
        ‘kasan_check_write’
          129 |         kasan_check_write(to, n);
      	|         ^~~~~~~~~~~~~~~~~
        ./include/linux/kasan-checks.h: In function ‘btrfs_ioctl’:
        ./include/linux/kasan-checks.h:20:6: note: by argument 1 of type ‘const
        volatile void *’ to ‘__kasan_check_write’ declared here
           20 | bool __kasan_check_write(const volatile void *p, unsigned int
      	size);
      	|      ^~~~~~~~~~~~~~~~~~~
        fs/btrfs/ioctl.c:4341:49: note: ‘args32’ declared here
         4341 |                 struct btrfs_ioctl_send_args_32 args32;
      	|                                                 ^~~~~~
      
      This was due to his config options and having KASAN turned on,
      which adds some extra checks around copy_from_user(), which then
      triggered the -Wmaybe-uninitialized checker for these cases.
      
      Fix the warnings by initializing the different structs we're copying
      into.
      Reported-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9147b9de
  6. 21 Sep, 2023 2 commits
    • Josef Bacik's avatar
      btrfs: initialize start_slot in btrfs_log_prealloc_extents · b4c639f6
      Josef Bacik authored
      Jens reported a compiler warning when using
      CONFIG_CC_OPTIMIZE_FOR_SIZE=y that looks like this
      
        fs/btrfs/tree-log.c: In function ‘btrfs_log_prealloc_extents’:
        fs/btrfs/tree-log.c:4828:23: warning: ‘start_slot’ may be used
        uninitialized [-Wmaybe-uninitialized]
         4828 |                 ret = copy_items(trans, inode, dst_path, path,
      	|                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         4829 |                                  start_slot, ins_nr, 1, 0);
      	|                                  ~~~~~~~~~~~~~~~~~~~~~~~~~
        fs/btrfs/tree-log.c:4725:13: note: ‘start_slot’ was declared here
         4725 |         int start_slot;
      	|             ^~~~~~~~~~
      
      The compiler is incorrect, as we only use this code when ins_len > 0,
      and when ins_len > 0 we have start_slot properly initialized.  However
      we generally find the -Wmaybe-uninitialized warnings valuable, so
      initialize start_slot to get rid of the warning.
      Reported-by: default avatarJens Axboe <axboe@kernel.dk>
      Tested-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b4c639f6
    • Josef Bacik's avatar
      btrfs: make sure to initialize start and len in find_free_dev_extent · 20218dfb
      Josef Bacik authored
      Jens reported a compiler error when using CONFIG_CC_OPTIMIZE_FOR_SIZE=y
      that looks like this
      
        In function ‘gather_device_info’,
            inlined from ‘btrfs_create_chunk’ at fs/btrfs/volumes.c:5507:8:
        fs/btrfs/volumes.c:5245:48: warning: ‘dev_offset’ may be used uninitialized [-Wmaybe-uninitialized]
         5245 |                 devices_info[ndevs].dev_offset = dev_offset;
      	|                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
        fs/btrfs/volumes.c: In function ‘btrfs_create_chunk’:
        fs/btrfs/volumes.c:5196:13: note: ‘dev_offset’ was declared here
         5196 |         u64 dev_offset;
      
      This occurs because find_free_dev_extent is responsible for setting
      dev_offset, however if we get an -ENOMEM at the top of the function
      we'll return without setting the value.
      
      This isn't actually a problem because we will see the -ENOMEM in
      gather_device_info() and return and not use the uninitialized value,
      however we also just don't want the compiler warning so rework the code
      slightly in find_free_dev_extent() to make sure it's always setting
      *start and *len to avoid the compiler warning.
      Reported-by: default avatarJens Axboe <axboe@kernel.dk>
      Tested-by: default avatarJens Axboe <axboe@kernel.dk>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      20218dfb
  7. 20 Sep, 2023 7 commits
    • Qu Wenruo's avatar
      btrfs: reset destination buffer when read_extent_buffer() gets invalid range · 74ee7914
      Qu Wenruo authored
      Commit f98b6215 ("btrfs: extent_io: do extra check for extent buffer
      read write functions") changed how we handle invalid extent buffer range
      for read_extent_buffer().
      
      Previously if the range is invalid we just set the destination to zero,
      but after the patch we do nothing and error out.
      
      This can lead to smatch static checker errors like:
      
        fs/btrfs/print-tree.c:186 print_uuid_item() error: uninitialized symbol 'subvol_id'.
        fs/btrfs/tests/extent-io-tests.c:338 check_eb_bitmap() error: uninitialized symbol 'has'.
        fs/btrfs/tests/extent-io-tests.c:353 check_eb_bitmap() error: uninitialized symbol 'has'.
        fs/btrfs/uuid-tree.c:203 btrfs_uuid_tree_remove() error: uninitialized symbol 'read_subid'.
        fs/btrfs/uuid-tree.c:353 btrfs_uuid_tree_iterate() error: uninitialized symbol 'subid_le'.
        fs/btrfs/uuid-tree.c:72 btrfs_uuid_tree_lookup() error: uninitialized symbol 'data'.
        fs/btrfs/volumes.c:7415 btrfs_dev_stats_value() error: uninitialized symbol 'val'.
      
      Fix those warnings by reverting back to the old memset() behavior.
      By this we keep the static checker happy and would still make a lot of
      noise when such invalid ranges are passed in.
      Reported-by: default avatarDan Carpenter <dan.carpenter@linaro.org>
      Fixes: f98b6215 ("btrfs: extent_io: do extra check for extent buffer read write functions")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      74ee7914
    • Josef Bacik's avatar
      btrfs: properly report 0 avail for very full file systems · 58bfe2cc
      Josef Bacik authored
      A user reported some issues with smaller file systems that get very
      full.  While investigating this issue I noticed that df wasn't showing
      100% full, despite having 0 chunk space and having < 1MiB of available
      metadata space.
      
      This turns out to be an overflow issue, we're doing:
      
        total_available_metadata_space - SZ_4M < global_block_rsv_size
      
      to determine if there's not enough space to make metadata allocations,
      which overflows if total_available_metadata_space is < 4M.  Fix this by
      checking to see if our available space is greater than the 4M threshold.
      This makes df properly report 100% usage on the file system.
      
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      58bfe2cc
    • Filipe Manana's avatar
      btrfs: log message if extent item not found when running delayed extent op · 8ec0a4a5
      Filipe Manana authored
      When running a delayed extent operation, if we don't find the extent item
      in the extent tree we just return -EIO without any logged message. This
      indicates some bug or possibly a memory or fs corruption, so the return
      value should not be -EIO but -EUCLEAN instead, and since it's not expected
      to ever happen, print an informative error message so that if it happens
      we have some idea of what went wrong, where to look at.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8ec0a4a5
    • Filipe Manana's avatar
      btrfs: remove redundant BUG_ON() from __btrfs_inc_extent_ref() · d2f79e63
      Filipe Manana authored
      At __btrfs_inc_extent_ref() we are doing a BUG_ON() if we are dealing with
      a tree block reference that has a reference count that is different from 1,
      but we have already dealt with this case at run_delayed_tree_ref(), making
      it useless. So remove the BUG_ON().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d2f79e63
    • Filipe Manana's avatar
      btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1 · 1bf76df3
      Filipe Manana authored
      When running a delayed tree reference, if we find a ref count different
      from 1, we return -EIO. This isn't an IO error, as it indicates either a
      bug in the delayed refs code or a memory corruption, so change the error
      code from -EIO to -EUCLEAN. Also tag the branch as 'unlikely' as this is
      not expected to ever happen, and change the error message to print the
      tree block's bytenr without the parenthesis (and there was a missing space
      between the 'block' word and the opening parenthesis), for consistency as
      that's the style we used everywhere else.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1bf76df3
    • Filipe Manana's avatar
      btrfs: prevent transaction block reserve underflow when starting transaction · a7ddeeb0
      Filipe Manana authored
      When starting a transaction, with a non-zero number of items, we reserve
      metadata space for that number of items and for delayed refs by doing a
      call to btrfs_block_rsv_add(), with the transaction block reserve passed
      as the block reserve argument. This reserves metadata space and adds it
      to the transaction block reserve. Later we migrate the space we reserved
      for delayed references from the transaction block reserve into the delayed
      refs block reserve, by calling btrfs_migrate_to_delayed_refs_rsv().
      
      btrfs_migrate_to_delayed_refs_rsv() decrements the number of bytes to
      migrate from the source block reserve, and this however may result in an
      underflow in case the space added to the transaction block reserve ended
      up being used by another task that has not reserved enough space for its
      own use - examples are tasks doing reflinks or hole punching because they
      end up calling btrfs_replace_file_extents() -> btrfs_drop_extents() and
      may need to modify/COW a variable number of leaves/paths, so they keep
      trying to use space from the transaction block reserve when they need to
      COW an extent buffer, and may end up trying to use more space then they
      have reserved (1 unit/path only for removing file extent items).
      
      This can be avoided by simply reserving space first without adding it to
      the transaction block reserve, then add the space for delayed refs to the
      delayed refs block reserve and finally add the remaining reserved space
      to the transaction block reserve. This also makes the code a bit shorter
      and simpler. So just do that.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a7ddeeb0
    • Filipe Manana's avatar
      btrfs: fix race when refilling delayed refs block reserve · 2ed45c0f
      Filipe Manana authored
      If we have two (or more) tasks attempting to refill the delayed refs block
      reserve we can end up with the delayed block reserve being over reserved,
      that is, with a reserved space greater than its size. If this happens, we
      are holding to more reserved space than necessary for a while.
      
      The race happens like this:
      
      1) The delayed refs block reserve has a size of 8M and a reserved space of
         6M for example;
      
      2) Task A calls btrfs_delayed_refs_rsv_refill();
      
      3) Task B also calls btrfs_delayed_refs_rsv_refill();
      
      4) Task A sees there's a 2M difference between the size and the reserved
         space of the delayed refs rsv, so it will reserve 2M of space by
         calling btrfs_reserve_metadata_bytes();
      
      5) Task B also sees that 2M difference, and like task A, it reserves
         another 2M of metadata space;
      
      6) Both task A and task B increase the reserved space of block reserve
         by 2M, by calling btrfs_block_rsv_add_bytes(), so the block reserve
         ends up with a size of 8M and a reserved space of 10M;
      
      7) The extra, over reserved space will eventually be freed by some task
         calling btrfs_delayed_refs_rsv_release() -> btrfs_block_rsv_release()
         -> block_rsv_release_bytes(), as there we will detect the over reserve
         and release that space.
      
      So fix this by checking if we still need to add space to the delayed refs
      block reserve after reserving the metadata space, and if we don't, just
      release that space immediately.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2ed45c0f
  8. 14 Sep, 2023 3 commits
    • Filipe Manana's avatar
      btrfs: fix race between reading a directory and adding entries to it · 8e7f82de
      Filipe Manana authored
      When opening a directory (opendir(3)) or rewinding it (rewinddir(3)), we
      are not holding the directory's inode locked, and this can result in later
      attempting to add two entries to the directory with the same index number,
      resulting in a transaction abort, with -EEXIST (-17), when inserting the
      second delayed dir index. This results in a trace like the following:
      
        Sep 11 22:34:59 myhostname kernel: BTRFS error (device dm-3): err add delayed dir index item(name: cockroach-stderr.log) into the insertion tree of the delayed node(root id: 5, inode id: 4539217, errno: -17)
        Sep 11 22:34:59 myhostname kernel: ------------[ cut here ]------------
        Sep 11 22:34:59 myhostname kernel: kernel BUG at fs/btrfs/delayed-inode.c:1504!
        Sep 11 22:34:59 myhostname kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        Sep 11 22:34:59 myhostname kernel: CPU: 0 PID: 7159 Comm: cockroach Not tainted 6.4.15-200.fc38.x86_64 #1
        Sep 11 22:34:59 myhostname kernel: Hardware name: ASUS ESC500 G3/P9D WS, BIOS 2402 06/27/2018
        Sep 11 22:34:59 myhostname kernel: RIP: 0010:btrfs_insert_delayed_dir_index+0x1da/0x260
        Sep 11 22:34:59 myhostname kernel: Code: eb dd 48 (...)
        Sep 11 22:34:59 myhostname kernel: RSP: 0000:ffffa9980e0fbb28 EFLAGS: 00010282
        Sep 11 22:34:59 myhostname kernel: RAX: 0000000000000000 RBX: ffff8b10b8f4a3c0 RCX: 0000000000000000
        Sep 11 22:34:59 myhostname kernel: RDX: 0000000000000000 RSI: ffff8b177ec21540 RDI: ffff8b177ec21540
        Sep 11 22:34:59 myhostname kernel: RBP: ffff8b110cf80888 R08: 0000000000000000 R09: ffffa9980e0fb938
        Sep 11 22:34:59 myhostname kernel: R10: 0000000000000003 R11: ffffffff86146508 R12: 0000000000000014
        Sep 11 22:34:59 myhostname kernel: R13: ffff8b1131ae5b40 R14: ffff8b10b8f4a418 R15: 00000000ffffffef
        Sep 11 22:34:59 myhostname kernel: FS:  00007fb14a7fe6c0(0000) GS:ffff8b177ec00000(0000) knlGS:0000000000000000
        Sep 11 22:34:59 myhostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        Sep 11 22:34:59 myhostname kernel: CR2: 000000c00143d000 CR3: 00000001b3b4e002 CR4: 00000000001706f0
        Sep 11 22:34:59 myhostname kernel: Call Trace:
        Sep 11 22:34:59 myhostname kernel:  <TASK>
        Sep 11 22:34:59 myhostname kernel:  ? die+0x36/0x90
        Sep 11 22:34:59 myhostname kernel:  ? do_trap+0xda/0x100
        Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
        Sep 11 22:34:59 myhostname kernel:  ? do_error_trap+0x6a/0x90
        Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
        Sep 11 22:34:59 myhostname kernel:  ? exc_invalid_op+0x50/0x70
        Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
        Sep 11 22:34:59 myhostname kernel:  ? asm_exc_invalid_op+0x1a/0x20
        Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
        Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
        Sep 11 22:34:59 myhostname kernel:  btrfs_insert_dir_item+0x200/0x280
        Sep 11 22:34:59 myhostname kernel:  btrfs_add_link+0xab/0x4f0
        Sep 11 22:34:59 myhostname kernel:  ? ktime_get_real_ts64+0x47/0xe0
        Sep 11 22:34:59 myhostname kernel:  btrfs_create_new_inode+0x7cd/0xa80
        Sep 11 22:34:59 myhostname kernel:  btrfs_symlink+0x190/0x4d0
        Sep 11 22:34:59 myhostname kernel:  ? schedule+0x5e/0xd0
        Sep 11 22:34:59 myhostname kernel:  ? __d_lookup+0x7e/0xc0
        Sep 11 22:34:59 myhostname kernel:  vfs_symlink+0x148/0x1e0
        Sep 11 22:34:59 myhostname kernel:  do_symlinkat+0x130/0x140
        Sep 11 22:34:59 myhostname kernel:  __x64_sys_symlinkat+0x3d/0x50
        Sep 11 22:34:59 myhostname kernel:  do_syscall_64+0x5d/0x90
        Sep 11 22:34:59 myhostname kernel:  ? syscall_exit_to_user_mode+0x2b/0x40
        Sep 11 22:34:59 myhostname kernel:  ? do_syscall_64+0x6c/0x90
        Sep 11 22:34:59 myhostname kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      The race leading to the problem happens like this:
      
      1) Directory inode X is loaded into memory, its ->index_cnt field is
         initialized to (u64)-1 (at btrfs_alloc_inode());
      
      2) Task A is adding a new file to directory X, holding its vfs inode lock,
         and calls btrfs_set_inode_index() to get an index number for the entry.
      
         Because the inode's index_cnt field is set to (u64)-1 it calls
         btrfs_inode_delayed_dir_index_count() which fails because no dir index
         entries were added yet to the delayed inode and then it calls
         btrfs_set_inode_index_count(). This functions finds the last dir index
         key and then sets index_cnt to that index value + 1. It found that the
         last index key has an offset of 100. However before it assigns a value
         of 101 to index_cnt...
      
      3) Task B calls opendir(3), ending up at btrfs_opendir(), where the VFS
         lock for inode X is not taken, so it calls btrfs_get_dir_last_index()
         and sees index_cnt still with a value of (u64)-1. Because of that it
         calls btrfs_inode_delayed_dir_index_count() which fails since no dir
         index entries were added to the delayed inode yet, and then it also
         calls btrfs_set_inode_index_count(). This also finds that the last
         index key has an offset of 100, and before it assigns the value 101
         to the index_cnt field of inode X...
      
      4) Task A assigns a value of 101 to index_cnt. And then the code flow
         goes to btrfs_set_inode_index() where it increments index_cnt from
         101 to 102. Task A then creates a delayed dir index entry with a
         sequence number of 101 and adds it to the delayed inode;
      
      5) Task B assigns 101 to the index_cnt field of inode X;
      
      6) At some later point when someone tries to add a new entry to the
         directory, btrfs_set_inode_index() will return 101 again and shortly
         after an attempt to add another delayed dir index key with index
         number 101 will fail with -EEXIST resulting in a transaction abort.
      
      Fix this by locking the inode at btrfs_get_dir_last_index(), which is only
      only used when opening a directory or attempting to lseek on it.
      Reported-by: default avatarken <ken@bllue.org>
      Link: https://lore.kernel.org/linux-btrfs/CAE6xmH+Lp=Q=E61bU+v9eWX8gYfLvu6jLYxjxjFpo3zHVPR0EQ@mail.gmail.com/
      Reported-by: syzbot+d13490c82ad5353c779d@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
      Fixes: 9b378f6a ("btrfs: fix infinite directory reads")
      CC: stable@vger.kernel.org # 6.5+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8e7f82de
    • Filipe Manana's avatar
      btrfs: refresh dir last index during a rewinddir(3) call · e60aa5da
      Filipe Manana authored
      When opening a directory we find what's the index of its last entry and
      then store it in the directory's file handle private data (struct
      btrfs_file_private::last_index), so that in the case new directory entries
      are added to a directory after an opendir(3) call we don't end up in an
      infinite loop (see commit 9b378f6a ("btrfs: fix infinite directory
      reads")) when calling readdir(3).
      
      However once rewinddir(3) is called, POSIX states [1] that any new
      directory entries added after the previous opendir(3) call, must be
      returned by subsequent calls to readdir(3):
      
        "The rewinddir() function shall reset the position of the directory
         stream to which dirp refers to the beginning of the directory.
         It shall also cause the directory stream to refer to the current
         state of the corresponding directory, as a call to opendir() would
         have done."
      
      We currently don't refresh the last_index field of the struct
      btrfs_file_private associated to the directory, so after a rewinddir(3)
      we are not returning any new entries added after the opendir(3) call.
      
      Fix this by finding the current last index of the directory when llseek
      is called against the directory.
      
      This can be reproduced by the following C program provided by Ian Johnson:
      
         #include <dirent.h>
         #include <stdio.h>
      
         int main(void) {
           DIR *dir = opendir("test");
      
           FILE *file;
           file = fopen("test/1", "w");
           fwrite("1", 1, 1, file);
           fclose(file);
      
           file = fopen("test/2", "w");
           fwrite("2", 1, 1, file);
           fclose(file);
      
           rewinddir(dir);
      
           struct dirent *entry;
           while ((entry = readdir(dir))) {
              printf("%s\n", entry->d_name);
           }
           closedir(dir);
           return 0;
         }
      Reported-by: default avatarIan Johnson <ian@ianjohnson.dev>
      Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
      Fixes: 9b378f6a ("btrfs: fix infinite directory reads")
      CC: stable@vger.kernel.org # 6.5+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e60aa5da
    • Filipe Manana's avatar
      btrfs: set last dir index to the current last index when opening dir · 35795036
      Filipe Manana authored
      When opening a directory for reading it, we set the last index where we
      stop iteration to the value in struct btrfs_inode::index_cnt. That value
      does not match the index of the most recently added directory entry but
      it's instead the index number that will be assigned the next directory
      entry.
      
      This means that if after the call to opendir(3) new directory entries are
      added, a readdir(3) call will return the first new directory entry. This
      is fine because POSIX says the following [1]:
      
        "If a file is removed from or added to the directory after the most
         recent call to opendir() or rewinddir(), whether a subsequent call to
         readdir() returns an entry for that file is unspecified."
      
      For example for the test script from commit 9b378f6a ("btrfs: fix
      infinite directory reads"), where we have 2000 files in a directory, ext4
      doesn't return any new directory entry after opendir(3), while xfs returns
      the first 13 new directory entries added after the opendir(3) call.
      
      If we move to a shorter example with an empty directory when opendir(3) is
      called, and 2 files added to the directory after the opendir(3) call, then
      readdir(3) on btrfs will return the first file, ext4 and xfs return the 2
      files (but in a different order). A test program for this, reported by
      Ian Johnson, is the following:
      
         #include <dirent.h>
         #include <stdio.h>
      
         int main(void) {
           DIR *dir = opendir("test");
      
           FILE *file;
           file = fopen("test/1", "w");
           fwrite("1", 1, 1, file);
           fclose(file);
      
           file = fopen("test/2", "w");
           fwrite("2", 1, 1, file);
           fclose(file);
      
           struct dirent *entry;
           while ((entry = readdir(dir))) {
              printf("%s\n", entry->d_name);
           }
           closedir(dir);
           return 0;
         }
      
      To make this less odd, change the behaviour to never return new entries
      that were added after the opendir(3) call. This is done by setting the
      last_index field of the struct btrfs_file_private attached to the
      directory's file handle with a value matching btrfs_inode::index_cnt
      minus 1, since that value always matches the index of the next new
      directory entry and not the index of the most recently added entry.
      
      [1] https://pubs.opengroup.org/onlinepubs/007904875/functions/readdir_r.html
      
      Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
      CC: stable@vger.kernel.org # 6.5+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      35795036
  9. 13 Sep, 2023 3 commits
    • Josef Bacik's avatar
      btrfs: don't clear uptodate on write errors · b595d259
      Josef Bacik authored
      We have been consistently seeing hangs with generic/648 in our subpage
      GitHub CI setup.  This is a classic deadlock, we are calling
      btrfs_read_folio() on a folio, which requires holding the folio lock on
      the folio, and then finding a ordered extent that overlaps that range
      and calling btrfs_start_ordered_extent(), which then tries to write out
      the dirty page, which requires taking the folio lock and then we
      deadlock.
      
      The hang happens because we're writing to range [1271750656, 1271767040),
      page index [77621, 77622], and page 77621 is !Uptodate.  It is also Dirty,
      so we call btrfs_read_folio() for 77621 and which does
      btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered
      extent which is [1271644160, 1271746560), page index [77615, 77621].
      The page indexes overlap, but the actual bytes don't overlap.  We're
      holding the page lock for 77621, then call
      btrfs_lock_and_flush_ordered_range() which tries to flush the dirty
      page, and tries to lock 77621 again and then we deadlock.
      
      The byte ranges do not overlap, but with subpage support if we clear
      uptodate on any portion of the page we mark the entire thing as not
      uptodate.
      
      We have been clearing page uptodate on write errors, but no other file
      system does this, and is in fact incorrect.  This doesn't hurt us in the
      !subpage case because we can't end up with overlapped ranges that don't
      also overlap on the page.
      
      Fix this by not clearing uptodate when we have a write error.  The only
      thing we should be doing in this case is setting the mapping error and
      carrying on.  This makes it so we would no longer call
      btrfs_read_folio() on the page as it's uptodate and eliminates the
      deadlock.
      
      With this patch we're now able to make it through a full fstests run on
      our subpage blocksize VMs.
      
      Note for stable backports: this probably goes beyond 6.1 but the code
      has been cleaned up and clearing the uptodate bit must be verified on
      each version independently.
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b595d259
    • Bernd Schubert's avatar
      btrfs: file_remove_privs needs an exclusive lock in direct io write · 9af86694
      Bernd Schubert authored
      This was noticed by Miklos that file_remove_privs might call into
      notify_change(), which requires to hold an exclusive lock. The problem
      exists in FUSE and btrfs. We can fix it without any additional helpers
      from VFS, in case the privileges would need to be dropped, change the
      lock type to be exclusive and redo the loop.
      
      Fixes: e9adabb9 ("btrfs: use shared lock for direct writes within EOF")
      CC: Miklos Szeredi <miklos@szeredi.hu>
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarBernd Schubert <bschubert@ddn.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9af86694
    • Matthew Wilcox (Oracle)'s avatar
      btrfs: convert btrfs_read_merkle_tree_page() to use a folio · 06ed0935
      Matthew Wilcox (Oracle) authored
      Remove a number of hidden calls to compound_head() by using a folio
      throughout.  Also follow core kernel coding style by adding the folio to
      the page cache immediately after allocation instead of doing the read
      first, then adding it to the page cache.  This ordering makes subsequent
      readers block waiting for the first reader instead of duplicating the
      work only to throw it away when they find out they lost the race.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      06ed0935
  10. 08 Sep, 2023 10 commits
    • Bhaskar Chowdhury's avatar
      MAINTAINERS: remove links to obsolete btrfs.wiki.kernel.org · 5facccc9
      Bhaskar Chowdhury authored
      The wiki has been archived and is not updated anymore. Remove or replace
      the links in files that contain it (MAINTAINERS, Kconfig, docs).
      Signed-off-by: default avatarBhaskar Chowdhury <unixbhaskar@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5facccc9
    • Filipe Manana's avatar
      btrfs: assert delayed node locked when removing delayed item · a57c2d4e
      Filipe Manana authored
      When removing a delayed item, or releasing which will remove it as well,
      we will modify one of the delayed node's rbtrees and item counter if the
      delayed item is in one of the rbtrees. This require having the delayed
      node's mutex locked, otherwise we will race with other tasks modifying
      the rbtrees and the counter.
      
      This is motivated by a previous version of another patch actually calling
      btrfs_release_delayed_item() after unlocking the delayed node's mutex and
      against a delayed item that is in a rbtree.
      
      So assert at __btrfs_remove_delayed_item() that the delayed node's mutex
      is locked.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a57c2d4e
    • Filipe Manana's avatar
      btrfs: remove BUG() after failure to insert delayed dir index item · 2c58c393
      Filipe Manana authored
      Instead of calling BUG() when we fail to insert a delayed dir index item
      into the delayed node's tree, we can just release all the resources we
      have allocated/acquired before and return the error to the caller. This is
      fine because all existing call chains undo anything they have done before
      calling btrfs_insert_delayed_dir_index() or BUG_ON (when creating pending
      snapshots in the transaction commit path).
      
      So remove the BUG() call and do proper error handling.
      
      This relates to a syzbot report linked below, but does not fix it because
      it only prevents hitting a BUG(), it does not fix the issue where somehow
      we attempt to use twice the same index number for different index items.
      
      Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2c58c393
    • Filipe Manana's avatar
      btrfs: improve error message after failure to add delayed dir index item · 91bfe310
      Filipe Manana authored
      If we fail to add a delayed dir index item because there's already another
      item with the same index number, we print an error message (and then BUG).
      However that message isn't very helpful to debug anything because we don't
      know what's the index number and what are the values of index counters in
      the inode and its delayed inode (index_cnt fields of struct btrfs_inode
      and struct btrfs_delayed_node).
      
      So update the error message to include the index number and counters.
      
      We actually had a recent case where this issue was hit by a syzbot report
      (see the link below).
      
      Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      91bfe310
    • Qu Wenruo's avatar
      btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio · 5e0e8799
      Qu Wenruo authored
      [BUG]
      After commit 72a69cd0 ("btrfs: subpage: pack all subpage bitmaps
      into a larger bitmap"), the DEBUG section of btree_dirty_folio() would
      no longer compile.
      
      [CAUSE]
      If DEBUG is defined, we would do extra checks for btree_dirty_folio(),
      mostly to make sure the range we marked dirty has an extent buffer and
      that extent buffer is dirty.
      
      For subpage, we need to iterate through all the extent buffers covered
      by that page range, and make sure they all matches the criteria.
      
      However commit 72a69cd0 ("btrfs: subpage: pack all subpage bitmaps
      into a larger bitmap") changes how we store the bitmap, we pack all the
      16 bits bitmaps into a larger bitmap, which would save some space.
      
      This means we no longer have btrfs_subpage::dirty_bitmap, instead the
      dirty bitmap is starting at btrfs_subpage_info::dirty_offset, and has a
      length of btrfs_subpage_info::bitmap_nr_bits.
      
      [FIX]
      Although I'm not sure if it still makes sense to maintain such code, at
      least let it compile.
      
      This patch would let us test the bits one by one through the bitmaps.
      
      CC: stable@vger.kernel.org # 6.1+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5e0e8799
    • Josef Bacik's avatar
      btrfs: check for BTRFS_FS_ERROR in pending ordered assert · 4ca8e03c
      Josef Bacik authored
      If we do fast tree logging we increment a counter on the current
      transaction for every ordered extent we need to wait for.  This means we
      expect the transaction to still be there when we clear pending on the
      ordered extent.  However if we happen to abort the transaction and clean
      it up, there could be no running transaction, and thus we'll trip the
      "ASSERT(trans)" check.  This is obviously incorrect, and the code
      properly deals with the case that the transaction doesn't exist.  Fix
      this ASSERT() to only fire if there's no trans and we don't have
      BTRFS_FS_ERROR() set on the file system.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4ca8e03c
    • Filipe Manana's avatar
      btrfs: fix lockdep splat and potential deadlock after failure running delayed items · e110f891
      Filipe Manana authored
      When running delayed items we are holding a delayed node's mutex and then
      we will attempt to modify a subvolume btree to insert/update/delete the
      delayed items. However if have an error during the insertions for example,
      btrfs_insert_delayed_items() may return with a path that has locked extent
      buffers (a leaf at the very least), and then we attempt to release the
      delayed node at __btrfs_run_delayed_items(), which requires taking the
      delayed node's mutex, causing an ABBA type of deadlock. This was reported
      by syzbot and the lockdep splat is the following:
      
        WARNING: possible circular locking dependency detected
        6.5.0-rc7-syzkaller-00024-g93f5de5f #0 Not tainted
        ------------------------------------------------------
        syz-executor.2/13257 is trying to acquire lock:
        ffff88801835c0c0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256
      
        but task is already holding lock:
        ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (btrfs-tree-00){++++}-{3:3}:
               __lock_release kernel/locking/lockdep.c:5475 [inline]
               lock_release+0x36f/0x9d0 kernel/locking/lockdep.c:5781
               up_write+0x79/0x580 kernel/locking/rwsem.c:1625
               btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
               btrfs_unlock_up_safe+0x179/0x3b0 fs/btrfs/locking.c:239
               search_leaf fs/btrfs/ctree.c:1986 [inline]
               btrfs_search_slot+0x2511/0x2f80 fs/btrfs/ctree.c:2230
               btrfs_insert_empty_items+0x9c/0x180 fs/btrfs/ctree.c:4376
               btrfs_insert_delayed_item fs/btrfs/delayed-inode.c:746 [inline]
               btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
               __btrfs_commit_inode_delayed_items+0xd24/0x2410 fs/btrfs/delayed-inode.c:1111
               __btrfs_run_delayed_items+0x1db/0x430 fs/btrfs/delayed-inode.c:1153
               flush_space+0x269/0xe70 fs/btrfs/space-info.c:723
               btrfs_async_reclaim_metadata_space+0x106/0x350 fs/btrfs/space-info.c:1078
               process_one_work+0x92c/0x12c0 kernel/workqueue.c:2600
               worker_thread+0xa63/0x1210 kernel/workqueue.c:2751
               kthread+0x2b8/0x350 kernel/kthread.c:389
               ret_from_fork+0x2e/0x60 arch/x86/kernel/process.c:145
               ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
               check_prev_add kernel/locking/lockdep.c:3142 [inline]
               check_prevs_add kernel/locking/lockdep.c:3261 [inline]
               validate_chain kernel/locking/lockdep.c:3876 [inline]
               __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
               lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
               __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603
               __mutex_lock kernel/locking/mutex.c:747 [inline]
               mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799
               __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256
               btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
               __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156
               btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276
               btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988
               vfs_fsync_range fs/sync.c:188 [inline]
               vfs_fsync fs/sync.c:202 [inline]
               do_fsync fs/sync.c:212 [inline]
               __do_sys_fsync fs/sync.c:220 [inline]
               __se_sys_fsync fs/sync.c:218 [inline]
               __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218
               do_syscall_x64 arch/x86/entry/common.c:50 [inline]
               do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
               entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(btrfs-tree-00);
                                       lock(&delayed_node->mutex);
                                       lock(btrfs-tree-00);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by syz-executor.2/13257:
         #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: spin_unlock include/linux/spinlock.h:391 [inline]
         #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0xb87/0xe00 fs/btrfs/transaction.c:287
         #1: ffff88802c1ee398 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0xbb2/0xe00 fs/btrfs/transaction.c:288
         #2: ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198
      
        stack backtrace:
        CPU: 0 PID: 13257 Comm: syz-executor.2 Not tainted 6.5.0-rc7-syzkaller-00024-g93f5de5f #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
        Call Trace:
         <TASK>
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
         check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195
         check_prev_add kernel/locking/lockdep.c:3142 [inline]
         check_prevs_add kernel/locking/lockdep.c:3261 [inline]
         validate_chain kernel/locking/lockdep.c:3876 [inline]
         __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
         lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
         __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603
         __mutex_lock kernel/locking/mutex.c:747 [inline]
         mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799
         __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256
         btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
         __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156
         btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276
         btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988
         vfs_fsync_range fs/sync.c:188 [inline]
         vfs_fsync fs/sync.c:202 [inline]
         do_fsync fs/sync.c:212 [inline]
         __do_sys_fsync fs/sync.c:220 [inline]
         __se_sys_fsync fs/sync.c:218 [inline]
         __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
        RIP: 0033:0x7f3ad047cae9
        Code: 28 00 00 00 75 (...)
        RSP: 002b:00007f3ad12510c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 00007f3ad059bf80 RCX: 00007f3ad047cae9
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
        RBP: 00007f3ad04c847a R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        R13: 000000000000000b R14: 00007f3ad059bf80 R15: 00007ffe56af92f8
         </TASK>
        ------------[ cut here ]------------
      
      Fix this by releasing the path before releasing the delayed node in the
      error path at __btrfs_run_delayed_items().
      
      Reported-by: syzbot+a379155f07c134ea9879@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/000000000000abba27060403b5bd@google.com/
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e110f891
    • Josef Bacik's avatar
      btrfs: do not block starts waiting on previous transaction commit · 77d20c68
      Josef Bacik authored
      Internally I got a report of very long stalls on normal operations like
      creating a new file when auto relocation was running.  The reporter used
      the 'bpf offcputime' tracer to show that we would get stuck in
      start_transaction for 5 to 30 seconds, and were always being woken up by
      the transaction commit.
      
      Using my timing-everything script, which times how long a function takes
      and what percentage of that total time is taken up by its children, I
      saw several traces like this
      
      1083 took 32812902424 ns
              29929002926 ns 91.2110% wait_for_commit_duration
              25568 ns 7.7920e-05% commit_fs_roots_duration
              1007751 ns 0.00307% commit_cowonly_roots_duration
              446855602 ns 1.36182% btrfs_run_delayed_refs_duration
              271980 ns 0.00082% btrfs_run_delayed_items_duration
              2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration
              9656 ns 2.9427e-05% switch_commit_roots_duration
              1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration
              4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration
      
      Here I was only tracing functions that happen where we are between
      START_COMMIT and UNBLOCKED in order to see what would be keeping us
      blocked for so long.  The wait_for_commit() we do is where we wait for a
      previous transaction that hasn't completed it's commit.  This can
      include all of the unpin work and other cleanups, which tends to be the
      longest part of our transaction commit.
      
      There is no reason we should be blocking new things from entering the
      transaction at this point, it just adds to random latency spikes for no
      reason.
      
      Fix this by adding a PREP stage.  This allows us to properly deal with
      multiple committers coming in at the same time, we retain the behavior
      that the winner waits on the previous transaction and the losers all
      wait for this transaction commit to occur.  Nothing else is blocked
      during the PREP stage, and then once the wait is complete we switch to
      COMMIT_START and all of the same behavior as before is maintained.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77d20c68
    • Filipe Manana's avatar
      btrfs: release path before inode lookup during the ino lookup ioctl · ee34a82e
      Filipe Manana authored
      During the ino lookup ioctl we can end up calling btrfs_iget() to get an
      inode reference while we are holding on a root's btree. If btrfs_iget()
      needs to lookup the inode from the root's btree, because it's not
      currently loaded in memory, then it will need to lock another or the
      same path in the same root btree. This may result in a deadlock and
      trigger the following lockdep splat:
      
        WARNING: possible circular locking dependency detected
        6.5.0-rc7-syzkaller-00004-gf7757129 #0 Not tainted
        ------------------------------------------------------
        syz-executor277/5012 is trying to acquire lock:
        ffff88802df41710 (btrfs-tree-01){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
      
        but task is already holding lock:
        ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (btrfs-tree-00){++++}-{3:3}:
               down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
               __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
               btrfs_search_slot+0x13a4/0x2f80 fs/btrfs/ctree.c:2302
               btrfs_init_root_free_objectid+0x148/0x320 fs/btrfs/disk-io.c:4955
               btrfs_init_fs_root fs/btrfs/disk-io.c:1128 [inline]
               btrfs_get_root_ref+0x5ae/0xae0 fs/btrfs/disk-io.c:1338
               btrfs_get_fs_root fs/btrfs/disk-io.c:1390 [inline]
               open_ctree+0x29c8/0x3030 fs/btrfs/disk-io.c:3494
               btrfs_fill_super+0x1c7/0x2f0 fs/btrfs/super.c:1154
               btrfs_mount_root+0x7e0/0x910 fs/btrfs/super.c:1519
               legacy_get_tree+0xef/0x190 fs/fs_context.c:611
               vfs_get_tree+0x8c/0x270 fs/super.c:1519
               fc_mount fs/namespace.c:1112 [inline]
               vfs_kern_mount+0xbc/0x150 fs/namespace.c:1142
               btrfs_mount+0x39f/0xb50 fs/btrfs/super.c:1579
               legacy_get_tree+0xef/0x190 fs/fs_context.c:611
               vfs_get_tree+0x8c/0x270 fs/super.c:1519
               do_new_mount+0x28f/0xae0 fs/namespace.c:3335
               do_mount fs/namespace.c:3675 [inline]
               __do_sys_mount fs/namespace.c:3884 [inline]
               __se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3861
               do_syscall_x64 arch/x86/entry/common.c:50 [inline]
               do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
               entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        -> #0 (btrfs-tree-01){++++}-{3:3}:
               check_prev_add kernel/locking/lockdep.c:3142 [inline]
               check_prevs_add kernel/locking/lockdep.c:3261 [inline]
               validate_chain kernel/locking/lockdep.c:3876 [inline]
               __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
               lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
               down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
               __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
               btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline]
               btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281
               btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline]
               btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154
               btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412
               btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline]
               btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716
               btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline]
               btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105
               btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683
               vfs_ioctl fs/ioctl.c:51 [inline]
               __do_sys_ioctl fs/ioctl.c:870 [inline]
               __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856
               do_syscall_x64 arch/x86/entry/common.c:50 [inline]
               do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
               entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          rlock(btrfs-tree-00);
                                       lock(btrfs-tree-01);
                                       lock(btrfs-tree-00);
          rlock(btrfs-tree-01);
      
         *** DEADLOCK ***
      
        1 lock held by syz-executor277/5012:
         #0: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
      
        stack backtrace:
        CPU: 1 PID: 5012 Comm: syz-executor277 Not tainted 6.5.0-rc7-syzkaller-00004-gf7757129 #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
        Call Trace:
         <TASK>
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
         check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195
         check_prev_add kernel/locking/lockdep.c:3142 [inline]
         check_prevs_add kernel/locking/lockdep.c:3261 [inline]
         validate_chain kernel/locking/lockdep.c:3876 [inline]
         __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
         lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
         down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
         __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
         btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline]
         btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281
         btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline]
         btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154
         btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412
         btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline]
         btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716
         btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline]
         btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105
         btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:870 [inline]
         __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
        RIP: 0033:0x7f0bec94ea39
      
      Fix this simply by releasing the path before calling btrfs_iget() as at
      point we don't need the path anymore.
      
      Reported-by: syzbot+bf66ad948981797d2f1d@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/00000000000045fa140603c4a969@google.com/
      Fixes: 23d0b79d ("btrfs: Add unprivileged version of ino_lookup ioctl")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ee34a82e
    • Filipe Manana's avatar
      btrfs: fix race between finishing block group creation and its item update · 2d6cd791
      Filipe Manana authored
      Commit 675dfe12 ("btrfs: fix block group item corruption after
      inserting new block group") fixed one race that resulted in not persisting
      a block group's item when its "used" bytes field decreases to zero.
      However there's another race that can happen in a much shorter time window
      that results in the same problem. The following sequence of steps explains
      how it can happen:
      
      1) Task A creates a metadata block group X, its "used" and "commit_used"
         fields are initialized to 0;
      
      2) Two extents are allocated from block group X, so its "used" field is
         updated to 32K, and its "commit_used" field remains as 0;
      
      3) Transaction commit starts, by some task B, and it enters
         btrfs_start_dirty_block_groups(). There it tries to update the block
         group item for block group X, which currently has its "used" field with
         a value of 32K and its "commit_used" field with a value of 0. However
         that fails since the block group item was not yet inserted, so at
         update_block_group_item(), the btrfs_search_slot() call returns 1, and
         then we set 'ret' to -ENOENT. Before jumping to the label 'fail'...
      
      4) The block group item is inserted by task A, when for example
         btrfs_create_pending_block_groups() is called when releasing its
         transaction handle. This results in insert_block_group_item() inserting
         the block group item in the extent tree (or block group tree), with a
         "used" field having a value of 32K and setting "commit_used", in struct
         btrfs_block_group, to the same value (32K);
      
      5) Task B jumps to the 'fail' label and then resets the "commit_used"
         field to 0. At btrfs_start_dirty_block_groups(), because -ENOENT was
         returned from update_block_group_item(), we add the block group again
         to the list of dirty block groups, so that we will try again in the
         critical section of the transaction commit when calling
         btrfs_write_dirty_block_groups();
      
      6) Later the two extents from block group X are freed, so its "used" field
         becomes 0;
      
      7) If no more extents are allocated from block group X before we get into
         btrfs_write_dirty_block_groups(), then when we call
         update_block_group_item() again for block group X, we will not update
         the block group item to reflect that it has 0 bytes used, because the
         "used" and "commit_used" fields in struct btrfs_block_group have the
         same value, a value of 0.
      
         As a result after committing the transaction we have an empty block
         group with its block group item having a 32K value for its "used" field.
         This will trigger errors from fsck ("btrfs check" command) and after
         mounting again the fs, the cleaner kthread will not automatically delete
         the empty block group, since its "used" field is not 0. Possibly there
         are other issues due to this inconsistency.
      
         When this issue happens, the error reported by fsck is like this:
      
           [1/7] checking root items
           [2/7] checking extents
           block group [1104150528 1073741824] used 39796736 but extent items used 0
           ERROR: errors found in extent allocation tree or chunk allocation
           (...)
      
      So fix this by not resetting the "commit_used" field of a block group when
      we don't find the block group item at update_block_group_item().
      
      Fixes: 7248e0ce ("btrfs: skip update of block group item if used bytes are the same")
      CC: stable@vger.kernel.org # 6.2+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2d6cd791
  11. 22 Aug, 2023 1 commit
    • Naohiro Aota's avatar
      btrfs: zoned: skip splitting and logical rewriting on pre-alloc write · c02d35d8
      Naohiro Aota authored
      When doing a relocation, there is a chance that at the time of
      btrfs_reloc_clone_csums(), there is no checksum for the corresponding
      region.
      
      In this case, btrfs_finish_ordered_zoned()'s sum points to an invalid item
      and so ordered_extent's logical is set to some invalid value. Then,
      btrfs_lookup_block_group() in btrfs_zone_finish_endio() failed to find a
      block group and will hit an assert or a null pointer dereference as
      following.
      
      This can be reprodcued by running btrfs/028 several times (e.g, 4 to 16
      times) with a null_blk setup. The device's zone size and capacity is set to
      32 MB and the storage size is set to 5 GB on my setup.
      
          KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f]
          CPU: 6 PID: 3105720 Comm: kworker/u16:13 Tainted: G        W          6.5.0-rc6-kts+ #1
          Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
          Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
          RIP: 0010:btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]
          Code: 41 54 49 89 fc 55 48 89 f5 53 e8 57 7d fc ff 48 8d b8 88 00 00 00 48 89 c3 48 b8 00 00 00 00 00
          > 3c 02 00 0f 85 02 01 00 00 f6 83 88 00 00 00 01 0f 84 a8 00 00
          RSP: 0018:ffff88833cf87b08 EFLAGS: 00010206
          RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
          RDX: 0000000000000011 RSI: 0000000000000004 RDI: 0000000000000088
          RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed102877b827
          R10: ffff888143bdc13b R11: ffff888125b1cbc0 R12: ffff888143bdc000
          R13: 0000000000007000 R14: ffff888125b1cba8 R15: 0000000000000000
          FS:  0000000000000000(0000) GS:ffff88881e500000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007f3ed85223d5 CR3: 00000001519b4005 CR4: 00000000001706e0
          Call Trace:
           <TASK>
           ? die_addr+0x3c/0xa0
           ? exc_general_protection+0x148/0x220
           ? asm_exc_general_protection+0x22/0x30
           ? btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]
           ? btrfs_zone_finish_endio.part.0+0x19/0x160 [btrfs]
           btrfs_finish_one_ordered+0x7b8/0x1de0 [btrfs]
           ? rcu_is_watching+0x11/0xb0
           ? lock_release+0x47a/0x620
           ? btrfs_finish_ordered_zoned+0x59b/0x800 [btrfs]
           ? __pfx_btrfs_finish_one_ordered+0x10/0x10 [btrfs]
           ? btrfs_finish_ordered_zoned+0x358/0x800 [btrfs]
           ? __smp_call_single_queue+0x124/0x350
           ? rcu_is_watching+0x11/0xb0
           btrfs_work_helper+0x19f/0xc60 [btrfs]
           ? __pfx_try_to_wake_up+0x10/0x10
           ? _raw_spin_unlock_irq+0x24/0x50
           ? rcu_is_watching+0x11/0xb0
           process_one_work+0x8c1/0x1430
           ? __pfx_lock_acquire+0x10/0x10
           ? __pfx_process_one_work+0x10/0x10
           ? __pfx_do_raw_spin_lock+0x10/0x10
           ? _raw_spin_lock_irq+0x52/0x60
           worker_thread+0x100/0x12c0
           ? __kthread_parkme+0xc1/0x1f0
           ? __pfx_worker_thread+0x10/0x10
           kthread+0x2ea/0x3c0
           ? __pfx_kthread+0x10/0x10
           ret_from_fork+0x30/0x70
           ? __pfx_kthread+0x10/0x10
           ret_from_fork_asm+0x1b/0x30
           </TASK>
      
      On the zoned mode, writing to pre-allocated region means data relocation
      write. Such write always uses WRITE command so there is no need of splitting
      and rewriting logical address. Thus, we can just skip the function for the
      case.
      
      Fixes: cbfce4c7 ("btrfs: optimize the logical to physical mapping for zoned writes")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c02d35d8
  12. 21 Aug, 2023 4 commits