1. 01 Oct, 2024 4 commits
    • btrfs: tracepoints: end assignment with semicolon at btrfs_qgroup_extent event class · 50c6f6e6
      Filipe Manana authored
      While running checkpatch.pl against a patch that modifies the
      btrfs_qgroup_extent event class, it complained about using a comma instead
      of a semicolon:
      
        $ ./scripts/checkpatch.pl qgroups/0003-btrfs-qgroups-remove-bytenr-field-from-struct-btrfs_.patch
        WARNING: Possible comma where semicolon could be used
        #215: FILE: include/trace/events/btrfs.h:1720:
        +		__entry->bytenr		= bytenr,
      		__entry->num_bytes	= rec->num_bytes;
      
        total: 0 errors, 1 warnings, 184 lines checked
      
      So replace the comma with a semicolon to silence checkpatch and possibly
      other tools. It also makes the code consistent with the rest.
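
      For reference, the corrected assignment block looks roughly like this (a
      condensed sketch of the event class's assign section, not the full
      header):

        TP_fast_assign(
                __entry->bytenr         = bytenr;
                __entry->num_bytes      = rec->num_bytes;
        ),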
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: drop the backref cache during relocation if we commit · db7e68b5
      Josef Bacik authored
      Since the inception of relocation we have maintained the backref cache
      across transaction commits, updating the backref cache with the new
      bytenr whenever we COWed blocks that were in the cache, and then
      updating their bytenr once we detected a transaction id change.
      
      This works as long as we're only ever modifying blocks, not changing the
      structure of the tree.
      
      However relocation does in fact change the structure of the tree.  For
      example, if we are relocating a data extent, we will look up all the
      leaves that point to this data extent.  We will then call
      do_relocation() on each of these leaves, which will COW down to the leaf
      and then update the file extent location.
      
      But, a key feature of do_relocation() is the pending list.  This is all
      the pending nodes that we modified when we updated the file extent item.
      We will then process all of these blocks via finish_pending_nodes, which
      calls do_relocation() on all of the nodes that led up to that leaf.
      
      The purpose of this is to make sure we don't break sharing unless we
      absolutely have to.  Consider the case that we have 3 snapshots that all
      point to this leaf through the same nodes, the initial COW would have
      created a whole new path.  If we did this for all 3 snapshots we would
      end up with 3x the number of nodes we had originally.  To avoid this we
      will cycle through each of the snapshots that point to each of these
      nodes and update their pointers to point at the new nodes.
      
      Once we update the pointer to the new node we will drop the node we
      removed the link for and all of its children via btrfs_drop_subtree().
      This is essentially just btrfs_drop_snapshot(), but for an arbitrary
      point in the snapshot.
      
      The problem with this is that we will never reflect this in the backref
      cache.  If we do this btrfs_drop_snapshot() for a node that is in the
      backref tree, we will leave the node in the backref tree.  This becomes
      a problem when we change the transid, as now the backref cache has
      entire subtrees that no longer exist but are still treated as if they
      were pointed to by the same roots.
      
      In the best case scenario you end up with "adding refs to an existing
      tree ref" errors from insert_inline_extent_backref(), where we attempt
      to link in nodes on roots that are no longer valid.
      
      Worst case you will double free some random block and re-use it when
      there are still references to the block.
      
      This is extremely subtle, and the consequences are quite bad.  There
      isn't a way to make sure our backref cache is consistent between
      transids.

      In order to fix this we need to simply evict the entire backref cache
      anytime we cross transids.  This reduces performance in that we have to
      rebuild this backref cache every time we change transids, but fixes the
      bug.
      
      This has existed since relocation was added, and is a pretty critical
      bug.  There's a lot more cleanup that can be done now that this
      functionality is going away, but this patch is as small as possible in
      order to fix the problem and make it easy for us to backport it to all
      the kernels it needs to be backported to.
      
      Followup series will dismantle more of this code and simplify relocation
      drastically to remove this functionality.
      
      We have a reproducer that reproduced the corruption within a few minutes
      of running.  With this patch it survives several iterations/hours of
      running the reproducer.
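
      A minimal sketch of the idea; the wrapper function here is illustrative,
      while btrfs_backref_release_cache() and the cache's last_trans field are
      the existing backref cache primitives:

        /* Illustrative: evict the whole backref cache once the transid changes. */
        static void maybe_drop_backref_cache(struct reloc_control *rc,
                                             struct btrfs_trans_handle *trans)
        {
                struct btrfs_backref_cache *cache = &rc->backref_cache;

                if (cache->last_trans && cache->last_trans != trans->transid)
                        /* Don't try to fix up stale nodes, just drop everything. */
                        btrfs_backref_release_cache(cache);
                cache->last_trans = trans->transid;
        }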
      
      Fixes: 3fd0a558 ("Btrfs: Metadata ENOSPC handling for balance")
      CC: stable@vger.kernel.org
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: also add stripe entries for NOCOW writes · 97f97822
      Johannes Thumshirn authored
      NOCOW writes do not generate stripe_extent entries in the RAID stripe
      tree, as the RAID stripe-tree feature was initially designed with a
      zoned filesystem in mind, and on a zoned filesystem we do not allow NOCOW
      writes. But the RAID stripe-tree feature is independent of the zoned
      feature, so we must also record stripe entries for NOCOW writes on RAID
      stripe-tree filesystems.
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: fix buffer overflow detection when copying path to cache entry · 96c6ca71
      Filipe Manana authored
      Starting with commit c0247d28 ("btrfs: send: annotate struct
      name_cache_entry with __counted_by()") we annotated the variable length
      array "name" from the name_cache_entry structure with __counted_by() to
      improve overflow detection. However that alone was not correct, because
      the length of that array does not match the "name_len" field - it matches
      that plus 1 to include the NUL string terminator, so that makes a
      fortified kernel think there's an overflow and report a splat like this:
      
        strcpy: detected buffer overflow: 20 byte write of buffer size 19
        WARNING: CPU: 3 PID: 3310 at __fortify_report+0x45/0x50
        CPU: 3 UID: 0 PID: 3310 Comm: btrfs Not tainted 6.11.0-prnet #1
        Hardware name: CompuLab Ltd.  sbc-ihsw/Intense-PC2 (IPC2), BIOS IPC2_3.330.7 X64 03/15/2018
        RIP: 0010:__fortify_report+0x45/0x50
        Code: 48 8b 34 (...)
        RSP: 0018:ffff97ebc0d6f650 EFLAGS: 00010246
        RAX: 7749924ef60fa600 RBX: ffff8bf5446a521a RCX: 0000000000000027
        RDX: 00000000ffffdfff RSI: ffff97ebc0d6f548 RDI: ffff8bf84e7a1cc8
        RBP: ffff8bf548574080 R08: ffffffffa8c40e10 R09: 0000000000005ffd
        R10: 0000000000000004 R11: ffffffffa8c70e10 R12: ffff8bf551eef400
        R13: 0000000000000000 R14: 0000000000000013 R15: 00000000000003a8
        FS:  00007fae144de8c0(0000) GS:ffff8bf84e780000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fae14691690 CR3: 00000001027a2003 CR4: 00000000001706f0
        Call Trace:
         <TASK>
         ? __warn+0x12a/0x1d0
         ? __fortify_report+0x45/0x50
         ? report_bug+0x154/0x1c0
         ? handle_bug+0x42/0x70
         ? exc_invalid_op+0x1a/0x50
         ? asm_exc_invalid_op+0x1a/0x20
         ? __fortify_report+0x45/0x50
         __fortify_panic+0x9/0x10
        __get_cur_name_and_parent+0x3bc/0x3c0
         get_cur_path+0x207/0x3b0
         send_extent_data+0x709/0x10d0
         ? find_parent_nodes+0x22df/0x25d0
         ? mas_nomem+0x13/0x90
         ? mtree_insert_range+0xa5/0x110
         ? btrfs_lru_cache_store+0x5f/0x1e0
         ? iterate_extent_inodes+0x52d/0x5a0
         process_extent+0xa96/0x11a0
         ? __pfx_lookup_backref_cache+0x10/0x10
         ? __pfx_store_backref_cache+0x10/0x10
         ? __pfx_iterate_backrefs+0x10/0x10
         ? __pfx_check_extent_item+0x10/0x10
         changed_cb+0x6fa/0x930
         ? tree_advance+0x362/0x390
         ? memcmp_extent_buffer+0xd7/0x160
         send_subvol+0xf0a/0x1520
         btrfs_ioctl_send+0x106b/0x11d0
         ? __pfx___clone_root_cmp_sort+0x10/0x10
         _btrfs_ioctl_send+0x1ac/0x240
         btrfs_ioctl+0x75b/0x850
         __se_sys_ioctl+0xca/0x150
         do_syscall_64+0x85/0x160
         ? __count_memcg_events+0x69/0x100
         ? handle_mm_fault+0x1327/0x15c0
         ? __se_sys_rt_sigprocmask+0xf1/0x180
         ? syscall_exit_to_user_mode+0x75/0xa0
         ? do_syscall_64+0x91/0x160
         ? do_user_addr_fault+0x21d/0x630
        entry_SYSCALL_64_after_hwframe+0x76/0x7e
        RIP: 0033:0x7fae145eeb4f
        Code: 00 48 89 (...)
        RSP: 002b:00007ffdf1cb09b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fae145eeb4f
        RDX: 00007ffdf1cb0ad0 RSI: 0000000040489426 RDI: 0000000000000004
        RBP: 00000000000078fe R08: 00007fae144006c0 R09: 00007ffdf1cb0927
        R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffdf1cb1ce8
        R13: 0000000000000003 R14: 000055c499fab2e0 R15: 0000000000000004
         </TASK>
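
      In other words, the __counted_by() annotation bounds the flexible array
      by "name_len", while the code writes "name_len" plus the NUL terminator.
      Condensed, the mismatch looks roughly like this (other members elided,
      and the source buffer name is illustrative):

        struct name_cache_entry {
                /* ... other members elided ... */
                int name_len;
                /* fortify assumes at most name_len bytes are valid here */
                char name[] __counted_by(name_len);
        };

        /* ... but the copy writes name_len + 1 bytes, including the NUL */
        strcpy(nce->name, source_path);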
      
      Fix this by not storing the NUL string terminator since we don't actually
      need it for name cache entries; this way "name_len" corresponds to the
      actual size of the "name" array. This requires marking the "name" array
      field with __nonstring and using memcpy() instead of strcpy() as
      recommended by the guidelines at:
      
         https://github.com/KSPP/linux/issues/90
      Reported-by: David Arendt <admin@prnet.org>
      Link: https://lore.kernel.org/linux-btrfs/cee4591a-3088-49ba-99b8-d86b4242b8bd@prnet.org/
      Fixes: c0247d28 ("btrfs: send: annotate struct name_cache_entry with __counted_by()")
      CC: stable@vger.kernel.org # 6.11
      Tested-by: David Arendt <admin@prnet.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 17 Sep, 2024 3 commits
    • btrfs: fix use-after-free on rbtree that tracks inodes for auto defrag · 7f1b63f9
      Filipe Manana authored
      When cleaning up defrag inodes at btrfs_cleanup_defrag_inodes(), called
      during remount and unmount, we are freeing every node from the rbtree
      that tracks inodes for auto defrag using
      rbtree_postorder_for_each_entry_safe(), which doesn't modify the tree
      itself. So once we unlock the lock that protects the rbtree, we have a
      tree pointing to a root that was freed (and a root pointing to freed
      nodes, and their children pointing to other freed nodes, and so on).
      This makes further access to the tree result in a use-after-free with
      unpredictable results.
      
      Fix this by initializing the rbtree to an empty root after the call to
      rbtree_postorder_for_each_entry_safe() and before unlocking.
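
      The resulting pattern is roughly the following (a sketch, assuming the
      lock and cache names used by the defrag code):

        struct inode_defrag *entry, *next;

        spin_lock(&fs_info->defrag_inodes_lock);
        rbtree_postorder_for_each_entry_safe(entry, next,
                                             &fs_info->defrag_inodes, rb_node)
                kmem_cache_free(btrfs_inode_defrag_cachep, entry);
        /* The iterator leaves the tree untouched, so reset it before unlocking. */
        fs_info->defrag_inodes = RB_ROOT;
        spin_unlock(&fs_info->defrag_inodes_lock);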
      
      Fixes: 27694091 ("btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()")
      Reported-by: syzbot+ad7966ca1f5dd8b001b3@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/000000000000f9aad406223eabff@google.com/
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tree-checker: fix the wrong output of data backref objectid · b0b595e6
      Qu Wenruo authored
      [BUG]
      There are some reports about invalid data backref objectids, the report
      looks like this:
      
        BTRFS critical (device sda): corrupt leaf: block=333654787489792 slot=110 extent bytenr=333413935558656 len=65536 invalid data ref objectid value 2543
      
      The data ref objectid is the inode number inside the subvolume.
      
      But in the above case, the value is completely sane, so it does not
      really show the problem.
      
      [CAUSE]
      The root cause of the problem is the deprecated inode cache feature.

      That feature results in a special inode number, -12ULL, which is no
      longer recognized by the tree-checker, triggering the error.
      
      The direct problem here is the output of data ref objectid. The value
      shown is in fact the dref_root (subvolume id), not the dref_objectid
      (inode number).
      
      [FIX]
      Fix the output to use dref_objectid instead.
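
      The change boils down to printing the right local variable in the error
      message, along the lines of (a sketch in the style of the tree-checker's
      extent_err() reporting, not the exact hunk):

        /* before: prints the subvolume id (dref_root) */
        extent_err(leaf, slot, "invalid data ref objectid value %llu",
                   dref_root);

        /* after: prints the inode number the check actually complains about */
        extent_err(leaf, slot, "invalid data ref objectid value %llu",
                   dref_objectid);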
      Reported-by: Neil Parton <njparton@gmail.com>
      Reported-by: Archange <archange@archlinux.org>
      Link: https://lore.kernel.org/linux-btrfs/CAAYHqBbrrgmh6UmW3ANbysJX9qG9Pbg3ZwnKsV=5mOpv_qix_Q@mail.gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/9541deea-9056-406e-be16-a996b549614d@archlinux.org/
      Fixes: f333a3c7 ("btrfs: tree-checker: validate dref root and objectid")
      CC: stable@vger.kernel.org # 6.11
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix race setting file private on concurrent lseek using same fd · 7ee85f55
      Filipe Manana authored
      When doing concurrent lseek(2) system calls against the same file
      descriptor, using multiple threads belonging to the same process, we have
      a short time window where a race happens and can result in a memory leak.
      
      The race happens like this:
      
      1) A program opens a file descriptor for a file and then spawns two
         threads (with the pthreads library for example), let's call them
         task A and task B;
      
      2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at
         file.c:find_desired_extent() while holding a read lock on the inode;
      
      3) At the start of find_desired_extent(), it extracts the file's
         private_data pointer into a local variable named 'private', which has
         a value of NULL;
      
      4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode
         in shared mode and enters file.c:find_desired_extent(), where it also
         extracts file->private_data into its local variable 'private', which
         has a NULL value;
      
      5) Because it saw a NULL file private, task A allocates a private
         structure and assigns it to the file structure;
      
      6) Task B also saw a NULL file private so it also allocates its own file
         private and then assigns it to the same file structure, since both
         tasks are using the same file descriptor.
      
         At this point we leak the private structure allocated by task A.
      
      Besides the memory leak, there's also the detail that both tasks end up
      using the same cached state record in the private structure (struct
      btrfs_file_private::llseek_cached_state), which can result in a
      use-after-free problem since one task can free it while the other is
      still using it (only one task took a reference count on it). Also, sharing
      the cached state is not a good idea since it could result in incorrect
      results in the future - right now it should not be a problem because it
      ends up being used only in extent-io-tree.c:count_range_bits() where we do
      range validation before using the cached state.
      
      Fix this by protecting the private assignment and check of a file while
      holding the inode's spinlock, and by keeping track of the task that
      allocated the private, so that it's used only by that task in order to
      prevent use-after-free issues with the cached state record as well as
      potentially using it incorrectly in the future.
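
      A sketch of the resulting pattern (the "owner_task" field name is
      illustrative, and "inode" stands for the btrfs inode of the file):

        struct btrfs_file_private *private = file->private_data;
        struct btrfs_file_private *tmp = NULL;

        if (!private)
                tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);

        spin_lock(&inode->lock);
        if (!file->private_data && tmp) {
                tmp->owner_task = current;      /* illustrative field name */
                file->private_data = tmp;
                tmp = NULL;
        }
        private = file->private_data;
        spin_unlock(&inode->lock);
        /* Lost the race: another task installed its private first. */
        kfree(tmp);

        /* Only the owner task gets to use the cached state. */
        if (private && private->owner_task != current)
                private = NULL;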
      
      Fixes: 3c32c721 ("btrfs: use cached state when looking for delalloc ranges with lseek")
      CC: stable@vger.kernel.org # 6.6+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 10 Sep, 2024 33 commits
    • btrfs: only unlock the to-be-submitted ranges inside a folio · bd610c09
      Qu Wenruo authored
      [SUBPAGE COMPRESSION LIMITS]
      Currently inside writepage_delalloc(), if a delalloc range is going to
      be submitted asynchronously (inline or compression, where the page
      dirty/writeback/unlock are all handled at a different time, not at
      submission time), then we return 1 and extent_writepage() will skip the
      submission.
      
      This is fine if every sector matches page size, but if a sector is
      smaller than page size (aka, subpage case), then it can be very
      problematic, for example for the following 64K page:
      
           0     16K     32K    48K     64K
           |/|   |///////|      |/|
             |                    |
             4K                   52K
      
      Where |/| is the dirty range we need to submit.
      
      In the above case, we need the following different handling for the 3
      ranges:
      
      - [0, 4K) needs to be submitted for regular write
        A single sector cannot be compressed.
      
      - [16K, 32K) needs to be submitted for compressed write
      
      - [48K, 52K) needs to be submitted for regular write.
      
      Above, if we try to submit [16K, 32K) for compressed write, we will
      return 1 immediately, without submitting the remaining [48K, 52K)
      range.

      Furthermore, since extent_writepage() will exit without unlocking any
      sectors, the submitted range [0, 4K) will not have its sectors unlocked.

      That's the reason why, for now, subpage is only allowed for the full
      page range.
      
      [ENHANCEMENT]
      - Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
        This records which sectors will be submitted by extent_writepage_io().
        This allows us to track which sectors need to be submitted and thus
        can later be properly unlocked.
      
        For asynchronously submitted range (inline/compression), the
        corresponding bits will be cleared from that bitmap.
      
      - Only return 1 if no sector needs to be submitted in
        writepage_delalloc()
      
      - Only submit sectors marked by submission bitmap inside
        extent_writepage_io()
        So we won't touch the asynchronously submitted part.
      
      - Introduce btrfs_folio_end_writer_lock_bitmap() helper
        This will only unlock the involved sectors specified by @bitmap
        parameter, to avoid touching the range asynchronously submitted.
      
      Please note that, since subpage compression is still limited to page
      aligned range, this change is only a preparation for future sector
      perfect compression support for subpage.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: merge btrfs_folio_unlock_writer() into btrfs_folio_end_writer_lock() · 49a99073
      Qu Wenruo authored
      The function btrfs_folio_unlock_writer() is already calling
      btrfs_folio_end_writer_lock() to do the heavy lifting work; the only
      missing part is the 0 writer check.
      
      Thus there is no need to keep two different functions, move the 0 writer
      check into btrfs_folio_end_writer_lock(), and remove
      btrfs_folio_unlock_writer().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: BTRFS_PATH_AUTO_FREE in orphan.c · 68f32b9c
      Leo Martins authored
      All cleanup paths lead to btrfs_path_free so path can be defined with
      the automatic freeing callback in the following functions:
      
      - btrfs_insert_orphan_item()
      - btrfs_del_orphan_item()
      Signed-off-by: Leo Martins <loemra.dev@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_path auto free in zoned.c · 45763a0c
      Leo Martins authored
      All cleanup paths lead to btrfs_path_free so path can be defined with
      the automatic freeing callback in the following functions:
      
      - calculate_emulated_zone_size()
      - calculate_alloc_pointer()
      Signed-off-by: Leo Martins <loemra.dev@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: DEFINE_FREE for struct btrfs_path · 4c74a32a
      Leo Martins authored
      Add a DEFINE_FREE for struct btrfs_path. This defines a function that
      can be called using the __free attribute. Define a macro
      BTRFS_PATH_AUTO_FREE to make the declaration of an auto freeing path
      very clear.
      
      The intended use is to define the auto free of path in cases where the
      path is allocated somewhere at the beginning and freed either on all
      error paths or at the end of the function.
      
        int func() {
      	  BTRFS_PATH_AUTO_FREE(path);
      
      	  if (...)
      		  return -ERROR;
      
      	  path = alloc_path();
      
      	  ...
      
      	  if (...)
      		  return -ERROR;
      
      	  ...
      	  return 0;
        }
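
      The definition itself is a thin wrapper around the kernel's cleanup
      attribute from <linux/cleanup.h>; roughly (a sketch, relying on
      btrfs_free_path() accepting a NULL pointer):

        DEFINE_FREE(btrfs_free_path, struct btrfs_path *, btrfs_free_path(_T))

        #define BTRFS_PATH_AUTO_FREE(path_name)                         \
                struct btrfs_path *path_name __free(btrfs_free_path) = NULL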
      Signed-off-by: Leo Martins <loemra.dev@gmail.com>
      [ update changelog ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove btrfs_folio_end_all_writers() · ab6eac7c
      Qu Wenruo authored
      The function btrfs_folio_end_all_writers() is only utilized in
      extent_writepage() as a way to unlock all subpage range (for both
      successful submission and error handling).
      
      Meanwhile we have a similar function, btrfs_folio_end_writer_lock().
      
      The difference is, btrfs_folio_end_writer_lock() expects a range that is
      a subset of the already locked range.
      
      This limit on btrfs_folio_end_writer_lock() is a little overkill,
      preventing it from being utilized for error paths.
      
      So here we enhance btrfs_folio_end_writer_lock() to accept a superset of
      the locked range, and only end the locked subset.
      This means we can replace btrfs_folio_end_all_writers() with
      btrfs_folio_end_writer_lock() instead.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: constify more pointer parameters · ca283ea9
      David Sterba authored
      Continue adding const to parameters.  This is for clarity and a minor
      addition to safety. There are some minor effects on the assembly code
      and the .ko size, measured on a release config.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rework BTRFS_I as macro to preserve parameter const · 070969f1
      David Sterba authored
      Currently BTRFS_I is a static inline function that takes a const inode
      and returns the btrfs inode, dropping the 'const' qualifier. This can
      break assumptions of the compiler, though it seems there's no real case.

      To make the parameter and return type consistent regarding const we can
      use container_of_const(), which preserves it. However this would not
      check the parameter type. To fix that use the same _Generic construct
      but implement only the two expected types.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add and use helper to verify the calling task has locked the inode · 1b6e068a
      Filipe Manana authored
      We have a few places that check if we have the inode locked by doing:
      
          ASSERT(inode_is_locked(vfs_inode));
      
      This actually proved to be useful several times because, if assertions
      are enabled (and by default they are in many distros), it immediately
      triggers a crash which is impossible for users to miss.
      
      However that doesn't check if the lock is held by the calling task, so
      the check passes if some other task locked the inode.
      
      Using one of the lockdep functions to check the lock is held, like
      lockdep_assert_held() for example, does check that the calling task
      holds the lock, and if that's not the case it produces a warning and
      stack trace in dmesg. However, despite the misleading "assert" in the
      name of the lockdep helpers, it does not trigger a crash/BUG_ON(), just
      a warning and splat in dmesg, which can easily go unnoticed by users
      who may have lockdep enabled.
      
      So add a helper that does the ASSERT() and calls lockdep_assert_held()
      immediately after, and use it everywhere we check the inode is locked.
      This way, if the lock is held by some other task, we get the warning
      in dmesg, which is caught by fstests, very helpful during development,
      and may also be occasionally noticed by users with lockdep enabled.
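
      The helper is essentially the two checks back to back (a sketch; the
      exact name and header placement follow the patch):

        static inline void btrfs_assert_inode_locked(struct btrfs_inode *inode)
        {
                /* Crash on assertion-enabled builds if the lock is not held at all. */
                ASSERT(inode_is_locked(&inode->vfs_inode));
                /* And let lockdep verify that the calling task is the one holding it. */
                lockdep_assert_held(&inode->vfs_inode.i_rwsem);
        }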
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: always update fstrim_range on failure in FITRIM ioctl · 33685972
      Luca Stefani authored
      Even in case of failure we could've discarded some data and userspace
      should be made aware of it, so copy fstrim_range to userspace
      regardless.
      
      Also make sure to update the trimmed bytes amount even if
      btrfs_trim_free_extents fails.
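
      In the ioctl handler that means copying the structure back to userspace
      before looking at the return value, roughly (a sketch of the resulting
      flow, variable names as in a typical ioctl handler):

        ret = btrfs_trim_fs(fs_info, &range);

        /* Report the trimmed amount even if trimming stopped with an error. */
        if (copy_to_user(arg, &range, sizeof(range)))
                return -EFAULT;

        return ret;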
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert copy_inline_to_page() to use folio · faad57ae
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Moreover, find_or_create_page() is a compatible API, and it
      can be replaced with __filemap_get_folio(). Some interfaces have been converted
      to use folio before, so the conversion operation from page can be
      eliminated here.
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert btrfs_decompress() to take a folio · aeb6d881
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Based on the previous patch, the compression path can be
      directly used in folio without converting to page.
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert zstd_decompress() to take a folio · b70f3a45
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
      There is no memzero_folio(), but it can be replaced equivalently by
      folio_zero_range().
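
      In practice each hunk is a one-to-one replacement of the page helpers
      with their folio counterparts, e.g. (a generic sketch, variable names
      illustrative):

        /* before */
        memcpy_to_page(dest_page, dest_offset, workspace->out_buf.dst, copy_len);
        memzero_page(dest_page, dest_offset + copy_len, dest_len - copy_len);

        /* after */
        memcpy_to_folio(dest_folio, dest_offset, workspace->out_buf.dst, copy_len);
        folio_zero_range(dest_folio, dest_offset + copy_len, dest_len - copy_len);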
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert lzo_decompress() to take a folio · 9f9a4e43
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
      There is no memzero_folio(), but it can be replaced equivalently by
      folio_zero_range().
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert zlib_decompress() to take a folio · 54c78d49
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
      There is no memzero_folio(), but it can be replaced equivalently by
      folio_zero_range().
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert try_release_extent_mapping() to take a folio · 046c0d65
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. And page_to_inode() can be replaced with folio_to_inode() now.
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert try_release_extent_state() to take a folio · dd0a8df4
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Moreover, use folio_pos() instead of page_offset(),
      which is more consistent with folio usage.
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert submit_eb_page() to take a folio · 08dd8507
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio.
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert submit_eb_subpage() to take a folio · 13587325
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Moreover, use folio_pos() instead of page_offset(),
      which is more consistent with folio usage.
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert read_key_bytes() to take a folio · 88493779
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Moreover, use kmap_local_folio() instead of kmap_local_page(),
      which is more consistent with folio usage.
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert try_release_extent_buffer() to take a folio · b8ae2bfa
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio.
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert try_release_subpage_extent_buffer() to take a folio · 0145aa38
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. And use folio_pos instead of page_offset, which is more
      consistent with folio usage. At the same time, folio_test_private() can
      handle folio directly without converting from page to folio first.
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert get_next_extent_buffer() to take a folio · d4aeb5f7
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Use folio_pos instead of page_offset, which is more
      consistent with folio usage.
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert clear_page_extent_mapped() to take a folio · 266a9361
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Now clear_page_extent_mapped() can deal with a folio
      directly, so change its name to clear_folio_extent_mapped().
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make compression path to be subpage compatible · fd1e75d0
      Qu Wenruo authored
      Currently the btrfs compression path is not really subpage compatible,
      everything is still done in page units.

      That's fine for the regular sector size and for the subpage routine, as
      even for the subpage routine compression is only enabled if the whole
      range is page aligned, so reading the page cache in page units is
      totally fine.
      
      However in preparation for the future subpage perfect compression
      support, we need to change the compression routine to properly handle a
      subpage range.
      
      This patch prepares both zlib and zstd to only read the subpage
      range for compression.
      Lzo already does subpage-aware reads, as its on-disk format is
      sectorsize dependent.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: merge btrfs_orig_bbio_end_io() into btrfs_bio_end_io() · 9ca0e58c
      Qu Wenruo authored
      There are only two differences between the two functions:
      
      - btrfs_orig_bbio_end_io() does extra error propagation
        This is mostly to allow tolerance for write errors.
      
      - btrfs_orig_bbio_end_io() does extra pending_ios check
        This check can handle both the original bio and the cloned one.
        (All accounting happens in the original one).
      
      This makes btrfs_orig_bbio_end_io() a much safer call.
      In fact we already had a double freeing error due to usage of
      btrfs_bio_end_io() in the error path of btrfs_submit_chunk().
      
      So just move the whole content of btrfs_orig_bbio_end_io() into
      btrfs_bio_end_io().
      
      For normal paths this brings no change, because they are already calling
      btrfs_orig_bbio_end_io() in the first place.
      
      For error paths (not only inside bio.c but also external callers), this
      change will introduce extra checks, especially for external callers, as
      they will error out without submitting the btrfs bio.
      
      But considering it's already in the error path, such slower but much
      safer checks are still an overall win.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not hold the extent lock for entire read · ac325fc2
      Josef Bacik authored
      Historically we've held the extent lock throughout the entire read.
      There have been a few reasons for this, but it's mostly just caused us
      problems.  For example, this prevents us from allowing page faults
      during direct io reads, because we could deadlock.  This has forced us
      to only allow 4k reads at a time for io_uring NOWAIT requests because we
      have no idea if we'll be forced to page fault and thus have to do a
      whole lot of work.
      
      On the buffered side we are protected by the page lock: as long as we're
      reading, things like buffered writes, punch hole, and even direct IO to a
      certain degree will get hung up on the page lock while the page is in
      flight.
      
      On the direct side we have the dio extent lock, which acts much like the
      way the extent lock worked previously to this patch, however just for
      direct reads.  This protects direct reads from concurrent direct writes,
      while we're protected from buffered writes via the inode lock.
      
      Now that we're protected in all cases, narrow the extent lock to the
      part where we're getting the extent map to submit the reads, no longer
      holding the extent lock for the entire read operation.  Push the extent
      lock down into do_readpage() so that we're only grabbing it when looking
      up the extent map.  This portion was contributed by Goldwyn.
      Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: take the dio extent lock during O_DIRECT operations · 07d399cb
      Josef Bacik authored
      Currently we hold the extent lock for the entire duration of a read.
      This isn't really necessary in the buffered case, we're protected by the
      page lock, however it's necessary for O_DIRECT.
      
      For O_DIRECT reads, if we only locked the extent for the part where we
      get the extent, we could potentially race with an O_DIRECT write in the
      same region.  This isn't really a problem, unless the read is delayed so
      much that the write does the COW, unpins the old extent, and some other
      application re-allocates the extent before the read is actually able to
      be submitted.  At that point at best we'd have a checksum mismatch, but
      at worst we could read data that doesn't belong to us.
      
      To address this potential race we need to make sure we don't have
      overlapping, concurrent direct io reads and writes.
      
      To accomplish this use the new EXTENT_DIO_LOCKED bit in the direct IO
      case in the same spot as the current extent lock.  The writes will take
      this while they're creating the ordered extent, which is also used to
      make sure concurrent buffered reads or concurrent direct reads are not
      allowed to occur, and drop it after the ordered extent is taken.  For
      reads it will act like the current read behavior for the EXTENT_LOCKED
      bit: we set it when we're starting the read and clear it in the end_io
      to allow other direct writes to continue.
      
      This still has the drawback of disallowing concurrent overlapping direct
      reads from occurring, but that exists with the current extent locking.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce EXTENT_DIO_LOCKED · 7e2a5950
      Josef Bacik authored
      In order to support dropping the extent lock during a read we need a way
      to make sure that direct reads and direct writes for overlapping ranges
      are protected from each other.  To accomplish this introduce another
      lock bit specifically for direct io.  Subsequent patches will utilize
      this to protect direct IO operations.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: always pass readahead state to defrag · df2825e9
      David Sterba authored
      The defrag ioctl passes the readahead state from the file, but autodefrag
      does not have a file, so the readahead state is allocated when needed.

      The autodefrag loop in the cleaner thread iterates over inodes, so we can
      simply provide an on-stack readahead state and will not need to allocate
      it in btrfs_defrag_file(). The size is 32 bytes which is acceptable.
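
      Roughly, the cleaner-thread call then looks like this (a sketch; the
      surrounding defrag bookkeeping is omitted):

        struct btrfs_ioctl_defrag_range_args range = { .len = (u64)-1 };
        struct file_ra_state ra;
        int ret;

        file_ra_state_init(&ra, inode->i_mapping);
        ret = btrfs_defrag_file(inode, &ra, &range, defrag->transid,
                                BTRFS_DEFRAG_BATCH);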
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: drop transaction parameter from btrfs_add_inode_defrag() · 11e3107d
      David Sterba authored
      There's only one caller, inode_should_defrag(), that passes NULL to
      btrfs_add_inode_defrag(), so we can drop the parameter and simplify the
      code.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: return void from btrfs_add_inode_defrag() · 91c9f285
      David Sterba authored
      The potential memory allocation failure is not a fatal error: skipping
      autodefrag is fine, and the caller inode_should_defrag() does not care
      about the errors.  Further writes can attempt to add the inode back to
      the defragmentation list again.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes() · 27694091
      David Sterba authored
      btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
      or unmount, but the way it frees the inodes in fs_info->defrag_inodes
      is inefficient. Each time it needs to locate the first node, remove it,
      and potentially rebalance the tree, until it's done. This approach does
      allow a conditional reschedule.
      
      For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
      convenient but we can't reschedule and restart iteration because some of
      the tree nodes would be already freed.
      
      The cleanup operation is kmem_cache_free() which will likely take the
      fast path for most objects so rescheduling should not be necessary.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>