1. 18 Jul, 2023 8 commits
    • btrfs: fix warning when putting transaction with qgroups enabled after abort · aa84ce8a
      Filipe Manana authored
      If we have a transaction abort with qgroups enabled we get a warning
      triggered when doing the final put on the transaction, like this:
      
        [552.6789] ------------[ cut here ]------------
        [552.6815] WARNING: CPU: 4 PID: 81745 at fs/btrfs/transaction.c:144 btrfs_put_transaction+0x123/0x130 [btrfs]
        [552.6817] Modules linked in: btrfs blake2b_generic xor (...)
        [552.6819] CPU: 4 PID: 81745 Comm: btrfs-transacti Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
        [552.6819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
        [552.6819] RIP: 0010:btrfs_put_transaction+0x123/0x130 [btrfs]
        [552.6821] Code: bd a0 01 00 (...)
        [552.6821] RSP: 0018:ffffa168c0527e28 EFLAGS: 00010286
        [552.6821] RAX: ffff936042caed00 RBX: ffff93604a3eb448 RCX: 0000000000000000
        [552.6821] RDX: ffff93606421b028 RSI: ffffffff92ff0878 RDI: ffff93606421b010
        [552.6821] RBP: ffff93606421b000 R08: 0000000000000000 R09: ffffa168c0d07c20
        [552.6821] R10: 0000000000000000 R11: ffff93608dc52950 R12: ffffa168c0527e70
        [552.6821] R13: ffff93606421b000 R14: ffff93604a3eb420 R15: ffff93606421b028
        [552.6821] FS:  0000000000000000(0000) GS:ffff93675fb00000(0000) knlGS:0000000000000000
        [552.6821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [552.6821] CR2: 0000558ad262b000 CR3: 000000014feda005 CR4: 0000000000370ee0
        [552.6822] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [552.6822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [552.6822] Call Trace:
        [552.6822]  <TASK>
        [552.6822]  ? __warn+0x80/0x130
        [552.6822]  ? btrfs_put_transaction+0x123/0x130 [btrfs]
        [552.6824]  ? report_bug+0x1f4/0x200
        [552.6824]  ? handle_bug+0x42/0x70
        [552.6824]  ? exc_invalid_op+0x14/0x70
        [552.6824]  ? asm_exc_invalid_op+0x16/0x20
        [552.6824]  ? btrfs_put_transaction+0x123/0x130 [btrfs]
        [552.6826]  btrfs_cleanup_transaction+0xe7/0x5e0 [btrfs]
        [552.6828]  ? _raw_spin_unlock_irqrestore+0x23/0x40
        [552.6828]  ? try_to_wake_up+0x94/0x5e0
        [552.6828]  ? __pfx_process_timeout+0x10/0x10
        [552.6828]  transaction_kthread+0x103/0x1d0 [btrfs]
        [552.6830]  ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
        [552.6832]  kthread+0xee/0x120
        [552.6832]  ? __pfx_kthread+0x10/0x10
        [552.6832]  ret_from_fork+0x29/0x50
        [552.6832]  </TASK>
        [552.6832] ---[ end trace 0000000000000000 ]---
      
      This corresponds to this line of code:
      
        void btrfs_put_transaction(struct btrfs_transaction *transaction)
        {
            (...)
                WARN_ON(!RB_EMPTY_ROOT(
                                &transaction->delayed_refs.dirty_extent_root));
            (...)
        }
      
      The warning happens because at btrfs_qgroup_destroy_extent_records(),
      called in the transaction abort path, we free all entries from the rbtree
      "dirty_extent_root" with rbtree_postorder_for_each_entry_safe(), but we
      don't actually empty the rbtree - it's still pointing to nodes that were
      freed.
      
      So set the rbtree's root node to NULL to avoid this warning (assign
      RB_ROOT).
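
The idea can be illustrated outside the kernel with a minimal userspace sketch. It is not the btrfs code: `struct node`, `struct root`, `tree_insert()`, `tree_destroy()` and `tree_empty()` are hypothetical stand-ins for the kernel's rbtree types and helpers, and the tree is a plain unbalanced binary search tree. It only shows why freeing every node is not enough, and why the root must be reset as well.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's struct rb_node / struct rb_root. */
struct node {
    struct node *left;
    struct node *right;
    int key;
};

struct root {
    struct node *node;
};

/* Insert into an unbalanced binary search tree (rebalancing omitted). */
int tree_insert(struct root *root, int key)
{
    struct node **link = &root->node;
    struct node *n;

    while (*link)
        link = key < (*link)->key ? &(*link)->left : &(*link)->right;

    n = calloc(1, sizeof(*n));
    if (!n)
        return -1;
    n->key = key;
    *link = n;
    return 0;
}

/* Free both children before their parent (post-order traversal). */
static void free_postorder(struct node *n)
{
    if (!n)
        return;
    free_postorder(n->left);
    free_postorder(n->right);
    free(n);
}

/* Free every entry, then reset the root (the analogue of assigning RB_ROOT). */
void tree_destroy(struct root *root)
{
    free_postorder(root->node);
    /* Without this line the root keeps pointing at freed memory. */
    root->node = NULL;
}

/* Analogue of RB_EMPTY_ROOT(): only the root pointer is inspected. */
int tree_empty(const struct root *root)
{
    return root->node == NULL;
}
```

An emptiness check that only looks at the root pointer, like `tree_empty()` here, reports a non-empty tree after a destroy that skips the reset, which is the situation the WARN_ON() above detects.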
      
      Fixes: 81f7eb00 ("btrfs: destroy qgroup extent records on transaction abort")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix ordered extent split error handling in btrfs_dio_submit_io · 7cad645e
      Christoph Hellwig authored
      When the call to btrfs_extract_ordered_extent in btrfs_dio_submit_io
      fails to allocate memory for a new ordered_extent, it calls into the
      btrfs_dio_end_io for error handling.  btrfs_dio_end_io then assumes that
      bbio->ordered is set because it is supposed to be at this point, except
      for this error handling corner case.  Try to not overload the
      btrfs_dio_end_io with error handling of a bio in a non-canonical state,
      and instead call btrfs_finish_ordered_extent and iomap_dio_bio_end_io
      directly for this error case.

      Reported-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com>
      Fixes: b41b6f69 ("btrfs: use btrfs_finish_ordered_extent to complete direct writes")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Tested-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand · 17b17fcd
      Josef Bacik authored
      While trying to get the subpage blocksize tests running, I hit the
      following panic on generic/476:
      
        assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229
        kernel BUG at fs/btrfs/subpage.c:229!
        Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
        CPU: 1 PID: 1453 Comm: fsstress Not tainted 6.4.0-rc7+ #12
        Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20230301gitf80f052277c8-26.fc38 03/01/2023
        pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
        pc : btrfs_subpage_assert+0xbc/0xf0
        lr : btrfs_subpage_assert+0xbc/0xf0
        Call trace:
         btrfs_subpage_assert+0xbc/0xf0
         btrfs_subpage_clear_checked+0x38/0xc0
         btrfs_page_clear_checked+0x48/0x98
         btrfs_truncate_block+0x5d0/0x6a8
         btrfs_cont_expand+0x5c/0x528
         btrfs_write_check.isra.0+0xf8/0x150
         btrfs_buffered_write+0xb4/0x760
         btrfs_do_write_iter+0x2f8/0x4b0
         btrfs_file_write_iter+0x1c/0x30
         do_iter_readv_writev+0xc8/0x158
         do_iter_write+0x9c/0x210
         vfs_iter_write+0x24/0x40
         iter_file_splice_write+0x224/0x390
         direct_splice_actor+0x38/0x68
         splice_direct_to_actor+0x12c/0x260
         do_splice_direct+0x90/0xe8
         generic_copy_file_range+0x50/0x90
         vfs_copy_file_range+0x29c/0x470
         __arm64_sys_copy_file_range+0xcc/0x498
         invoke_syscall.constprop.0+0x80/0xd8
         do_el0_svc+0x6c/0x168
         el0_svc+0x50/0x1b0
         el0t_64_sync_handler+0x114/0x120
         el0t_64_sync+0x194/0x198
      
      This happens because during btrfs_cont_expand we'll get a page, set it
      as mapped, and if it's not Uptodate we'll read it.  However between the
      read and re-locking the page we could have called release_folio() on the
      page, but left the page in the file mapping.  release_folio() can clear
      the page private, and thus further down we blow up when we go to modify
      the subpage bits.
      
      Fix this by putting the set_page_extent_mapped() after the read.  This
      is safe because read_folio() will call set_page_extent_mapped() before
      it does the read, and then if we clear page private but leave it on the
      mapping we're completely safe re-setting set_page_extent_mapped().  With
      this patch I can now run generic/476 without panicking.
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: always verify the P/Q contents for scrub · 486c737f
      Qu Wenruo authored
      [REGRESSION]
      Commit 75b47033 ("btrfs: raid56: migrate recovery and scrub recovery
      path to use error_bitmap") changed the behavior of scrub_rbio().
      
      Initially if we have no error reading the raid bio, we will assign
      @need_check to true, then finish_parity_scrub() would later verify the
      content of P/Q stripes before writeback.
      
      But after that commit we never verify the content of P/Q stripes and
      just write them back.
      
      This can lead to unrepaired P/Q stripes during scrub, or already
      corrupted P/Q copied to the dev-replace target.
      
      [FIX]
      The situation is more complex than the regression; in fact, the initial
      behavior is not 100% correct either.
      
      If we have the following rare case, it can still lead to the same
      problem using the old behavior:
      
      		0	16K	32K	48K	64K
      	Data 1:	|IIIIIII|                       |
      	Data 2:	|				|
      	Parity:	|	|CCCCCCC|		|
      
      Where "I" means IO error, "C" means corruption.
      
      In the above case, we're scrubbing the parity stripe, so we read out all
      the contents of the Data 1, Data 2 and Parity stripes.

      But we find an IO error in Data 1, which leads to a rebuild using Data 2
      and Parity, and we get the correct data.
      
      In that case, we would not verify if the Parity is correct for range
      [16K, 32K).
      
      So here we have to always verify the content of Parity no matter if we
      did recovery or not.
      
      This patch would remove the @need_check parameter of
      finish_parity_scrub() completely, and would always do the P/Q
      verification before writeback.
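
As a rough userspace illustration of "always verify before writeback", the sketch below recomputes an XOR parity (P only; Q is not modelled) from the data sectors and compares it with the stored parity on every call, with no flag to skip the check. The names `compute_parity()`, `verify_and_repair_parity()` and the tiny `SECTOR_SIZE` are assumptions made for this example, not the raid56 code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 16 /* tiny sector size, for illustration only */

/* Compute P as the XOR of all data sectors. */
void compute_parity(const uint8_t *const *data, size_t nr_data,
                    uint8_t *parity)
{
    memset(parity, 0, SECTOR_SIZE);
    for (size_t i = 0; i < nr_data; i++)
        for (size_t j = 0; j < SECTOR_SIZE; j++)
            parity[j] ^= data[i][j];
}

/*
 * Always recompute and compare; there is no "need_check" style parameter.
 * Returns 1 if the stored parity was wrong and has been rewritten,
 * 0 if it already matched.
 */
int verify_and_repair_parity(const uint8_t *const *data, size_t nr_data,
                             uint8_t *stored_parity)
{
    uint8_t expected[SECTOR_SIZE];

    compute_parity(data, nr_data, expected);
    if (memcmp(expected, stored_parity, SECTOR_SIZE) == 0)
        return 0;

    memcpy(stored_parity, expected, SECTOR_SIZE);
    return 1;
}
```

The point of the design is that the check runs whether or not a data stripe had to be rebuilt, so a corrupted parity range like [16K, 32K) in the diagram above cannot be skipped.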
      
      Fixes: 75b47033 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap")
      CC: stable@vger.kernel.org # 6.2+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use irq safe locking when running and adding delayed iputs · 866e98a4
      Filipe Manana authored
      Running delayed iputs, which never happens in an irq context, needs to
      lock the spinlock fs_info->delayed_iput_lock. When finishing bios for
      data writes (irq context, bio.c) we call btrfs_put_ordered_extent() which
      needs to add a delayed iput and for that it needs to acquire the spinlock
      fs_info->delayed_iput_lock. Without disabling irqs when running delayed
      iputs we can therefore deadlock on that spinlock. The same deadlock can
      also happen when adding an inode to the delayed iputs list, since this
      can be done outside an irq context as well.
      
      Syzbot recently reported this, which results in the following trace:
      
        ================================
        WARNING: inconsistent lock state
        6.4.0-syzkaller-09904-ga507db1d #0 Not tainted
        --------------------------------
        inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
        btrfs-cleaner/16079 [HC0[0]:SC0[0]:HE1:SE1] takes:
        ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:350 [inline]
        ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523
        {IN-SOFTIRQ-W} state was registered at:
          lock_acquire kernel/locking/lockdep.c:5761 [inline]
          lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726
          __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
          _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
          spin_lock include/linux/spinlock.h:350 [inline]
          btrfs_add_delayed_iput+0x128/0x390 fs/btrfs/inode.c:3490
          btrfs_put_ordered_extent fs/btrfs/ordered-data.c:559 [inline]
          btrfs_put_ordered_extent+0x2f6/0x610 fs/btrfs/ordered-data.c:547
          __btrfs_bio_end_io fs/btrfs/bio.c:118 [inline]
          __btrfs_bio_end_io+0x136/0x180 fs/btrfs/bio.c:112
          btrfs_orig_bbio_end_io+0x86/0x2b0 fs/btrfs/bio.c:163
          btrfs_simple_end_io+0x105/0x380 fs/btrfs/bio.c:378
          bio_endio+0x589/0x690 block/bio.c:1617
          req_bio_endio block/blk-mq.c:766 [inline]
          blk_update_request+0x5c5/0x1620 block/blk-mq.c:911
          blk_mq_end_request+0x59/0x680 block/blk-mq.c:1032
          lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370
          blk_complete_reqs+0xb3/0xf0 block/blk-mq.c:1110
          __do_softirq+0x1d4/0x905 kernel/softirq.c:553
          run_ksoftirqd kernel/softirq.c:921 [inline]
          run_ksoftirqd+0x31/0x60 kernel/softirq.c:913
          smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164
          kthread+0x344/0x440 kernel/kthread.c:389
          ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
        irq event stamp: 39
        hardirqs last  enabled at (39): [<ffffffff81d5ebc4>] __do_kmem_cache_free mm/slab.c:3558 [inline]
        hardirqs last  enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free mm/slab.c:3582 [inline]
        hardirqs last  enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free+0x244/0x370 mm/slab.c:3575
        hardirqs last disabled at (38): [<ffffffff81d5eb5e>] __do_kmem_cache_free mm/slab.c:3553 [inline]
        hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free mm/slab.c:3582 [inline]
        hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free+0x1de/0x370 mm/slab.c:3575
        softirqs last  enabled at (0): [<ffffffff814ac99f>] copy_process+0x227f/0x75c0 kernel/fork.c:2448
        softirqs last disabled at (0): [<0000000000000000>] 0x0
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(&fs_info->delayed_iput_lock);
          <Interrupt>
            lock(&fs_info->delayed_iput_lock);
      
         *** DEADLOCK ***
      
        1 lock held by btrfs-cleaner/16079:
         #0: ffff888107804860 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x103/0x4b0 fs/btrfs/disk-io.c:1463
      
        stack backtrace:
        CPU: 3 PID: 16079 Comm: btrfs-cleaner Not tainted 6.4.0-syzkaller-09904-ga507db1d #0
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
        Call Trace:
         <TASK>
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
         print_usage_bug kernel/locking/lockdep.c:3978 [inline]
         valid_state kernel/locking/lockdep.c:4020 [inline]
         mark_lock_irq kernel/locking/lockdep.c:4223 [inline]
         mark_lock.part.0+0x1102/0x1960 kernel/locking/lockdep.c:4685
         mark_lock kernel/locking/lockdep.c:4649 [inline]
         mark_usage kernel/locking/lockdep.c:4598 [inline]
         __lock_acquire+0x8e4/0x5e20 kernel/locking/lockdep.c:5098
         lock_acquire kernel/locking/lockdep.c:5761 [inline]
         lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726
         __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
         _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
         spin_lock include/linux/spinlock.h:350 [inline]
         btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523
         cleaner_kthread+0x2e5/0x4b0 fs/btrfs/disk-io.c:1478
         kthread+0x344/0x440 kernel/kthread.c:389
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
         </TASK>
      
      So fix this by using spin_lock_irq() and spin_unlock_irq() when running
      delayed iputs, and using spin_lock_irqsave() and spin_unlock_irqrestore()
      when adding a delayed iput().
      
      Reported-by: syzbot+da501a04be5ff533b102@syzkaller.appspotmail.com
      Fixes: ec63b84d ("btrfs: add an ordered_extent pointer to struct btrfs_bio")
      Link: https://lore.kernel.org/linux-btrfs/000000000000d5c89a05ffbd39dd@google.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix iput() on error pointer after error during orphan cleanup · cbaee87f
      Filipe Manana authored
      At btrfs_orphan_cleanup(), if we can't find an inode (btrfs_iget() returns
      an -ENOENT error pointer), we proceed with 'ret' set to -ENOENT and the
      inode pointer set to ERR_PTR(-ENOENT). Later when we proceed to the body
      of the following if statement:
      
          if (ret == -ENOENT || inode->i_nlink) {
              (...)
              trans = btrfs_start_transaction(root, 1);
              if (IS_ERR(trans)) {
                  ret = PTR_ERR(trans);
                  iput(inode);
                  goto out;
              }
              (...)
              ret = btrfs_del_orphan_item(trans, root,
                                          found_key.objectid);
              btrfs_end_transaction(trans);
              if (ret) {
                  iput(inode);
                  goto out;
              }
              continue;
          }
      
      If we get an error from btrfs_start_transaction() or from the call to
      btrfs_del_orphan_item() we end up calling iput() against an inode pointer
      that has a value of ERR_PTR(-ENOENT), resulting in a crash with the
      following trace:
      
        [876.667] BUG: kernel NULL pointer dereference, address: 0000000000000096
        [876.667] #PF: supervisor read access in kernel mode
        [876.667] #PF: error_code(0x0000) - not-present page
        [876.667] PGD 0 P4D 0
        [876.668] Oops: 0000 [#1] PREEMPT SMP PTI
        [876.668] CPU: 0 PID: 2356187 Comm: mount Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
        [876.668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
        [876.668] RIP: 0010:iput+0xa/0x20
        [876.668] Code: ff ff ff 66 (...)
        [876.669] RSP: 0018:ffffafa9c0c9f9d0 EFLAGS: 00010282
        [876.669] RAX: ffffffffffffffe4 RBX: 000000000009453b RCX: 0000000000000000
        [876.669] RDX: 0000000000000001 RSI: ffffafa9c0c9f930 RDI: fffffffffffffffe
        [876.669] RBP: ffff95c612f3b800 R08: 0000000000000001 R09: ffffffffffffffe4
        [876.670] R10: 00018f2a71010000 R11: 000000000ead96e3 R12: ffff95cb7d6909a0
        [876.670] R13: fffffffffffffffe R14: ffff95c60f477000 R15: 00000000ffffffe4
        [876.670] FS:  00007f5fbe30a840(0000) GS:ffff95ccdfa00000(0000) knlGS:0000000000000000
        [876.670] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [876.671] CR2: 0000000000000096 CR3: 000000055e9f6004 CR4: 0000000000370ef0
        [876.671] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [876.671] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [876.672] Call Trace:
        [876.744]  <TASK>
        [876.744]  ? __die_body+0x1b/0x60
        [876.744]  ? page_fault_oops+0x15d/0x450
        [876.745]  ? __kmem_cache_alloc_node+0x47/0x410
        [876.745]  ? do_user_addr_fault+0x65/0x8a0
        [876.745]  ? exc_page_fault+0x74/0x170
        [876.746]  ? asm_exc_page_fault+0x22/0x30
        [876.746]  ? iput+0xa/0x20
        [876.746]  btrfs_orphan_cleanup+0x221/0x330 [btrfs]
        [876.746]  btrfs_lookup_dentry+0x58f/0x5f0 [btrfs]
        [876.747]  btrfs_lookup+0xe/0x30 [btrfs]
        [876.747]  __lookup_slow+0x82/0x130
        [876.785]  walk_component+0xe5/0x160
        [876.786]  path_lookupat.isra.0+0x6e/0x150
        [876.786]  filename_lookup+0xcf/0x1a0
        [876.786]  ? mod_objcg_state+0xd2/0x360
        [876.786]  ? obj_cgroup_charge+0xf5/0x110
        [876.787]  ? should_failslab+0xa/0x20
        [876.787]  ? kmem_cache_alloc+0x47/0x450
        [876.787]  vfs_path_lookup+0x51/0x90
        [876.788]  mount_subtree+0x8d/0x130
        [876.788]  btrfs_mount+0x149/0x410 [btrfs]
        [876.788]  ? __kmem_cache_alloc_node+0x47/0x410
        [876.788]  ? vfs_parse_fs_param+0xc0/0x110
        [876.789]  legacy_get_tree+0x24/0x50
        [876.834]  vfs_get_tree+0x22/0xd0
        [876.852]  path_mount+0x2d8/0x9c0
        [876.852]  do_mount+0x79/0x90
        [876.852]  __x64_sys_mount+0x8e/0xd0
        [876.853]  do_syscall_64+0x38/0x90
        [876.899]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
        [876.958] RIP: 0033:0x7f5fbe50b76a
        [876.959] Code: 48 8b 0d a9 (...)
        [876.959] RSP: 002b:00007fff01925798 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
        [876.959] RAX: ffffffffffffffda RBX: 00007f5fbe694264 RCX: 00007f5fbe50b76a
        [876.960] RDX: 0000561bde6c8720 RSI: 0000561bde6bdec0 RDI: 0000561bde6c31a0
        [876.960] RBP: 0000561bde6bdc70 R08: 0000000000000000 R09: 0000000000000001
        [876.960] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        [876.960] R13: 0000561bde6c31a0 R14: 0000561bde6c8720 R15: 0000561bde6bdc70
        [876.960]  </TASK>
      
      So fix this by setting 'inode' to NULL whenever we get an error from
      btrfs_iget(), and to make the code simpler, stop testing for 'ret' being
      -ENOENT to check if we have an inode - instead test for 'inode' being NULL
      or not. Having a NULL 'inode' prevents any iput() call from crashing, as
      iput() ignores NULL inode pointers. Also, stop testing for a NULL return
      value from btrfs_iget() with PTR_ERR_OR_ZERO(), because btrfs_iget() never
      returns NULL - in case an inode is not found, it returns ERR_PTR(-ENOENT),
      and in case of memory allocation failure, it returns ERR_PTR(-ENOMEM).
      We also don't need the extra iput() calls on the error branches for the
      btrfs_start_transaction() and btrfs_del_orphan_item() calls, as we have
      already called iput() before, so remove them.
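
The pattern described above can be sketched in userspace as follows. The `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()` helpers are re-implemented here for illustration, following the kernel convention that the top 4095 addresses encode errno values. `struct obj`, `obj_lookup()`, `obj_put()` and `obj_get_checked()` are hypothetical names standing in for the inode, btrfs_iget(), iput() and the caller's error handling.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace re-implementations of the kernel's error pointer helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct obj {
    int refs;
};

/* Hypothetical lookup: returns an error pointer on failure, never NULL. */
struct obj *obj_lookup(struct obj *candidate, int present)
{
    if (!present)
        return ERR_PTR(-ENOENT);
    candidate->refs++;
    return candidate;
}

/* Like iput(): tolerates NULL, but must never see an error pointer. */
void obj_put(struct obj *o)
{
    if (!o)
        return;
    o->refs--;
}

/*
 * Normalize the lookup result: on failure *out is NULL, so every later
 * obj_put(*out) is a harmless no-op instead of a bad dereference.
 */
int obj_get_checked(struct obj *candidate, int present, struct obj **out)
{
    struct obj *o = obj_lookup(candidate, present);

    if (IS_ERR(o)) {
        *out = NULL;
        return (int)PTR_ERR(o);
    }
    *out = o;
    return 0;
}
```

After `obj_get_checked()` the caller can test the pointer itself rather than the return code, which mirrors the simplification the commit message describes.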
      
      Fixes: a13bb2c0 ("btrfs: add missing iputs on orphan cleanup failure")
      CC: stable@vger.kernel.org # 6.4
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix double iput() on inode after an error during orphan cleanup · b777d279
      Filipe Manana authored
      At btrfs_orphan_cleanup(), if we were able to find the inode, we do an
      iput() on the inode, then if btrfs_drop_verity_items() succeeds and then
      either btrfs_start_transaction() or btrfs_del_orphan_item() fail, we do
      another iput() in the respective error paths, resulting in an extra iput()
      on the inode.
      
      Fix this by setting inode to NULL after the first iput(), as iput()
      ignores a NULL inode pointer argument.
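
A small userspace sketch of the same idea, assuming a hypothetical reference-counted `struct obj`, an `obj_put()` that ignores NULL like iput() does, and a `cleanup_one()` helper whose `fail_late` argument simulates a failure after the first put:

```c
#include <stddef.h>

struct obj {
    int refs;
};

/* Like iput(): a NULL argument is ignored. */
void obj_put(struct obj *o)
{
    if (!o)
        return;
    o->refs--;
}

/*
 * Drops the caller's single reference on 'o'. 'fail_late' simulates an
 * error in a later step whose error path also calls obj_put().
 */
int cleanup_one(struct obj *o, int fail_late)
{
    obj_put(o);
    o = NULL; /* our reference is gone, so later puts become no-ops */

    if (fail_late) {
        obj_put(o); /* error path: harmless now that 'o' is NULL */
        return -1;
    }
    return 0;
}
```

Without the `o = NULL` assignment the error path would decrement the count a second time.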
      
      Fixes: a13bb2c0 ("btrfs: add missing iputs on orphan cleanup failure")
      CC: stable@vger.kernel.org # 6.4
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: fix memory leak after finding block group with super blocks · f1a07c2b
      Filipe Manana authored
      At exclude_super_stripes(), if we happen to find a block group that has
      super blocks mapped to it and we are on a zoned filesystem, we error out
      as this is not supposed to happen, indicating either a bug or maybe some
      memory corruption for example. However we are exiting the function without
      freeing the memory allocated for the logical address of the super blocks.
      Fix this by freeing the logical address.
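
The shape of this kind of fix can be sketched in userspace as below. Everything here is hypothetical: `map_super_addresses()` stands in for the helper that allocates the logical addresses, `exclude_stripes()` for the caller, `-EINVAL` is a placeholder error code, and the `live_allocs` counter exists only so the example can check that nothing is leaked.

```c
#include <errno.h>
#include <stdlib.h>

/* Number of live allocations, tracked only to make leaks observable. */
int live_allocs;

static void *tracked_alloc(size_t size)
{
    void *p = malloc(size);

    if (p)
        live_allocs++;
    return p;
}

static void tracked_free(void *p)
{
    if (p)
        live_allocs--;
    free(p);
}

/* Hypothetical helper: hands back an allocated array of 'nr' addresses. */
static int map_super_addresses(int nr, unsigned long long **logical)
{
    size_t count = nr > 0 ? (size_t)nr : 1;
    unsigned long long *buf = tracked_alloc(count * sizeof(*buf));

    if (!buf)
        return -ENOMEM;
    for (int i = 0; i < nr; i++)
        buf[i] = 4096ULL * (unsigned long long)i;
    *logical = buf;
    return 0;
}

int exclude_stripes(int nr, int zoned)
{
    unsigned long long *logical;
    int ret = map_super_addresses(nr, &logical);

    if (ret)
        return ret;

    if (zoned && nr > 0) {
        /* Early error exit: the buffer must be freed here too. */
        tracked_free(logical);
        return -EINVAL; /* placeholder error code */
    }

    /* ... the addresses would be consumed here ... */
    tracked_free(logical);
    return 0;
}
```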
      
      Fixes: 12659251 ("btrfs: implement log-structured superblock for ZONED mode")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 11 Jul, 2023 3 commits
    • btrfs: fix use-after-free of new block group that became unused · 0657b20c
      Filipe Manana authored
      If a task creates a new block group and that block group becomes unused
      before we finish its creation, at btrfs_create_pending_block_groups(),
      then when btrfs_mark_bg_unused() is called against the block group, we
      assume that the block group is currently in the list of block groups to
      reclaim, and we move it out of the list of new block groups and into the
      list of unused block groups. This has two consequences:
      
      1) We move it out of the list of new block groups associated with the
         current transaction. So the block group creation is not finished and
         if we attempt to delete the bg because it's unused, we will not find
         the block group item in the extent tree (or the new block group tree),
         its device extent items in the device tree etc, causing the deletion
         to fail due to the missing items;
      
      2) We don't increment the reference count on the block group when we
         move it to the list of unused block groups, because we assumed the
         block group was on the list of block groups to reclaim, and in that
         case it already has the correct reference count. However the block
         group was on the list of new block groups, in which case no extra
         reference was taken because it's local to the current task. This
         later results in doing an extra reference count decrement when
         removing the block group from the unused list, eventually leading the
         reference count to 0.
      
      This second case was caught when running generic/297 from fstests, which
      produced the following assertion failure and stack trace:
      
        [589.559] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4299
        [589.559] ------------[ cut here ]------------
        [589.559] kernel BUG at fs/btrfs/block-group.c:4299!
        [589.560] invalid opcode: 0000 [#1] PREEMPT SMP PTI
        [589.560] CPU: 8 PID: 2819134 Comm: umount Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
        [589.560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
        [589.560] RIP: 0010:btrfs_free_block_groups+0x449/0x4a0 [btrfs]
        [589.561] Code: 68 62 da c0 (...)
        [589.561] RSP: 0018:ffffa55a8c3b3d98 EFLAGS: 00010246
        [589.561] RAX: 0000000000000058 RBX: ffff8f030d7f2000 RCX: 0000000000000000
        [589.562] RDX: 0000000000000000 RSI: ffffffff953f0878 RDI: 00000000ffffffff
        [589.562] RBP: ffff8f030d7f2088 R08: 0000000000000000 R09: ffffa55a8c3b3c50
        [589.562] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f05850b4c00
        [589.562] R13: ffff8f030d7f2090 R14: ffff8f05850b4cd8 R15: dead000000000100
        [589.563] FS:  00007f497fd2e840(0000) GS:ffff8f09dfc00000(0000) knlGS:0000000000000000
        [589.563] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [589.563] CR2: 00007f497ff8ec10 CR3: 0000000271472006 CR4: 0000000000370ee0
        [589.563] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [589.564] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [589.564] Call Trace:
        [589.564]  <TASK>
        [589.565]  ? __die_body+0x1b/0x60
        [589.565]  ? die+0x39/0x60
        [589.565]  ? do_trap+0xeb/0x110
        [589.565]  ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
        [589.566]  ? do_error_trap+0x6a/0x90
        [589.566]  ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
        [589.566]  ? exc_invalid_op+0x4e/0x70
        [589.566]  ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
        [589.567]  ? asm_exc_invalid_op+0x16/0x20
        [589.567]  ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
        [589.567]  ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
        [589.567]  close_ctree+0x35d/0x560 [btrfs]
        [589.568]  ? fsnotify_sb_delete+0x13e/0x1d0
        [589.568]  ? dispose_list+0x3a/0x50
        [589.568]  ? evict_inodes+0x151/0x1a0
        [589.568]  generic_shutdown_super+0x73/0x1a0
        [589.569]  kill_anon_super+0x14/0x30
        [589.569]  btrfs_kill_super+0x12/0x20 [btrfs]
        [589.569]  deactivate_locked_super+0x2e/0x70
        [589.569]  cleanup_mnt+0x104/0x160
        [589.570]  task_work_run+0x56/0x90
        [589.570]  exit_to_user_mode_prepare+0x160/0x170
        [589.570]  syscall_exit_to_user_mode+0x22/0x50
        [589.570]  ? __x64_sys_umount+0x12/0x20
        [589.571]  do_syscall_64+0x48/0x90
        [589.571]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
        [589.571] RIP: 0033:0x7f497ff0a567
        [589.571] Code: af 98 0e (...)
        [589.572] RSP: 002b:00007ffc98347358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        [589.572] RAX: 0000000000000000 RBX: 00007f49800b8264 RCX: 00007f497ff0a567
        [589.572] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000557f558abfa0
        [589.573] RBP: 0000557f558a6ba0 R08: 0000000000000000 R09: 00007ffc98346100
        [589.573] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        [589.573] R13: 0000557f558abfa0 R14: 0000557f558a6cb0 R15: 0000557f558a6dd0
        [589.573]  </TASK>
        [589.574] Modules linked in: dm_snapshot dm_thin_pool (...)
        [589.576] ---[ end trace 0000000000000000 ]---
      
      Fix this by adding a runtime flag to the block group to tell that the
      block group is still in the list of new block groups, and therefore it
      should not be moved to the list of unused block groups, at
      btrfs_mark_bg_unused(), until the flag is cleared, when we finish the
      creation of the block group at btrfs_create_pending_block_groups().
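
A simplified userspace model of the flag-based approach is sketched below. It is an assumption-laden illustration, not the btrfs implementation: the lists are reduced to booleans, and `struct block_group`, `mark_unused()`, `finish_creation()` and `remove_from_unused()` are hypothetical names.

```c
#include <stdbool.h>

struct block_group {
    int refs;            /* reference count */
    bool is_new;         /* still on the transaction's list of new groups */
    bool on_unused_list; /* queued for deletion as unused */
};

/* Queue an unused block group, unless its creation has not finished yet. */
void mark_unused(struct block_group *bg)
{
    if (bg->is_new || bg->on_unused_list)
        return;
    bg->refs++; /* the unused list holds its own reference */
    bg->on_unused_list = true;
}

/* Creation finished: the block group may now be queued as unused. */
void finish_creation(struct block_group *bg)
{
    bg->is_new = false;
}

/* Dequeue and drop the reference the unused list was holding. */
void remove_from_unused(struct block_group *bg)
{
    if (!bg->on_unused_list)
        return;
    bg->on_unused_list = false;
    bg->refs--;
}
```

In this model every decrement in `remove_from_unused()` is paired with an increment in `mark_unused()`, so the count cannot drop below the creator's own reference.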
      
      Fixes: a9f18971 ("btrfs: move out now unused BG from the reclaim list")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_block · 4e7de35e
      Christoph Hellwig authored
      The mirror_num_ret is allowed to be NULL, although it has to be set when
      smap is set.  Unfortunately that is not a well enough specifiable
      invariant for static type checkers, so add a NULL check to make sure they
      are fine.
      
      Fixes: 03793cbb ("btrfs: add fast path for single device io in __btrfs_map_block")
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix race between balance and cancel/pause · b19c98f2
      Josef Bacik authored
      Syzbot reported a panic that looks like this:
      
        assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE_PAUSED, in fs/btrfs/ioctl.c:465
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/messages.c:259!
        RIP: 0010:btrfs_assertfail+0x2c/0x30 fs/btrfs/messages.c:259
        Call Trace:
         <TASK>
         btrfs_exclop_balance fs/btrfs/ioctl.c:465 [inline]
         btrfs_ioctl_balance fs/btrfs/ioctl.c:3564 [inline]
         btrfs_ioctl+0x531e/0x5b30 fs/btrfs/ioctl.c:4632
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:870 [inline]
         __se_sys_ioctl fs/ioctl.c:856 [inline]
         __x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The reproducer is running a balance and a cancel or pause in parallel.
      The way balance finishes is a bit wonky: if we were paused we need to
      save the balance_ctl in the fs_info, but otherwise we clear it and clean
      up. However we rely on the return values being specific errors, or on
      having a cancel request or no pause request.  If balance completes and
      returns 0, but we have a pause or cancel request, we won't do the
      appropriate cleanup, and then the next time we try to start a balance
      we'll trip this ASSERT.
      
      The error handling is just wrong here: we always want to clean up,
      unless we got -ECANCELED and we set the appropriate pause flag in the
      exclusive op.  With this patch the reproducer ran for an hour without
      tripping; previously it would trip within a few minutes.
      
      Reported-by: syzbot+c0f3acf145cb465426d5@syzkaller.appspotmail.com
      CC: stable@vger.kernel.org # 6.1+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b19c98f2
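      The cleanup rule described in the balance/cancel/pause fix can be
      illustrated with a small userspace sketch. This is not the kernel code:
      the enum, the struct and finish_balance_sketch() are hypothetical names
      made up for illustration, and the pause request is modelled as a plain
      boolean. It only shows the decision "keep the state when we got
      -ECANCELED because of a pause, otherwise always clean up".

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the exclusive operation states. */
enum exclop_sketch { EXCLOP_NONE, EXCLOP_BALANCE, EXCLOP_BALANCE_PAUSED };

/* Hypothetical, simplified view of the state balance leaves behind. */
struct balance_state {
    enum exclop_sketch exclusive_operation;
    bool balance_ctl_saved; /* true while a balance control is kept around */
};

/*
 * Decide what to do when balance returns 'ret'.
 * Only a pause (-ECANCELED plus a pending pause request) keeps the
 * balance control; every other outcome, including ret == 0 with a
 * pending pause or cancel request, goes through cleanup.
 */
static void finish_balance_sketch(struct balance_state *st, int ret,
                                  bool pause_requested)
{
    if (ret == -ECANCELED && pause_requested) {
        st->exclusive_operation = EXCLOP_BALANCE_PAUSED;
        st->balance_ctl_saved = true;
        return;
    }
    st->exclusive_operation = EXCLOP_NONE;
    st->balance_ctl_saved = false;
}
```

      With this rule, a balance that completes with 0 while a pause request is
      pending ends in the "none" state, so a later balance does not find stale
      state.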
  3. 19 Jun, 2023 29 commits
    • btrfs: fix race between quota disable and relocation · 8a4a0b2a
      Filipe Manana authored
      If we disable quotas while we have a relocation of a metadata block group
      that has extents belonging to the quota root, we can cause the relocation
      to fail with -ENOENT. This is because relocation builds backref nodes for
      extents of the quota root and later needs to walk the backrefs and access
      the quota root. However, if a task disables quotas in between, the quota
      root is deleted from the root tree (with btrfs_del_root(), called from
      btrfs_quota_disable()).
      
      This can be sporadically triggered by test case btrfs/255 from fstests:
      
        $ ./check btrfs/255
        FSTYP         -- btrfs
        PLATFORM      -- Linux/x86_64 debian0 6.4.0-rc6-btrfs-next-134+ #1 SMP PREEMPT_DYNAMIC Thu Jun 15 11:59:28 WEST 2023
        MKFS_OPTIONS  -- /dev/sdc
        MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
      
        btrfs/255 6s ... _check_dmesg: something found in dmesg (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.dmesg)
        - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad)
            --- tests/btrfs/255.out	2023-03-02 21:47:53.876609426 +0000
            +++ /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad	2023-06-16 10:20:39.267563212 +0100
            @@ -1,2 +1,4 @@
             QA output created by 255
            +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No such file or directory
            +There may be more info in syslog - try dmesg | tail
             Silence is golden
            ...
            (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/255.out /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad'  to see the entire diff)
        Ran: btrfs/255
        Failures: btrfs/255
        Failed 1 of 1 tests
      
      To fix this, make the quota disable operation take the cleaner mutex, as
      relocation of a block group also takes this mutex. This is also what we
      do when deleting a subvolume/snapshot: we take the cleaner mutex in the
      cleaner kthread (at cleaner_kthread()) and then we call btrfs_del_root()
      at btrfs_drop_snapshot() while under the protection of the cleaner mutex.
      
      Fixes: bed92eae ("Btrfs: qgroup implementation and prototypes")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8a4a0b2a
    • btrfs: add comment to struct btrfs_fs_info::dirty_cowonly_roots · 08eb2ad9
      Filipe Manana authored
      Add a comment to struct btrfs_fs_info::dirty_cowonly_roots to mention
      that struct btrfs_fs_info::trans_lock is the lock that protects that
      list.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      08eb2ad9
    • btrfs: fix race when deleting free space root from the dirty cow roots list · babebf02
      Filipe Manana authored
      When deleting the free space tree we are deleting the free space root
      from the list fs_info->dirty_cowonly_roots without taking the lock that
      protects it, which is struct btrfs_fs_info::trans_lock.
      This unsynchronized list manipulation may cause chaos if there's another
      concurrent manipulation of this list, such as when adding a root to it
      with ctree.c:add_root_to_dirty_list().
      
      This can result in all sorts of weird failures caused by a race, such as
      the following crash:
      
        [337571.278245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI
        [337571.278933] CPU: 1 PID: 115447 Comm: btrfs Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
        [337571.279153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [337571.279572] RIP: 0010:commit_cowonly_roots+0x11f/0x250 [btrfs]
        [337571.279928] Code: 85 38 06 00 (...)
        [337571.280363] RSP: 0018:ffff9f63446efba0 EFLAGS: 00010206
        [337571.280582] RAX: ffff942d98ec2638 RBX: ffff9430b82b4c30 RCX: 0000000449e1c000
        [337571.280798] RDX: dead000000000100 RSI: ffff9430021e4900 RDI: 0000000000036070
        [337571.281015] RBP: ffff942d98ec2000 R08: ffff942d98ec2000 R09: 000000000000015b
        [337571.281254] R10: 0000000000000009 R11: 0000000000000001 R12: ffff942fe8fbf600
        [337571.281476] R13: ffff942dabe23040 R14: ffff942dabe20800 R15: ffff942d92cf3b48
        [337571.281723] FS:  00007f478adb7340(0000) GS:ffff94349fa40000(0000) knlGS:0000000000000000
        [337571.281950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [337571.282184] CR2: 00007f478ab9a3d5 CR3: 000000001e02c001 CR4: 0000000000370ee0
        [337571.282416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [337571.282647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [337571.282874] Call Trace:
        [337571.283101]  <TASK>
        [337571.283327]  ? __die_body+0x1b/0x60
        [337571.283570]  ? die_addr+0x39/0x60
        [337571.283796]  ? exc_general_protection+0x22e/0x430
        [337571.284022]  ? asm_exc_general_protection+0x22/0x30
        [337571.284251]  ? commit_cowonly_roots+0x11f/0x250 [btrfs]
        [337571.284531]  btrfs_commit_transaction+0x42e/0xf90 [btrfs]
        [337571.284803]  ? _raw_spin_unlock+0x15/0x30
        [337571.285031]  ? release_extent_buffer+0x103/0x130 [btrfs]
        [337571.285305]  reset_balance_state+0x152/0x1b0 [btrfs]
        [337571.285578]  btrfs_balance+0xa50/0x11e0 [btrfs]
        [337571.285864]  ? __kmem_cache_alloc_node+0x14a/0x410
        [337571.286086]  btrfs_ioctl+0x249a/0x3320 [btrfs]
        [337571.286358]  ? mod_objcg_state+0xd2/0x360
        [337571.286577]  ? refill_obj_stock+0xb0/0x160
        [337571.286798]  ? seq_release+0x25/0x30
        [337571.287016]  ? __rseq_handle_notify_resume+0x3ba/0x4b0
        [337571.287235]  ? percpu_counter_add_batch+0x2e/0xa0
        [337571.287455]  ? __x64_sys_ioctl+0x88/0xc0
        [337571.287675]  __x64_sys_ioctl+0x88/0xc0
        [337571.287901]  do_syscall_64+0x38/0x90
        [337571.288126]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
        [337571.288352] RIP: 0033:0x7f478aaffe9b
      
      So fix this by locking struct btrfs_fs_info::trans_lock before deleting
      the free space root from that list.
      
      Fixes: a5ed9182 ("Btrfs: implement the free space B-tree")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      babebf02
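      The idea behind the fix, taking the lock that protects the list before
      unlinking an entry, can be sketched in userspace C. Everything here is
      an assumption made for illustration: the struct and function names are
      hypothetical, and a C11 atomic_flag spinlock stands in for the lock; the
      kernel uses its own list and locking primitives.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Minimal circular doubly linked list node (illustrative only). */
struct node_sketch {
    struct node_sketch *prev, *next;
};

/* Hypothetical stand-in for the state that owns the dirty roots list. */
struct fs_sketch {
    atomic_flag trans_lock;         /* spinlock protecting dirty_roots */
    struct node_sketch dirty_roots; /* list head */
};

static void node_init(struct node_sketch *n)
{
    n->prev = n->next = n; /* a self-linked node is "not on any list" */
}

static void fs_init(struct fs_sketch *fs)
{
    atomic_flag_clear(&fs->trans_lock);
    node_init(&fs->dirty_roots);
}

static void lock_trans(struct fs_sketch *fs)
{
    /* Spin until we are the one who set the flag. */
    while (atomic_flag_test_and_set_explicit(&fs->trans_lock,
                                             memory_order_acquire))
        ;
}

static void unlock_trans(struct fs_sketch *fs)
{
    atomic_flag_clear_explicit(&fs->trans_lock, memory_order_release);
}

/* Add at the tail, under the lock, unless the node is already linked. */
static void dirty_list_add(struct fs_sketch *fs, struct node_sketch *n)
{
    lock_trans(fs);
    if (n->next == n) {
        n->prev = fs->dirty_roots.prev;
        n->next = &fs->dirty_roots;
        fs->dirty_roots.prev->next = n;
        fs->dirty_roots.prev = n;
    }
    unlock_trans(fs);
}

/* Unlink under the same lock, so adds and deletes never interleave. */
static void dirty_list_del(struct fs_sketch *fs, struct node_sketch *n)
{
    lock_trans(fs);
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);
    unlock_trans(fs);
}

static size_t dirty_list_len(struct fs_sketch *fs)
{
    size_t len = 0;

    lock_trans(fs);
    for (struct node_sketch *p = fs->dirty_roots.next;
         p != &fs->dirty_roots; p = p->next)
        len++;
    unlock_trans(fs);
    return len;
}
```

      The point of the sketch is that the delete path takes the same lock as
      the add path; an unlocked delete racing with an add could leave the
      prev/next pointers inconsistent.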
    • btrfs: fix race when deleting quota root from the dirty cow roots list · b31cb5a6
      Filipe Manana authored
      When disabling quotas we are deleting the quota root from the list
      fs_info->dirty_cowonly_roots without taking the lock that protects it,
      which is struct btrfs_fs_info::trans_lock. This unsynchronized list
      manipulation may cause chaos if there's another concurrent manipulation
      of this list, such as when adding a root to it with
      ctree.c:add_root_to_dirty_list().
      
      This can result in all sorts of weird failures caused by a race, such as
      the following crash:
      
        [337571.278245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI
        [337571.278933] CPU: 1 PID: 115447 Comm: btrfs Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
        [337571.279153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [337571.279572] RIP: 0010:commit_cowonly_roots+0x11f/0x250 [btrfs]
        [337571.279928] Code: 85 38 06 00 (...)
        [337571.280363] RSP: 0018:ffff9f63446efba0 EFLAGS: 00010206
        [337571.280582] RAX: ffff942d98ec2638 RBX: ffff9430b82b4c30 RCX: 0000000449e1c000
        [337571.280798] RDX: dead000000000100 RSI: ffff9430021e4900 RDI: 0000000000036070
        [337571.281015] RBP: ffff942d98ec2000 R08: ffff942d98ec2000 R09: 000000000000015b
        [337571.281254] R10: 0000000000000009 R11: 0000000000000001 R12: ffff942fe8fbf600
        [337571.281476] R13: ffff942dabe23040 R14: ffff942dabe20800 R15: ffff942d92cf3b48
        [337571.281723] FS:  00007f478adb7340(0000) GS:ffff94349fa40000(0000) knlGS:0000000000000000
        [337571.281950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [337571.282184] CR2: 00007f478ab9a3d5 CR3: 000000001e02c001 CR4: 0000000000370ee0
        [337571.282416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [337571.282647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [337571.282874] Call Trace:
        [337571.283101]  <TASK>
        [337571.283327]  ? __die_body+0x1b/0x60
        [337571.283570]  ? die_addr+0x39/0x60
        [337571.283796]  ? exc_general_protection+0x22e/0x430
        [337571.284022]  ? asm_exc_general_protection+0x22/0x30
        [337571.284251]  ? commit_cowonly_roots+0x11f/0x250 [btrfs]
        [337571.284531]  btrfs_commit_transaction+0x42e/0xf90 [btrfs]
        [337571.284803]  ? _raw_spin_unlock+0x15/0x30
        [337571.285031]  ? release_extent_buffer+0x103/0x130 [btrfs]
        [337571.285305]  reset_balance_state+0x152/0x1b0 [btrfs]
        [337571.285578]  btrfs_balance+0xa50/0x11e0 [btrfs]
        [337571.285864]  ? __kmem_cache_alloc_node+0x14a/0x410
        [337571.286086]  btrfs_ioctl+0x249a/0x3320 [btrfs]
        [337571.286358]  ? mod_objcg_state+0xd2/0x360
        [337571.286577]  ? refill_obj_stock+0xb0/0x160
        [337571.286798]  ? seq_release+0x25/0x30
        [337571.287016]  ? __rseq_handle_notify_resume+0x3ba/0x4b0
        [337571.287235]  ? percpu_counter_add_batch+0x2e/0xa0
        [337571.287455]  ? __x64_sys_ioctl+0x88/0xc0
        [337571.287675]  __x64_sys_ioctl+0x88/0xc0
        [337571.287901]  do_syscall_64+0x38/0x90
        [337571.288126]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
        [337571.288352] RIP: 0033:0x7f478aaffe9b
      
      So fix this by locking struct btrfs_fs_info::trans_lock before deleting
      the quota root from that list.
      
      Fixes: bed92eae ("Btrfs: qgroup implementation and prototypes")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b31cb5a6
    • btrfs: tracepoints: also show actual number of the outstanding extents · 64425500
      Naohiro Aota authored
      The btrfs_inode_mod_outstanding_extents trace event only shows the delta
      applied to the number of outstanding extents. It would be helpful to also
      see the resulting number of outstanding extents.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      64425500
    • btrfs: update i_version in update_dev_time · c9e561c4
      Jeff Layton authored
      When updating the ctime, we also want to update i_version.
      
      This is just something I noticed by inspection. There is probably no way
      to test this today unless you can somehow get to this inode via nfsd.
      Still, I think it's the right thing to do for consistency's sake.
      
      David Sterba's comment: I don't see anything wrong with setting the
      iversion bit, however I also don't see where this would be useful.
      Agreed with the consistency, otherwise the time is updated when device
      super block is wiped or a device initialized, both are big events so
      missing that due to lack of iversion update seems unlikely. I'll add it
      to the queue, thanks.
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      [ add comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      c9e561c4
    • btrfs: make btrfs_compressed_bioset static · e794203e
      Ben Dooks authored
      The 'btrfs_compressed_bioset' struct isn't exported outside of the
      fs/btrfs/compression.c file, so make it static to fix the following
      sparse warning:
      
      fs/btrfs/compression.c:40:16: warning: symbol 'btrfs_compressed_bioset' was not declared. Should it be static?
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e794203e
    • btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile · 160fe8f6
      Matt Corallo authored
      Callers of `btrfs_reduce_alloc_profile` expect it to return exactly
      one allocation profile flag, and failing to do so may ultimately
      result in a WARN_ON and remount-ro when allocating new blocks, like
      the below transaction abort on 6.1.
      
      `btrfs_reduce_alloc_profile` has two ways of determining the profile,
      first it checks if a conversion balance is currently running and
      uses the profile we're converting to. If no balance is currently
      running, it returns the max-redundancy profile which at least one
      block in the selected block group has.
      
      This works by simply checking each known allocation profile bit in
      redundancy order. However, `btrfs_reduce_alloc_profile` has not been
      updated as new flags have been added - first with the `DUP` profile
      and later with the RAID1C34 profiles.
      
      Because of the way it checks, if we have blocks with different
      profiles and at least one is known, that profile will be selected.
      However, if none are known we may return a flag set with multiple
      allocation profiles set.
      
      This is currently only possible when a balance from one of the three
      unhandled profiles to another of the unhandled profiles is canceled
      after allocating at least one block using the new profile.
      
      In that case, a transaction abort like the below will occur and the
      filesystem will need to be mounted with -o skip_balance to get it
      mounted rw again (but the balance cannot be resumed without a
      similar abort).
      
        [770.648] ------------[ cut here ]------------
        [770.648] BTRFS: Transaction aborted (error -22)
        [770.648] WARNING: CPU: 43 PID: 1159593 at fs/btrfs/extent-tree.c:4122 find_free_extent+0x1d94/0x1e00 [btrfs]
        [770.648] CPU: 43 PID: 1159593 Comm: btrfs Tainted: G        W 6.1.0-0.deb11.7-powerpc64le #1  Debian 6.1.20-2~bpo11+1a~test
        [770.648] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV
        [770.648] NIP:  c00800000f6784fc LR: c00800000f6784f8 CTR: c000000000d746c0
        [770.648] REGS: c000200089afe9a0 TRAP: 0700   Tainted: G        W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
        [770.648] MSR:  9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 28848282  XER: 20040000
        [770.648] CFAR: c000000000135110 IRQMASK: 0
      	    GPR00: c00800000f6784f8 c000200089afec40 c00800000f7ea800 0000000000000026
      	    GPR04: 00000001004820c2 c000200089afea00 c000200089afe9f8 0000000000000027
      	    GPR08: c000200ffbfe7f98 c000000002127f90 ffffffffffffffd8 0000000026d6a6e8
      	    GPR12: 0000000028848282 c000200fff7f3800 5deadbeef0000122 c00000002269d000
      	    GPR16: c0002008c7797c40 c000200089afef17 0000000000000000 0000000000000000
      	    GPR20: 0000000000000000 0000000000000001 c000200008bc5a98 0000000000000001
      	    GPR24: 0000000000000000 c0000003c73088d0 c000200089afef17 c000000016d3a800
      	    GPR28: c0000003c7308800 c00000002269d000 ffffffffffffffea 0000000000000001
        [770.648] NIP [c00800000f6784fc] find_free_extent+0x1d94/0x1e00 [btrfs]
        [770.648] LR [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs]
        [770.648] Call Trace:
        [770.648] [c000200089afec40] [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs] (unreliable)
        [770.648] [c000200089afed30] [c00800000f681398] btrfs_reserve_extent+0x1a0/0x2f0 [btrfs]
        [770.648] [c000200089afeea0] [c00800000f681bf0] btrfs_alloc_tree_block+0x108/0x670 [btrfs]
        [770.648] [c000200089afeff0] [c00800000f66bd68] __btrfs_cow_block+0x170/0x850 [btrfs]
        [770.648] [c000200089aff100] [c00800000f66c58c] btrfs_cow_block+0x144/0x288 [btrfs]
        [770.648] [c000200089aff1b0] [c00800000f67113c] btrfs_search_slot+0x6b4/0xcb0 [btrfs]
        [770.648] [c000200089aff2a0] [c00800000f679f60] lookup_inline_extent_backref+0x128/0x7c0 [btrfs]
        [770.648] [c000200089aff3b0] [c00800000f67b338] lookup_extent_backref+0x70/0x190 [btrfs]
        [770.648] [c000200089aff470] [c00800000f67b54c] __btrfs_free_extent+0xf4/0x1490 [btrfs]
        [770.648] [c000200089aff5a0] [c00800000f67d770] __btrfs_run_delayed_refs+0x328/0x1530 [btrfs]
        [770.648] [c000200089aff740] [c00800000f67ea2c] btrfs_run_delayed_refs+0xb4/0x3e0 [btrfs]
        [770.648] [c000200089aff800] [c00800000f699aa4] btrfs_commit_transaction+0x8c/0x12b0 [btrfs]
        [770.648] [c000200089aff8f0] [c00800000f6dc628] reset_balance_state+0x1c0/0x290 [btrfs]
        [770.648] [c000200089aff9a0] [c00800000f6e2f7c] btrfs_balance+0x1164/0x1500 [btrfs]
        [770.648] [c000200089affb40] [c00800000f6f8e4c] btrfs_ioctl+0x2b54/0x3100 [btrfs]
        [770.648] [c000200089affc80] [c00000000053be14] sys_ioctl+0x794/0x1310
        [770.648] [c000200089affd70] [c00000000002af98] system_call_exception+0x138/0x250
        [770.648] [c000200089affe10] [c00000000000c654] system_call_common+0xf4/0x258
        [770.648] --- interrupt: c00 at 0x7fff94126800
        [770.648] NIP:  00007fff94126800 LR: 0000000107e0b594 CTR: 0000000000000000
        [770.648] REGS: c000200089affe80 TRAP: 0c00   Tainted: G        W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
        [770.648] MSR:  900000000000d033 <SF,HV,EE,PR,ME,IR,DR,RI,LE>  CR: 24002848  XER: 00000000
        [770.648] IRQMASK: 0
      	    GPR00: 0000000000000036 00007fffc9439da0 00007fff94217100 0000000000000003
      	    GPR04: 00000000c4009420 00007fffc9439ee8 0000000000000000 0000000000000000
      	    GPR08: 00000000803c7416 0000000000000000 0000000000000000 0000000000000000
      	    GPR12: 0000000000000000 00007fff9467d120 0000000107e64c9c 0000000107e64d0a
      	    GPR16: 0000000107e64d06 0000000107e64cf1 0000000107e64cc4 0000000107e64c73
      	    GPR20: 0000000107e64c31 0000000107e64bf1 0000000107e64be7 0000000000000000
      	    GPR24: 0000000000000000 00007fffc9439ee0 0000000000000003 0000000000000001
      	    GPR28: 00007fffc943f713 0000000000000000 00007fffc9439ee8 0000000000000000
        [770.648] NIP [00007fff94126800] 0x7fff94126800
        [770.648] LR [0000000107e0b594] 0x107e0b594
        [770.648] --- interrupt: c00
        [770.648] Instruction dump:
        [770.648] 3b00ffe4 e8898828 481175f5 60000000 4bfff4fc 3be00000 4bfff570 3d220000
        [770.648] 7fc4f378 e8698830 4811cd95 e8410018 <0fe00000> f9c10060 f9e10068 fa010070
        [770.648] ---[ end trace 0000000000000000 ]---
        [770.648] BTRFS: error (device dm-2: state A) in find_free_extent_update_loop:4122: errno=-22 unknown
        [770.648] BTRFS info (device dm-2: state EA): forced readonly
        [770.648] BTRFS: error (device dm-2: state EA) in __btrfs_free_extent:3070: errno=-22 unknown
        [770.648] BTRFS error (device dm-2: state EA): failed to run delayed ref for logical 17838685708288 num_bytes 24576 type 184 action 2 ref_mod 1: -22
        [770.648] BTRFS: error (device dm-2: state EA) in btrfs_run_delayed_refs:2144: errno=-22 unknown
        [770.648] BTRFS: error (device dm-2: state EA) in reset_balance_state:3599: errno=-22 unknown
      
      Fixes: 47e6f742 ("btrfs: add support for 3-copy replication (raid1c3)")
      Fixes: 8d6fac00 ("btrfs: add support for 4-copy replication (raid1c4)")
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: Matt Corallo <blnxfsl@bluematt.me>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      160fe8f6
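      The failure mode described in this commit message (a priority walk over
      known profile bits that returns several bits when none of them is known)
      can be reproduced with a small sketch. The flag values, the table order
      and reduce_profile_sketch() are all assumptions for illustration; they
      are not the kernel's bit layout or its actual function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical profile bits; not the kernel's real values. */
#define P_RAID0   (1u << 0)
#define P_RAID1   (1u << 1)
#define P_DUP     (1u << 2)
#define P_RAID10  (1u << 3)
#define P_RAID5   (1u << 4)
#define P_RAID6   (1u << 5)
#define P_RAID1C3 (1u << 6)
#define P_RAID1C4 (1u << 7)

/* Priority table that was never extended for DUP and RAID1C3/RAID1C4. */
static const uint32_t old_order[] = {
    P_RAID6, P_RAID5, P_RAID10, P_RAID1, P_RAID0,
};

/* Extended table; the relative order chosen here is an assumption. */
static const uint32_t new_order[] = {
    P_RAID1C4, P_RAID6, P_RAID1C3, P_RAID5,
    P_RAID10, P_RAID1, P_DUP, P_RAID0,
};

/*
 * Return the first profile from 'order' that is set in 'flags'.
 * If none of the listed profiles is set, the flags come back unchanged,
 * which may mean more than one profile bit is still set.
 */
static uint32_t reduce_profile_sketch(uint32_t flags, const uint32_t *order,
                                      size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (flags & order[i])
            return order[i];
    return flags;
}

/* Count set bits, to check whether exactly one profile was returned. */
static unsigned int count_bits(uint32_t v)
{
    unsigned int c = 0;

    for (; v; v &= v - 1)
        c++;
    return c;
}
```

      With the old table, an input of P_DUP | P_RAID1C3 comes back with both
      bits still set; with the extended table exactly one profile is selected.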
    • btrfs: scrub: remove btrfs_fs_info::scrub_wr_completion_workers · 81db6ae8
      Qu Wenruo authored
      Since the scrub rework introduced by commit 2af2aaf9 ("btrfs:
      scrub: introduce structure for new BTRFS_STRIPE_LEN based interface")
      and later commits, scrub only needs one single workqueue,
      fs_info::scrub_worker.
      
      The scrub_wr_completion_workers workqueue was initially used to handle the
      delayed work after write bios finished.  But the new scrub code does
      submit-and-wait for write bios, thus all the work is done inside the
      scrub_worker.
      
      The last user of fs_info::scrub_wr_completion_workers was removed in
      commit 16f93993 ("btrfs: scrub: remove the old writeback
      infrastructure"), so we can safely remove the workqueue.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      81db6ae8
    • btrfs: scrub: remove scrub_ctx::csum_list member · c2bbc0ba
      Qu Wenruo authored
      Since the rework of scrub introduced by commit 2af2aaf9 ("btrfs:
      scrub: introduce structure for new BTRFS_STRIPE_LEN based interface")
      and later commits, scrub no longer keeps its data checksum inside sctx.
      
      Instead we have scrub_stripe::csums for the checksum of the stripe.
      Thus we can remove the unused scrub_ctx::csum_list member.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c2bbc0ba
    • btrfs: do not BUG_ON after failure to migrate space during truncation · 6822b3f7
      Filipe Manana authored
      During truncation we reserve 2 metadata units when starting a transaction
      (reserved space goes to fs_info->trans_block_rsv) and then attempt to
      migrate 1 unit (min_size bytes) from fs_info->trans_block_rsv into the
      local block reserve. If we ever fail we trigger a BUG_ON(), which should
      never happen, because we reserved 2 units. However, if we happen to fail
      for some reason, we don't need to be so dire and hit a BUG_ON(). We can
      just error out the truncation and, since this is highly unexpected,
      surround the error check with WARN_ON() to get an informative stack
      trace, and tag the branch as 'unlikely'. Also make the 'min_size' variable
      const, since it's not supposed to ever change and any accidental change
      could possibly make the space migration not so unlikely to fail.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6822b3f7
    • btrfs: do not BUG_ON on failure to get dir index for new snapshot · df9f2782
      Filipe Manana authored
      During the transaction commit path, at create_pending_snapshot(), there
      is no need to BUG_ON() in case we fail to get a dir index for the snapshot
      in the parent directory. This should fail very rarely because the parent
      inode should be loaded in memory already, with the respective delayed
      inode created and the parent inode's index_cnt field already initialized.
      
      However if it fails, it may be -ENOMEM like the comment at
      create_pending_snapshot() says or any error returned by
      btrfs_search_slot() through btrfs_set_inode_index_count(), which can be
      pretty much anything such as -EIO or -EUCLEAN for example. So the comment
      is not correct when it says it can only be -ENOMEM.
      
      However doing a BUG_ON() here is overkill, since we can instead abort
      the transaction and return the error. Note that any error returned by
      create_pending_snapshot() will eventually result in a transaction
      abort at cleanup_transaction(), called from btrfs_commit_transaction(),
      but we can explicitly abort the transaction at this point instead so that
      we get a stack trace to tell us that the call to btrfs_set_inode_index()
      failed.
      
      So just abort the transaction and return in case btrfs_set_inode_index()
      returned an error at create_pending_snapshot().
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      df9f2782
    • btrfs: send: do not BUG_ON() on unexpected symlink data extent · 6f3eb72a
      Filipe Manana authored
      There's really no need to BUG_ON() if we find a symlink with an extent
      that is not inline or is compressed. We can just make send fail with
      an error (-EUCLEAN) and log an informative error message, so just do
      that instead of BUG_ON().
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6f3eb72a
    • btrfs: do not BUG_ON() when dropping inode items from log root · fc4026e2
      Filipe Manana authored
      When dropping inode items from a log tree at drop_inode_items(), we do a
      BUG_ON() on the result of btrfs_search_slot() because we don't expect an
      exact match since having a key with an offset of (u64)-1 is unexpected.
      That is generally true, but for dir index keys for example, we can get a
      key with such an offset value, even though it's very unlikely and it would
      take ages to increase the sequence counter for a dir index up to (u64)-1.
      We can deal with an exact match, we just have to delete the key at that
      slot, so there is really no need to BUG_ON(), error out or trigger any
      warning. So remove the BUG_ON().
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fc4026e2
    • btrfs: replace BUG_ON() at split_item() with proper error handling · 7569141e
      Filipe Manana authored
      There's no need to BUG_ON() at split_item() if the leaf does not have
      enough free space for the new item, we can just return -ENOSPC since
      the caller can deal with errors from split_item(). Also, as this is a
      very unlikely condition to happen, because the caller has invoked
      setup_leaf_for_split() before calling split_item(), surround the
      condition with a WARN_ON() which makes it easier to notice this
      unexpected condition and tags the if branch with 'unlikely' as well.
      
      I've actually once hit this BUG_ON() with some incorrect code changes
      I had, which was very inconvenient as it required rebooting the VM.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7569141e
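      The pattern this commit describes, warning on a condition that should
      never be true and returning an error instead of crashing, can be
      sketched in userspace. The macros and split_item_sketch() below are
      hypothetical stand-ins written for this note; the kernel's WARN_ON() and
      unlikely() are its own macros, and the real split_item() takes different
      arguments.

```c
#include <errno.h>
#include <stdio.h>

/* Branch hint; falls back to a plain expression on other compilers. */
#if defined(__GNUC__)
#define unlikely_sketch(x) __builtin_expect(!!(x), 0)
#else
#define unlikely_sketch(x) (x)
#endif

/* Report the unexpected condition and hand the condition back. */
static int warn_on_sketch(int cond, const char *expr, const char *file,
                          int line)
{
    if (cond)
        fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
    return cond;
}

#define WARN_ON_SKETCH(cond) \
    unlikely_sketch(warn_on_sketch(!!(cond), #cond, __FILE__, __LINE__))

/*
 * Simplified stand-in for a split operation: if the leaf lacks room for
 * the new item, warn and return -ENOSPC so the caller can handle it.
 */
static int split_item_sketch(unsigned int leaf_free_space,
                             unsigned int new_item_size)
{
    if (WARN_ON_SKETCH(leaf_free_space < new_item_size))
        return -ENOSPC;

    /* ... the actual split would happen here ... */
    return 0;
}
```

      A caller sees -ENOSPC and a warning on stderr for the unexpected case,
      and 0 otherwise, instead of the whole process being taken down.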
    • btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr() · 751a2761
      Filipe Manana authored
      At btrfs_del_ptr(), instead of doing a BUG_ON() in case we fail to record
      tree mod log operations, do a transaction abort and return the error to
      the callers. There's really no need for the BUG_ON() as we can release all
      resources in the context of all callers, and we have to abort because other
      future tree searches that use the tree mod log (btrfs_search_old_slot())
      may get inconsistent results if other operations modify the tree after
      that failure and before the tree mod log based search.
      
      This implies making btrfs_del_ptr() return an int instead of void, and
      making all callers check for returned errors.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      751a2761
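      The shape of this change (a helper that used to return void now aborts
      the transaction and returns an error, and its caller propagates it) can
      be sketched as follows. All names are hypothetical and the "tree mod
      log" step is reduced to a function that fails on request; this is an
      illustration of the control flow only, not the kernel code.

```c
#include <errno.h>

/* Hypothetical transaction handle: records the first abort reason. */
struct trans_sketch {
    int aborted; /* 0 if not aborted, otherwise a negative errno */
};

static void abort_transaction_sketch(struct trans_sketch *t, int err)
{
    if (!t->aborted)
        t->aborted = err;
}

/* Stand-in for recording a tree mod log operation; may fail. */
static int record_mod_log_sketch(int simulate_failure)
{
    return simulate_failure ? -ENOMEM : 0;
}

/* Formerly "void + BUG_ON()": now aborts and returns the error. */
static int del_ptr_sketch(struct trans_sketch *t, int simulate_failure)
{
    int ret = record_mod_log_sketch(simulate_failure);

    if (ret < 0) {
        abort_transaction_sketch(t, ret);
        return ret;
    }

    /* ... remove the pointer from the node ... */
    return 0;
}

/* Callers now have to check the return value and pass it up. */
static int caller_sketch(struct trans_sketch *t, int simulate_failure)
{
    int ret = del_ptr_sketch(t, simulate_failure);

    if (ret < 0)
        return ret;

    /* ... continue with the rest of the operation ... */
    return 0;
}
```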
    • btrfs: do not BUG_ON() on tree mod log failures at insert_ptr() · 50b5d1fc
      Filipe Manana authored
      At insert_ptr(), instead of doing a BUG_ON() in case we fail to record
      tree mod log operations, do a transaction abort and return the error to
      the callers. There's really no need for the BUG_ON() as we can release all
      resources in the context of all callers, and we have to abort because other
      future tree searches that use the tree mod log (btrfs_search_old_slot())
      may get inconsistent results if other operations modify the tree after
      that failure and before the tree mod log based search.
      
      This implies making insert_ptr() return an int instead of void, and making
      all callers check for returned errors.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      50b5d1fc
    • btrfs: do not BUG_ON() on tree mod log failure at insert_new_root() · f61aa7ba
      Filipe Manana authored
      At insert_new_root(), instead of doing a BUG_ON() in case we fail to
      record the tree mod log operation, just return the error to the callers
      after releasing the allocated tree block. At this point we haven't made
      any changes to the b+tree, so we haven't left it in an inconsistent state
      and therefore have no need to abort the transaction. All we need to do is
      to unlock and free the extent buffer we just allocated with the purpose
      of making it the new root.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
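      The insert_new_root() case differs from the others in that nothing
      has been modified yet when logging fails, so releasing the freshly
      allocated buffer is enough. A minimal userspace sketch of that error
      path follows; struct buf, struct tree, the live_bufs counter and the
      fail_log parameter are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an extent buffer: a lock flag and a ref count. */
struct buf {
    bool locked;
    int refs;
    struct buf *child;
};

/* Hypothetical tree: just a root pointer and a height. */
struct tree {
    struct buf *root;
    int level;
};

static int live_bufs; /* buffers allocated and not yet freed, to spot leaks */

static struct buf *buf_alloc_locked(void)
{
    struct buf *b = calloc(1, sizeof(*b));

    if (!b)
        return NULL;
    b->locked = true;
    b->refs = 1;
    live_bufs++;
    return b;
}

static void buf_unlock(struct buf *b)
{
    b->locked = false;
}

/* Drop one reference; free the buffer when the last one goes away. */
static void buf_put(struct buf *b)
{
    if (--b->refs == 0) {
        free(b);
        live_bufs--;
    }
}

static int insert_new_root(struct tree *t, bool fail_log)
{
    struct buf *c;

    assert(t != NULL);
    c = buf_alloc_locked();
    if (!c)
        return -ENOMEM;

    if (fail_log) {
        /* Tree untouched so far: undo the allocation, no abort needed. */
        buf_unlock(c);
        buf_put(c);
        return -ENOMEM;
    }

    /* Only now is the tree modified: the new buffer becomes the root. */
    c->child = t->root;
    t->root = c;
    t->level++;
    buf_unlock(c);
    return 0;
}
```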
    • btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert() · 11d6ae03
      Filipe Manana authored
      At push_nodes_for_insert(), instead of doing a BUG_ON() in case we fail to
      record tree mod log operations, do a transaction abort and return the
      error to the caller. There's really no need for the BUG_ON() as we can
      release all resources in this context, and we have to abort because other
      future tree searches that use the tree mod log (btrfs_search_old_slot())
      may get inconsistent results if other operations modify the tree after
      that failure and before the tree mod log based search.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: abort transaction at update_ref_for_cow() when ref count is zero · eced687e
      Filipe Manana authored
      At update_ref_for_cow() we are calling btrfs_handle_fs_error() if we find
      that the extent buffer has an unexpected ref count of zero, however we can
      simply use btrfs_abort_transaction(), which achieves the same purposes: to
      turn the fs to error state, abort the current transaction and turn the fs
      to RO mode as well. Besides that, btrfs_abort_transaction() also prints a
      stack trace which makes it more useful.
      
      Also, as this is a very unexpected situation, indicating a serious
      corruption/inconsistency, tag the if branch as 'unlikely', set the error
      code to -EUCLEAN instead of -EROFS, and log an explicit message.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
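      The three ingredients named in the update_ref_for_cow() commit (an
      'unlikely' branch, -EUCLEAN, an explicit message) can be sketched in
      userspace as below. The unlikely() macro is defined locally on top of
      the GCC/Clang builtin, and struct txn and check_refs() are
      hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* EUCLEAN is Linux-specific; this fallback value is an assumption. */
#ifndef EUCLEAN
#define EUCLEAN 117
#endif

/* Local branch hint built on the GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for a transaction handle. */
struct txn {
    bool aborted;
    int abort_err;
};

/*
 * Returns 0 when the reference count is sane. A count of zero signals
 * corruption: log it, abort the transaction and return -EUCLEAN.
 */
static int check_refs(struct txn *t, unsigned long long refs)
{
    assert(t != NULL);

    if (unlikely(refs == 0)) {
        fprintf(stderr, "unexpected zero reference count for extent buffer\n");
        t->aborted = true;
        t->abort_err = -EUCLEAN;
        return -EUCLEAN;
    }
    return 0;
}
```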
    • btrfs: abort transaction at balance_level() when left child is missing · 725026ed
      Filipe Manana authored
      At balance_level() we are calling btrfs_handle_fs_error() when the middle
      child only has 1 item and the left child is missing, however we can simply
      use btrfs_abort_transaction(), which achieves the same purposes: to turn
      the fs to error state, abort the current transaction and turn the fs to
      RO mode. Besides that, btrfs_abort_transaction() also prints a stack trace
      which makes it more useful.
      
      Also, as this is a highly unexpected case and it's about a b+tree
      inconsistency, change the error code from -EROFS to -EUCLEAN, tag the if
      branch as 'unlikely' and log an explicit error message.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid unnecessarily setting the fs to RO and error state at balance_level() · 87b8e9d0
      Filipe Manana authored
      At balance_level(), when trying to promote a child node to a root node, if
      we fail to read the child we call btrfs_handle_fs_error(), which turns the
      fs to RO mode and sets it to error state as well, causing any ongoing
      transaction to abort. This however is not necessary because at that point
      we have not made any change yet at balance_level(), so any error reading
      the child node does not leave us in an inconsistent state. Therefore we
      can just return the error to the caller.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename enospc label to out at balance_level() · daefe4d4
      Filipe Manana authored
      At balance_level() we have this 'enospc' label where we jump to in case
      we get an error at several places. However that error is certainly not
      -ENOSPC in all cases, it can be -EIO or -ENOMEM when reading a child
      extent buffer for example, or -ENOMEM when trying to record tree mod log
      operations. So to make this less confusing, rename the label to 'out'.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
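      The reason a neutral 'out' label reads better than 'enospc' is that
      one exit point serves several unrelated error codes. A minimal
      sketch, assuming a hypothetical balance_sketch() whose failures are
      injected through its parameters:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * read_err and log_err simulate failures from two different steps.
 * Both jump to the same 'out' label, which performs the shared cleanup
 * regardless of which errno is being returned.
 */
static int balance_sketch(int read_err, int log_err, int *cleanups)
{
    int ret = 0;
    char *scratch;

    assert(cleanups != NULL);
    scratch = malloc(16);
    if (!scratch)
        return -ENOMEM;

    if (read_err) {         /* e.g. -EIO or -ENOMEM from reading a child */
        ret = read_err;
        goto out;
    }
    if (log_err) {          /* e.g. -ENOMEM from recording a log operation */
        ret = log_err;
        goto out;
    }
    /* ... the actual work would go here ... */
out:
    free(scratch);
    (*cleanups)++;
    return ret;
}
```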
    • btrfs: do not BUG_ON() on tree mod log failure at balance_level() · 39020d8a
      Filipe Manana authored
      At balance_level(), instead of doing a BUG_ON() in case we fail to record
      tree mod log operations, do a transaction abort and return the error to
      the callers. There's really no need for the BUG_ON() as we can release
      all resources in this context, and we have to abort because other future
      tree searches that use the tree mod log (btrfs_search_old_slot()) may get
      inconsistent results if other operations modify the tree after that
      failure and before the tree mod log based search.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON() on tree mod log failure at __btrfs_cow_block() · 40b0a749
      Filipe Manana authored
      At __btrfs_cow_block(), instead of doing a BUG_ON() in case we fail to
      record a tree mod log root insertion operation, do a transaction abort.
      There's really no need for the BUG_ON(), as we can properly release all
      resources in this context and turn the filesystem to RO mode and to an
      error state.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid tree mod log ENOMEM failures when we don't need to log · 8793ed87
      Filipe Manana authored
      When logging tree mod log operations we start by checking, in a lockless
      manner, if we need to log - if we don't, we just return and do nothing,
      otherwise we will allocate one or more tree mod log operations and then
      check again if we need to log. This second check will take the tree mod
      log lock in write mode if we need to log, otherwise it will do nothing
      and we just free the allocated memory and return success.
      
      We can improve on this by not returning an error in case the memory
      allocations fail, unless the second check tells us that we actually need
      to log. That is, if we fail to allocate memory and the second check tells
      us that we don't need to log, we can just return success and avoid
      returning -ENOMEM to the caller. Currently tree mod log failures are
      dealt with either a BUG_ON() or a transaction abort, as tree mod log
      operations are logged in code paths that modify a b+tree.
      
      So just avoid failing with -ENOMEM when we fail to allocate a tree mod log
      operation and it turns out we don't need to log the operations, that is,
      when tree_mod_dont_log() returns true.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
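      The check, allocate, re-check sequence described in this commit can
      be sketched single-threaded as follows. All names (struct mod_log,
      need_log_lockless(), dont_log(), log_op()) are hypothetical; the
      users_leave parameter merely simulates the last log user going away
      between the two checks, and a boolean stands in for the write lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical logged operation, kept on a singly linked list. */
struct op {
    struct op *next;
};

/* Hypothetical log state; 'write_locked' stands in for a real rwlock. */
struct mod_log {
    int users;
    bool write_locked;
    struct op *head;
    int nentries;
};

/* First, lockless check: is anybody interested in the log at all? */
static bool need_log_lockless(const struct mod_log *l)
{
    return l->users > 0;
}

/*
 * Second check. Returns true when there is nothing to log; otherwise
 * takes the (simulated) write lock and returns false.
 */
static bool dont_log(struct mod_log *l)
{
    if (l->users == 0)
        return true;
    l->write_locked = true;
    return false;
}

static int log_op(struct mod_log *l, bool fail_alloc, bool users_leave)
{
    struct op *op;
    int ret = 0;

    assert(l != NULL);
    if (!need_log_lockless(l))
        return 0;

    /* Allocation may fail, but do not report that yet. */
    op = fail_alloc ? NULL : calloc(1, sizeof(*op));
    if (users_leave)
        l->users = 0;   /* simulate the last user leaving concurrently */

    if (dont_log(l)) {
        /* Nothing to log after all: an allocation failure is harmless. */
        free(op);
        return 0;
    }

    if (!op) {
        ret = -ENOMEM;  /* only now does the failed allocation matter */
    } else {
        op->next = l->head;
        l->head = op;
        l->nentries++;
    }
    l->write_locked = false;
    return ret;
}

/* Release every logged operation. */
static void log_free(struct mod_log *l)
{
    while (l->head) {
        struct op *next = l->head->next;

        free(l->head);
        l->head = next;
    }
    l->nentries = 0;
}
```

      The point of the design is that -ENOMEM reaches the caller, and so
      triggers a BUG_ON() or transaction abort there, only when the log
      entry was actually required.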
    • btrfs: fix extent buffer leak after tree mod log failure at split_node() · ede600e4
      Filipe Manana authored
      At split_node(), if we fail to log the tree mod log copy operation, we
      return without unlocking the split extent buffer we just allocated and
      without decrementing the reference we own on it. Fix this by unlocking
      it and decrementing the ref count before returning.
      
      Fixes: 5de865ee ("Btrfs: fix tree mod logging")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add missing error handling when logging operation while COWing extent buffer · d09c5152
      Filipe Manana authored
      When COWing an extent buffer that is not the root node, we need to log in
      the tree mod log that we replaced a pointer in the parent node, otherwise
      a tree mod log user doing a search on the b+tree can return incorrect
      results (that miss something). We are doing the call to
      btrfs_tree_mod_log_insert_key() but we totally ignore its return value.
      
      So fix this by adding the missing error handling, resulting in a
      transaction abort and freeing the COWed extent buffer.
      
      Fixes: f230475e ("Btrfs: put all block modifications into the tree mod log")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method · f02c75e6
      Christoph Hellwig authored
      Since commit a2ad63da ("VFS: add FMODE_CAN_ODIRECT file flag") file
      systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
      wiring up a dummy direct_IO method to indicate support for direct I/O.
      Do that for btrfs so that noop_direct_IO can eventually be removed.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
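      The idea of advertising direct I/O support through a per-file flag
      set at open time, rather than through the presence of a dummy
      method, can be sketched in userspace as below. The SK_CAN_ODIRECT
      value, struct sk_file and both functions are hypothetical, and the
      sketch assumes the open path rejects a direct I/O request with
      -EINVAL when the flag is absent.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_CAN_ODIRECT 0x1u /* hypothetical flag value */

/* Hypothetical open file: only the mode bits matter here. */
struct sk_file {
    unsigned int mode;
};

/* Filesystem open hook: declare direct I/O support by setting the flag. */
static void sk_fs_open(struct sk_file *f, bool supports_direct)
{
    if (supports_direct)
        f->mode |= SK_CAN_ODIRECT;
}

/* Generic check: a direct I/O request needs the capability flag. */
static int sk_check_open(const struct sk_file *f, bool want_direct)
{
    assert(f != NULL);

    if (want_direct && !(f->mode & SK_CAN_ODIRECT))
        return -EINVAL;
    return 0;
}
```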