1. 26 Oct, 2020 2 commits
    • Pujin Shi's avatar
      btrfs: tree-checker: fix incorrect printk format · cad69d13
      Pujin Shi authored
      This patch addresses a compile warning:
      
      fs/btrfs/extent-tree.c: In function '__btrfs_free_extent':
      fs/btrfs/extent-tree.c:3187:4: warning: format '%lu' expects argument of type 'long unsigned int', but argument 8 has type 'unsigned int' [-Wformat=]
      
      Fixes: 1c2a07f5 ("btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarPujin Shi <shipujin.t@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cad69d13
    • Josef Bacik's avatar
      btrfs: drop the path before adding block group sysfs files · 7837fa88
      Josef Bacik authored
      Dave reported a problem with my rwsem conversion patch where we got the
      following lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-default+ #1297 Not tainted
        ------------------------------------------------------
        kswapd0/76 is trying to acquire lock:
        ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
      
        but task is already holding lock:
        ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #4 (fs_reclaim){+.+.}-{0:0}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 fs_reclaim_acquire.part.0+0x25/0x30
      	 kmem_cache_alloc+0x30/0x9c0
      	 alloc_inode+0x81/0x90
      	 iget_locked+0xcd/0x1a0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x210
      	 sysfs_get_tree+0x1a/0x50
      	 vfs_get_tree+0x1d/0xb0
      	 path_mount+0x70f/0xa80
      	 do_mount+0x75/0x90
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #3 (kernfs_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 __mutex_lock+0xa0/0xaf0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_dir_ns+0x58/0x80
      	 sysfs_create_dir_ns+0x70/0xd0
      	 kobject_add_internal+0xbb/0x2d0
      	 kobject_add+0x7a/0xd0
      	 btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs]
      	 btrfs_read_block_groups+0x1f1/0x8c0 [btrfs]
      	 open_ctree+0x981/0x1108 [btrfs]
      	 btrfs_mount_root.cold+0xe/0xb0 [btrfs]
      	 legacy_get_tree+0x2d/0x60
      	 vfs_get_tree+0x1d/0xb0
      	 fc_mount+0xe/0x40
      	 vfs_kern_mount.part.0+0x71/0x90
      	 btrfs_mount+0x13b/0x3e0 [btrfs]
      	 legacy_get_tree+0x2d/0x60
      	 vfs_get_tree+0x1d/0xb0
      	 path_mount+0x70f/0xa80
      	 do_mount+0x75/0x90
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (btrfs-extent-00){++++}-{3:3}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 down_read_nested+0x45/0x220
      	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
      	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
      	 check_committed_ref+0x69/0x200 [btrfs]
      	 btrfs_cross_ref_exist+0x65/0xb0 [btrfs]
      	 run_delalloc_nocow+0x446/0x9b0 [btrfs]
      	 btrfs_run_delalloc_range+0x61/0x6a0 [btrfs]
      	 writepage_delalloc+0xae/0x160 [btrfs]
      	 __extent_writepage+0x262/0x420 [btrfs]
      	 extent_write_cache_pages+0x2b6/0x510 [btrfs]
      	 extent_writepages+0x43/0x90 [btrfs]
      	 do_writepages+0x40/0xe0
      	 __writeback_single_inode+0x62/0x610
      	 writeback_sb_inodes+0x20f/0x500
      	 wb_writeback+0xef/0x4a0
      	 wb_do_writeback+0x49/0x2e0
      	 wb_workfn+0x81/0x340
      	 process_one_work+0x233/0x5d0
      	 worker_thread+0x50/0x3b0
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (btrfs-fs-00){++++}-{3:3}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 down_read_nested+0x45/0x220
      	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
      	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
      	 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
      	 __btrfs_update_delayed_inode+0x93/0x2c0 [btrfs]
      	 __btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs]
      	 __btrfs_run_delayed_items+0x8e/0x140 [btrfs]
      	 btrfs_commit_transaction+0x367/0xbc0 [btrfs]
      	 btrfs_mksubvol+0x2db/0x470 [btrfs]
      	 btrfs_mksnapshot+0x7b/0xb0 [btrfs]
      	 __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
      	 btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
      	 btrfs_ioctl+0xd0b/0x2690 [btrfs]
      	 __x64_sys_ioctl+0x6f/0xa0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 check_prev_add+0x91/0xc60
      	 validate_chain+0xa6e/0x2a20
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 __mutex_lock+0xa0/0xaf0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
      	 btrfs_evict_inode+0x3cc/0x560 [btrfs]
      	 evict+0xd6/0x1c0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x54/0x80
      	 super_cache_scan+0x121/0x1a0
      	 do_shrink_slab+0x16d/0x3b0
      	 shrink_slab+0xb1/0x2e0
      	 shrink_node+0x230/0x6a0
      	 balance_pgdat+0x325/0x750
      	 kswapd+0x206/0x4d0
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/76:
         #0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
         #2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50
      
        stack backtrace:
        CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         dump_stack+0x77/0x97
         check_noncircular+0xff/0x110
         ? save_trace+0x50/0x470
         check_prev_add+0x91/0xc60
         validate_chain+0xa6e/0x2a20
         ? save_trace+0x50/0x470
         __lock_acquire+0x582/0xac0
         lock_acquire+0xca/0x430
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         __mutex_lock+0xa0/0xaf0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         ? __lock_acquire+0x582/0xac0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         ? btrfs_evict_inode+0x30b/0x560 [btrfs]
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         btrfs_evict_inode+0x3cc/0x560 [btrfs]
         evict+0xd6/0x1c0
         dispose_list+0x48/0x70
         prune_icache_sb+0x54/0x80
         super_cache_scan+0x121/0x1a0
         do_shrink_slab+0x16d/0x3b0
         shrink_slab+0xb1/0x2e0
         shrink_node+0x230/0x6a0
         balance_pgdat+0x325/0x750
         kswapd+0x206/0x4d0
         ? finish_wait+0x90/0x90
         ? balance_pgdat+0x750/0x750
         kthread+0x137/0x150
         ? kthread_mod_delayed_work+0xc0/0xc0
         ret_from_fork+0x1f/0x30
      
      This happens because we are still holding the path open when we start
      adding the sysfs files for the block groups, which creates a dependency
      on fs_reclaim via the tree lock.  Fix this by dropping the path before
      we start doing anything with sysfs.
      Reported-by: default avatarDavid Sterba <dsterba@suse.com>
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7837fa88
  2. 16 Oct, 2020 1 commit
    • Filipe Manana's avatar
      btrfs: fix relocation failure due to race with fallocate · 1afc708d
      Filipe Manana authored
      When doing a fallocate() we have a short time window, after reserving an
      extent and before starting a transaction, where if relocation for the block
      group containing the reserved extent happens, we can end up missing the
      extent in the data relocation inode causing relocation to fail later.
      
      This only happens when we don't pass a transaction to the internal
      fallocate function __btrfs_prealloc_file_range(), which is for all the
      cases where fallocate() is called from user space (the internal use cases
      include space cache extent allocation and relocation).
      
      When the race triggers the relocation failure, it produces a trace like
      the following:
      
        [200611.995995] ------------[ cut here ]------------
        [200611.997084] BTRFS: Transaction aborted (error -2)
        [200611.998208] WARNING: CPU: 3 PID: 235845 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x3a0/0x5b0 [btrfs]
        [200611.999042] Modules linked in: dm_thin_pool dm_persistent_data (...)
        [200612.003287] CPU: 3 PID: 235845 Comm: btrfs Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [200612.004442] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        [200612.006186] RIP: 0010:__btrfs_cow_block+0x3a0/0x5b0 [btrfs]
        [200612.007110] Code: 1b 00 00 02 72 2a 83 f8 fb 0f 84 b8 01 (...)
        [200612.007341] BTRFS warning (device sdb): Skipping commit of aborted transaction.
        [200612.008959] RSP: 0018:ffffaee38550f918 EFLAGS: 00010286
        [200612.009672] BTRFS: error (device sdb) in cleanup_transaction:1901: errno=-30 Readonly filesystem
        [200612.010428] RAX: 0000000000000000 RBX: ffff9174d96f4000 RCX: 0000000000000000
        [200612.011078] BTRFS info (device sdb): forced readonly
        [200612.011862] RDX: 0000000000000001 RSI: ffffffffa8161978 RDI: 00000000ffffffff
        [200612.013215] RBP: ffff9172569a0f80 R08: 0000000000000000 R09: 0000000000000000
        [200612.014263] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9174b8403b88
        [200612.015203] R13: ffff9174b8400a88 R14: ffff9174c90f1000 R15: ffff9174a5a60e08
        [200612.016182] FS:  00007fa55cf878c0(0000) GS:ffff9174ece00000(0000) knlGS:0000000000000000
        [200612.017174] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [200612.018418] CR2: 00007f8fb8048148 CR3: 0000000428a46003 CR4: 00000000003706e0
        [200612.019510] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [200612.020648] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [200612.021520] Call Trace:
        [200612.022434]  btrfs_cow_block+0x10b/0x250 [btrfs]
        [200612.023407]  do_relocation+0x54e/0x7b0 [btrfs]
        [200612.024343]  ? do_raw_spin_unlock+0x4b/0xc0
        [200612.025280]  ? _raw_spin_unlock+0x29/0x40
        [200612.026200]  relocate_tree_blocks+0x3bc/0x6d0 [btrfs]
        [200612.027088]  relocate_block_group+0x2f3/0x600 [btrfs]
        [200612.027961]  btrfs_relocate_block_group+0x15e/0x340 [btrfs]
        [200612.028896]  btrfs_relocate_chunk+0x38/0x110 [btrfs]
        [200612.029772]  btrfs_balance+0xb22/0x1790 [btrfs]
        [200612.030601]  ? btrfs_ioctl_balance+0x253/0x380 [btrfs]
        [200612.031414]  btrfs_ioctl_balance+0x2cf/0x380 [btrfs]
        [200612.032279]  btrfs_ioctl+0x620/0x36f0 [btrfs]
        [200612.033077]  ? _raw_spin_unlock+0x29/0x40
        [200612.033948]  ? handle_mm_fault+0x116d/0x1ca0
        [200612.034749]  ? up_read+0x18/0x240
        [200612.035542]  ? __x64_sys_ioctl+0x83/0xb0
        [200612.036244]  __x64_sys_ioctl+0x83/0xb0
        [200612.037269]  do_syscall_64+0x33/0x80
        [200612.038190]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [200612.038976] RIP: 0033:0x7fa55d07ed87
        [200612.040127] Code: 00 00 00 48 8b 05 09 91 0c 00 64 c7 00 26 (...)
        [200612.041669] RSP: 002b:00007ffd5ebf03e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
        [200612.042437] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fa55d07ed87
        [200612.043511] RDX: 00007ffd5ebf0470 RSI: 00000000c4009420 RDI: 0000000000000003
        [200612.044250] RBP: 0000000000000003 R08: 000055d8362642a0 R09: 00007fa55d148be0
        [200612.044963] R10: fffffffffffff52e R11: 0000000000000206 R12: 00007ffd5ebf1614
        [200612.045683] R13: 00007ffd5ebf0470 R14: 0000000000000002 R15: 00007ffd5ebf0470
        [200612.046361] irq event stamp: 0
        [200612.047040] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [200612.047725] hardirqs last disabled at (0): [<ffffffffa6eb5ab3>] copy_process+0x823/0x1bc0
        [200612.048387] softirqs last  enabled at (0): [<ffffffffa6eb5ab3>] copy_process+0x823/0x1bc0
        [200612.049024] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [200612.049722] ---[ end trace 49006c6876e65227 ]---
      
      The race happens like this:
      
      1) Task A starts an fallocate() (plain or zero range) and it calls
         __btrfs_prealloc_file_range() with the 'trans' parameter set to NULL;
      
      2) Task A calls btrfs_reserve_extent() and gets an extent that belongs to
         block group X;
      
      3) Before task A gets into btrfs_replace_file_extents(), through the call
         to insert_prealloc_file_extent(), task B starts relocation of block
         group X;
      
      4) Task B enters btrfs_relocate_block_group() and it sets block group X to
         RO mode;
      
      5) Task B enters relocate_block_group(), it calls prepare_to_relocate()
         whichs joins/starts a transaction and then commits the transaction;
      
      6) Task B then starts scanning the extent tree looking for extents that
         belong to block group X - it does not find yet the extent reserved by
         task A, since that extent was not yet added to the extent tree, as its
         delayed reference was not even yet created at this point;
      
      7) The data relocation inode ends up not having the extent reserved by
         task A associated to it;
      
      8) Task A then starts a transaction through btrfs_replace_file_extents(),
         inserts a file extent item in the subvolume tree pointing to the
         reserved extent and creates a delayed reference for it;
      
      9) Task A finishes and returns success to user space;
      
      10) Later on, while relocation is still in progress, the leaf where task A
          inserted the new file extent item is COWed, so we end up at
          __btrfs_cow_block(), which calls btrfs_reloc_cow_block(), and that in
          turn calls relocation.c:replace_file_extents();
      
      11) At relocation.c:replace_file_extents() we iterate over all the items in
          the leaf and find the file extent item pointing to the extent that was
          allocated by task A, and then call relocation.c:get_new_location(), to
          find the new location for the extent;
      
      12) However relocation.c:get_new_location() fails, returning -ENOENT,
          because it couldn't find a corresponding file extent item associated
          with the data relocation inode. This is because the extent was not seen
          in the extent tree at step 6). The -ENOENT error is propagated to
          __btrfs_cow_block(), which aborts the transaction.
      
      So fix this simply by decrementing the block group's number of reservations
      after calling insert_prealloc_file_extent(), as relocation waits for that
      counter to go down to zero before calling prepare_to_relocate() and start
      looking for extents in the extent tree.
      
      This issue only started to happen recently as of commit 8fccebfa
      ("btrfs: fix metadata reservation for fallocate that leads to transaction
      aborts"), because now we can reserve an extent before starting/joining a
      transaction, and previously we always did it after that, so relocation
      ended up waiting for a concurrent fallocate() to finish because before
      searching for the extents of the block group, it starts/joins a transaction
      and then commits it (at prepare_to_relocate()), which made it wait for the
      fallocate task to complete first.
      
      Fixes: 8fccebfa ("btrfs: fix metadata reservation for fallocate that leads to transaction aborts")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1afc708d
  3. 07 Oct, 2020 37 commits
    • Nikolay Borisov's avatar
      btrfs: rename BTRFS_INODE_ORDERED_DATA_CLOSE flag · 1fd4033d
      Nikolay Borisov authored
      Commit 8d875f95 ("btrfs: disable strict file flushes for
      renames and truncates") eliminated the notion of ordered operations and
      instead BTRFS_INODE_ORDERED_DATA_CLOSE only remained as a flag
      indicating that a file's content should be synced to disk in case a
      file is truncated and any writes happen to it concurrently. In fact
      this intendend behavior was broken until it was fixed in
      f6dc45c7 ("Btrfs: fix filemap_flush call in btrfs_file_release").
      
      All things considered let's give the flag a more descriptive name. Also
      slightly reword comments.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1fd4033d
    • Madhuparna Bhowmik's avatar
      btrfs: annotate device name rcu_string with __rcu · 8d1a7aae
      Madhuparna Bhowmik authored
      This patch fixes the following sparse errors in
      fs/btrfs/super.c in function btrfs_show_devname()
      
        fs/btrfs/super.c: error: incompatible types in comparison expression (different address spaces):
        fs/btrfs/super.c:    struct rcu_string [noderef] <asn:4> *
        fs/btrfs/super.c:    struct rcu_string *
      
      The error was because of the following line in function btrfs_show_devname():
      
        if (first_dev)
      	 seq_escape(m, rcu_str_deref(first_dev->name), " \t\n\\");
      
      Annotating the btrfs_device::name member with __rcu fixes the sparse
      error.
      Acked-by: default avatarJoel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: default avatarMadhuparna Bhowmik <madhuparnabhowmik04@gmail.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8d1a7aae
    • Anand Jain's avatar
      btrfs: skip devices without magic signature when mounting · 96c2e067
      Anand Jain authored
      Many things can happen after the device is scanned and before the device
      is mounted.  One such thing is losing the BTRFS_MAGIC on the device.
      If it happens we still won't free that device from the memory and cause
      the userland confusion.
      
      For example: As the BTRFS_IOC_DEV_INFO still carries the device path
      which does not have the BTRFS_MAGIC, 'btrfs fi show' still lists
      device which does not belong to the filesystem anymore:
      
        $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb
        $ wipefs -a /dev/sdb
        # /dev/sdb does not contain magic signature
        $ mount -o degraded /dev/sda /btrfs
        $ btrfs fi show -m
        Label: none  uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd
      	  Total devices 2 FS bytes used 128.00KiB
      	  devid    1 size 3.00GiB used 571.19MiB path /dev/sda
      	  devid    2 size 3.00GiB used 571.19MiB path /dev/sdb
      
      We need to distinguish the missing signature and invalid superblock, so
      add a specific error code ENODATA for that. This also fixes failure of
      fstest btrfs/198.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      96c2e067
    • Josef Bacik's avatar
      btrfs: cleanup cow block on error · 572c83ac
      Josef Bacik authored
      In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead
      to a system lockup. It gets stuck trying to write back inodes, and the
      write back thread was trying to lock an extent buffer:
      
        $ cat /proc/2143497/stack
        [<0>] __btrfs_tree_lock+0x108/0x250
        [<0>] lock_extent_buffer_for_io+0x35e/0x3a0
        [<0>] btree_write_cache_pages+0x15a/0x3b0
        [<0>] do_writepages+0x28/0xb0
        [<0>] __writeback_single_inode+0x54/0x5c0
        [<0>] writeback_sb_inodes+0x1e8/0x510
        [<0>] wb_writeback+0xcc/0x440
        [<0>] wb_workfn+0xd7/0x650
        [<0>] process_one_work+0x236/0x560
        [<0>] worker_thread+0x55/0x3c0
        [<0>] kthread+0x13a/0x150
        [<0>] ret_from_fork+0x1f/0x30
      
      This is because we got an error while COWing a block, specifically here
      
              if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
                      ret = btrfs_reloc_cow_block(trans, root, buf, cow);
                      if (ret) {
                              btrfs_abort_transaction(trans, ret);
                              return ret;
                      }
              }
      
        [16402.241552] BTRFS: Transaction aborted (error -2)
        [16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540
        [16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8
        [16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        [16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540
        [16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282
        [16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388
        [16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380
        [16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000
        [16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0
        [16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000
        [16402.256321] FS:  00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000
        [16402.256973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0
        [16402.257867] Call Trace:
        [16402.258072]  btrfs_cow_block+0x109/0x230
        [16402.258356]  btrfs_search_slot+0x530/0x9d0
        [16402.258655]  btrfs_lookup_file_extent+0x37/0x40
        [16402.259155]  __btrfs_drop_extents+0x13c/0xd60
        [16402.259628]  ? btrfs_block_rsv_migrate+0x4f/0xb0
        [16402.259949]  btrfs_replace_file_extents+0x190/0x820
        [16402.260873]  btrfs_clone+0x9ae/0xc00
        [16402.261139]  btrfs_extent_same_range+0x66/0x90
        [16402.261771]  btrfs_remap_file_range+0x353/0x3b1
        [16402.262333]  vfs_dedupe_file_range_one.part.0+0xd5/0x140
        [16402.262821]  vfs_dedupe_file_range+0x189/0x220
        [16402.263150]  do_vfs_ioctl+0x552/0x700
        [16402.263662]  __x64_sys_ioctl+0x62/0xb0
        [16402.264023]  do_syscall_64+0x33/0x40
        [16402.264364]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [16402.264862] RIP: 0033:0x7f542c7d15cb
        [16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb
        [16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003
        [16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64
        [16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
        [16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80
        [16402.271498] irq event stamp: 0
        [16402.271846] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [16402.272497] hardirqs last disabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
        [16402.273343] softirqs last  enabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
        [16402.273905] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [16402.274338] ---[ end trace 737874a5a41a8236 ]---
        [16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
        [16402.276179] BTRFS info (device dm-9): forced readonly
        [16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry
        [16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
        [16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
        [16402.280582] BTRFS info (device dm-9): balance: ended with status: -30
      
      The problem here is that as soon as we allocate the new block it is
      locked and marked dirty in the btree inode.  This means that we could
      attempt to writeback this block and need to lock the extent buffer.
      However we're not unlocking it here and thus we deadlock.
      
      Fix this by unlocking the cow block if we have any errors inside of
      __btrfs_cow_block, and also free it so we do not leak it.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      572c83ac
    • Goldwyn Rodrigues's avatar
      btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK · e3c57805
      Goldwyn Rodrigues authored
      Since we now perform direct reads using i_rwsem, we can remove this
      inode flag used to co-ordinate unlocked reads.
      
      The truncate call takes i_rwsem. This means it is correctly synchronized
      with concurrent direct reads.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jth@kernel.org>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3c57805
    • Goldwyn Rodrigues's avatar
      fs: remove no longer used dio_end_io() · c33fe275
      Goldwyn Rodrigues authored
      Since we removed the last user of dio_end_io() when btrfs got converted
      to iomap infrastructure ("btrfs: switch to iomap for direct IO"), remove
      the helper function dio_end_io().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c33fe275
    • Josef Bacik's avatar
      btrfs: return error if we're unable to read device stats · 92e26df4
      Josef Bacik authored
      I noticed when fixing device stats for seed devices that we simply threw
      away the return value from btrfs_search_slot().  This is because we may
      not have stat items, but we could very well get an error, and thus miss
      reporting the error up the chain.
      
      Fix this by returning ret if it's an actual error, and then stop trying
      to init the rest of the devices stats and return the error up the chain.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      92e26df4
    • Josef Bacik's avatar
      btrfs: init device stats for seed devices · 124604eb
      Josef Bacik authored
      We recently started recording device stats across the fleet, and noticed
      a large increase in messages such as this
      
        BTRFS warning (device dm-0): get dev_stats failed, not yet valid
      
      on our tiers that use seed devices for their root devices.  This is
      because we do not initialize the device stats for any seed devices if we
      have a sprout device and mount using that sprout device.  The basic
      steps for reproducing are:
      
        $ mkfs seed device
        $ mount seed device
        # fill seed device
        $ umount seed device
        $ btrfstune -S 1 seed device
        $ mount seed device
        $ btrfs device add -f sprout device /mnt/wherever
        $ umount /mnt/wherever
        $ mount sprout device /mnt/wherever
        $ btrfs device stats /mnt/wherever
      
      This will fail with the above message in dmesg.
      
      Fix this by iterating over the fs_devices->seed if they exist in
      btrfs_init_dev_stats.  This fixed the problem and properly reports the
      stats for both devices.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ rename to btrfs_device_init_dev_stats ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      124604eb
    • Nikolay Borisov's avatar
      btrfs: remove struct extent_io_ops · 905eb88b
      Nikolay Borisov authored
      It's no longer used just remove the function and any related code which
      was initialising it for inodes. No functional changes.
      
      Removing 8 bytes from extent_io_tree in turn reduces size of other
      structures where it is embedded, notably btrfs_inode where it reduces
      size by 24 bytes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      905eb88b
    • Nikolay Borisov's avatar
      btrfs: call submit_bio_hook directly for metadata pages · 1b36294a
      Nikolay Borisov authored
      No need to go through a function pointer indirection simply call
      submit_bio_hook directly by exporting and renaming the helper to
      btrfs_submit_metadata_bio. This makes the code more readable and should
      result in somewhat faster code due to no longer paying the price for
      specualtive attack mitigations that come with indirect function calls.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1b36294a
    • Nikolay Borisov's avatar
      btrfs: stop calling submit_bio_hook for data inodes · 908930f3
      Nikolay Borisov authored
      Instead export and rename the function to btrfs_submit_data_bio and
      call it directly in submit_one_bio. This avoids paying the cost for
      speculative attacks mitigations and improves code readability.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      908930f3
    • Nikolay Borisov's avatar
      btrfs: don't opencode is_data_inode in end_bio_extent_readpage · be17b3af
      Nikolay Borisov authored
      Use the is_data_inode helper.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      be17b3af
    • Nikolay Borisov's avatar
      btrfs: call submit_bio_hook directly in submit_one_bio · cd053744
      Nikolay Borisov authored
      BTRFS has 2 inode types (for the purposes of the code in submit_one_bio)
      - ordinary data inodes (including the freespace inode) and the btree
      inode. Both of these implement submit_bio_hook so btrfsic_submit_bio can
      never be called from submit_one_bio so just remove it.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd053744
    • Nikolay Borisov's avatar
      btrfs: remove extent_io_ops::readpage_end_io_hook · 1f03d9cf
      Nikolay Borisov authored
      It's no longer used so let's remove it.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1f03d9cf
    • Nikolay Borisov's avatar
      btrfs: replace readpage_end_io_hook with direct calls · 9a446d6a
      Nikolay Borisov authored
      Don't call readpage_end_io_hook for the btree inode.  Instead of relying
      on indirect calls to implement metadata buffer validation simply check
      if the inode whose page we are processing equals the btree inode. If it
      does call the necessary function.
      
      This is an improvement in 2 directions:
      
      1. We aren't paying the penalty of indirect calls in a post-speculation
         attacks world.
      
      2. The function is now named more explicitly so it's obvious what's
         going on
      
      This is in preparation to removing struct extent_io_ops altogether.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9a446d6a
    • Filipe Manana's avatar
      btrfs: send, recompute reference path after orphanization of a directory · 9c2b4e03
      Filipe Manana authored
      During an incremental send, when an inode has multiple new references we
      might end up emitting rename operations for orphanizations that have a
      source path that is no longer valid due to a previous orphanization of
      some directory inode. This causes the receiver to fail since it tries
      to rename a path that does not exists.
      
      Example reproducer:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdi >/dev/null
        mount /dev/sdi /mnt/sdi
      
        touch /mnt/sdi/f1
        touch /mnt/sdi/f2
        mkdir /mnt/sdi/d1
        mkdir /mnt/sdi/d1/d2
      
        # Filesystem looks like:
        #
        # .                           (ino 256)
        # |----- f1                   (ino 257)
        # |----- f2                   (ino 258)
        # |----- d1/                  (ino 259)
        #        |----- d2/           (ino 260)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
        btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
      
        # Now do a series of changes such that:
        #
        # *) inode 258 has one new hardlink and the previous name changed
        #
        # *) both names conflict with the old names of two other inodes:
        #
        #    1) the new name "d1" conflicts with the old name of inode 259,
        #       under directory inode 256 (root)
        #
        #    2) the new name "d2" conflicts with the old name of inode 260
        #       under directory inode 259
        #
        # *) inodes 259 and 260 now have the old names of inode 258
        #
        # *) inode 257 is now located under inode 260 - an inode with a number
        #    smaller than the inode (258) for which we created a second hard
        #    link and swapped its names with inodes 259 and 260
        #
        ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
        mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1
      
        # Swap d1 and f2.
        mv /mnt/sdi/d1 /mnt/sdi/tmp
        mv /mnt/sdi/f2 /mnt/sdi/d1
        mv /mnt/sdi/tmp /mnt/sdi/f2
      
        # Swap d2 and f2_link
        mv /mnt/sdi/f2/d2 /mnt/sdi/tmp
        mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2
        mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link
      
        # Filesystem now looks like:
        #
        # .                                (ino 256)
        # |----- d1                        (ino 258)
        # |----- f2/                       (ino 259)
        #        |----- f2_link/           (ino 260)
        #        |       |----- f1         (ino 257)
        #        |
        #        |----- d2                 (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
        btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
      
        mkfs.btrfs -f /dev/sdj >/dev/null
        mount /dev/sdj /mnt/sdj
      
        btrfs receive -f /tmp/snap1.send /mnt/sdj
        btrfs receive -f /tmp/snap2.send /mnt/sdj
      
        umount /mnt/sdi
        umount /mnt/sdj
      
      When executed the receive of the incremental stream fails:
      
        $ ./reproducer.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory
      
      This happens because:
      
      1) When processing inode 257 we end up computing the name for inode 259
         because it is an ancestor in the send snapshot, and at that point it
         still has its old name, "d1", from the parent snapshot because inode
         259 was not yet processed. We then cache that name, which is valid
         until we start processing inode 259 (or set the progress to 260 after
         processing its references);
      
      2) Later we start processing inode 258 and collecting all its new
         references into the list sctx->new_refs. The first reference in the
         list happens to be the reference for name "d1" while the reference for
         name "d2" is next (the last element of the list).
         We compute the full path "d1/d2" for this second reference and store
         it in the reference (its ->full_path member). The path used for the
         new parent directory was "d1" and not "f2" because inode 259, the
         new parent, was not yet processed;
      
      3) When we start processing the new references at process_recorded_refs()
         we start with the first reference in the list, for the new name "d1".
         Because there is a conflicting inode that was not yet processed, which
         is directory inode 259, we orphanize it, renaming it from "d1" to
         "o259-6-0";
      
      4) Then we start processing the new reference for name "d2", and we
         realize it conflicts with the reference of inode 260 in the parent
         snapshot. So we issue an orphanization operation for inode 260 by
         emitting a rename operation with a destination path of "o260-6-0"
         and a source path of "d1/d2" - this source path is the value we
         stored in the reference earlier at step 2), corresponding to the
         ->full_path member of the reference, however that path is no longer
         valid due to the orphanization of the directory inode 259 in step 3).
         This makes the receiver fail since the path does not exists, it should
         have been "o259-6-0/d2".
      
      Fix this by recomputing the full path of a reference before emitting an
      orphanization if we previously orphanized any directory, since that
      directory could be a parent in the new path. This is a rare scenario so
      keeping it simple and not checking if that previously orphanized directory
      is in fact an ancestor of the inode we are trying to orphanize.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c2b4e03
    • Filipe Manana's avatar
      btrfs: send, orphanize first all conflicting inodes when processing references · 98272bb7
      Filipe Manana authored
      When doing an incremental send it is possible that when processing the new
      references for an inode we end up issuing rename or link operations that
      have an invalid path, which contains the orphanized name of a directory
      before we actually orphanized it, causing the receiver to fail.
      
      The following reproducer triggers such scenario:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdi >/dev/null
        mount /dev/sdi /mnt/sdi
      
        touch /mnt/sdi/a
        touch /mnt/sdi/b
        mkdir /mnt/sdi/testdir
        # We want "a" to have a lower inode number then "testdir" (257 vs 259).
        mv /mnt/sdi/a /mnt/sdi/testdir/a
      
        # Filesystem looks like:
        #
        # .                           (ino 256)
        # |----- testdir/             (ino 259)
        # |          |----- a         (ino 257)
        # |
        # |----- b                    (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
        btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
      
        # Now rename 259 to "testdir_2", then change the name of 257 to
        # "testdir" and make it a direct descendant of the root inode (256).
        # Also create a new link for inode 257 with the old name of inode 258.
        # By swapping the names and location of several inodes and create a
        # nasty dependency chain of rename and link operations.
        mv /mnt/sdi/testdir/a /mnt/sdi/a2
        touch /mnt/sdi/testdir/a
        mv /mnt/sdi/b /mnt/sdi/b2
        ln /mnt/sdi/a2 /mnt/sdi/b
        mv /mnt/sdi/testdir /mnt/sdi/testdir_2
        mv /mnt/sdi/a2 /mnt/sdi/testdir
      
        # Filesystem now looks like:
        #
        # .                            (ino 256)
        # |----- testdir_2/            (ino 259)
        # |          |----- a          (ino 260)
        # |
        # |----- testdir               (ino 257)
        # |----- b                     (ino 257)
        # |----- b2                    (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
        btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
      
        mkfs.btrfs -f /dev/sdj >/dev/null
        mount /dev/sdj /mnt/sdj
      
        btrfs receive -f /tmp/snap1.send /mnt/sdj
        btrfs receive -f /tmp/snap2.send /mnt/sdj
      
        umount /mnt/sdi
        umount /mnt/sdj
      
      When running the reproducer, the receive of the incremental send stream
      fails:
      
        $ ./reproducer.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: link b -> o259-6-0/a failed: No such file or directory
      
      The problem happens because of the following:
      
      1) Before we start iterating the list of new references for inode 257,
         we generate its current path and store it at @valid_path, done at
         the very beginning of process_recorded_refs(). The generated path
         is "o259-6-0/a", containing the orphanized name for inode 259;
      
      2) Then we iterate over the list of new references, which has the
         references "b" and "testdir" in that specific order;
      
      3) We process reference "b" first, because it is in the list before
         reference "testdir". We then issue a link operation to create
         the new reference "b" using a target path corresponding to the
         content at @valid_path, which corresponds to "o259-6-0/a".
         However we haven't yet orphanized inode 259, its name is still
         "testdir", and not "o259-6-0". The orphanization of 259 did not
         happen yet because we will process the reference named "testdir"
         for inode 257 only in the next iteration of the loop that goes
         over the list of new references.
      
      Fix the issue by having a preliminar iteration over all the new references
      at process_recorded_refs(). This iteration is responsible only for doing
      the orphanization of other inodes that have and old reference that
      conflicts with one of the new references of the inode we are currently
      processing. The emission of rename and link operations happen now in the
      next iteration of the new references.
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      98272bb7
    • Qu Wenruo's avatar
      btrfs: tree-checker: fix false alert caused by legacy btrfs root item · 1465af12
      Qu Wenruo authored
      Commit 259ee775 ("btrfs: tree-checker: Add ROOT_ITEM check")
      introduced btrfs root item size check, however btrfs root item has two
      versions, the legacy one which just ends before generation_v2 member, is
      smaller than current btrfs root item size.
      
      This caused btrfs kernel to reject valid but old tree root leaves.
      
      Fix this problem by also allowing legacy root item, since kernel can
      already handle them pretty well and upgrade to newer root item format
      when needed.
      Reported-by: default avatarMartin Steigerwald <martin@lichtvoll.de>
      Fixes: 259ee775 ("btrfs: tree-checker: Add ROOT_ITEM check")
      CC: stable@vger.kernel.org # 5.4+
      Tested-By: default avatarMartin Steigerwald <martin@lichtvoll.de>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1465af12
    • David Sterba's avatar
      btrfs: use unaligned helpers for stack and header set/get helpers · e97659ce
      David Sterba authored
      In the definitions generated by BTRFS_SETGET_HEADER_FUNCS there's direct
      pointer assignment but we should use the helpers for unaligned access
      for clarity. It hasn't been a problem so far because of the natural
      alignment.
      
      Similarly for BTRFS_SETGET_STACK_FUNCS, that usually get a structure
      from stack that has an aligned start but some members may not be aligned
      due to packing. This as well hasn't caused problems so far.
      
      Move the put/get_unaligned_le8 stubs to ctree.h so we can use them.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e97659ce
    • David Sterba's avatar
      btrfs: free-space-cache: use unaligned helpers to access data · 6994ca36
      David Sterba authored
      The free space inode stores the tracking data, checksums etc, using the
      io_ctl structure and moving the pointers. The data are generally aligned
      to at least 4 bytes (u32 for CRC) so it's not completely unaligned but
      for clarity we should use the proper helpers whenever a struct is
      initialized from io_ctl->cur pointer.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6994ca36
    • David Sterba's avatar
      btrfs: send: use helpers for unaligned access to header members · e2f896b3
      David Sterba authored
      The header is mapped onto the send buffer and thus its members may be
      potentially unaligned so use the helpers instead of directly assigning
      the pointers. This has worked so far but let's use the helpers to make
      that clear.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e2f896b3
    • Qu Wenruo's avatar
      btrfs: use own btree inode io_tree owner id · 2c53a14d
      Qu Wenruo authored
      Btree inode is special compared to all other inode extent io_trees,
      although it has a btrfs inode, it doesn't have the track_uptodate bit at
      all.
      
      This means a lot of things like extent locking doesn't even need to be
      applied to btree io tree.
      
      Since it's so special, adds a new owner value for it to make debuging a
      little easier.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2c53a14d
    • Johannes Thumshirn's avatar
      btrfs: reschedule when cloning lots of extents · 6b613cc9
      Johannes Thumshirn authored
      We have several occurrences of a soft lockup from fstest's generic/175
      testcase, which look more or less like this one:
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [xfs_io:10030]
        Kernel panic - not syncing: softlockup: hung tasks
        CPU: 0 PID: 10030 Comm: xfs_io Tainted: G             L    5.9.0-rc5+ #768
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         <IRQ>
         dump_stack+0x77/0xa0
         panic+0xfa/0x2cb
         watchdog_timer_fn.cold+0x85/0xa5
         ? lockup_detector_update_enable+0x50/0x50
         __hrtimer_run_queues+0x99/0x4c0
         ? recalibrate_cpu_khz+0x10/0x10
         hrtimer_run_queues+0x9f/0xb0
         update_process_times+0x28/0x80
         tick_handle_periodic+0x1b/0x60
         __sysvec_apic_timer_interrupt+0x76/0x210
         asm_call_on_stack+0x12/0x20
         </IRQ>
         sysvec_apic_timer_interrupt+0x7f/0x90
         asm_sysvec_apic_timer_interrupt+0x12/0x20
        RIP: 0010:btrfs_tree_unlock+0x91/0x1a0 [btrfs]
        RSP: 0018:ffffc90007123a58 EFLAGS: 00000282
        RAX: ffff8881cea2fbe0 RBX: ffff8881cea2fbe0 RCX: 0000000000000000
        RDX: ffff8881d23fd200 RSI: ffffffff82045220 RDI: ffff8881cea2fba0
        RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000032
        R10: 0000160000000000 R11: 0000000000001000 R12: 0000000000001000
        R13: ffff8882357fd5b0 R14: ffff88816fa76e70 R15: ffff8881cea2fad0
         ? btrfs_tree_unlock+0x15b/0x1a0 [btrfs]
         btrfs_release_path+0x67/0x80 [btrfs]
         btrfs_insert_replace_extent+0x177/0x2c0 [btrfs]
         btrfs_replace_file_extents+0x472/0x7c0 [btrfs]
         btrfs_clone+0x9ba/0xbd0 [btrfs]
         btrfs_clone_files.isra.0+0xeb/0x140 [btrfs]
         ? file_update_time+0xcd/0x120
         btrfs_remap_file_range+0x322/0x3b0 [btrfs]
         do_clone_file_range+0xb7/0x1e0
         vfs_clone_file_range+0x30/0xa0
         ioctl_file_clone+0x8a/0xc0
         do_vfs_ioctl+0x5b2/0x6f0
         __x64_sys_ioctl+0x37/0xa0
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f87977fc247
        RSP: 002b:00007ffd51a2f6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f87977fc247
        RDX: 00007ffd51a2f710 RSI: 000000004020940d RDI: 0000000000000003
        RBP: 0000000000000004 R08: 00007ffd51a79080 R09: 0000000000000000
        R10: 00005621f11352f2 R11: 0000000000000206 R12: 0000000000000000
        R13: 0000000000000000 R14: 00005621f128b958 R15: 0000000080000000
        Kernel Offset: disabled
        ---[ end Kernel panic - not syncing: softlockup: hung tasks ]---
      
      All of these lockup reports have the call chain btrfs_clone_files() ->
      btrfs_clone() in common. btrfs_clone_files() calls btrfs_clone() with
      both source and destination extents locked and loops over the source
      extent to create the clones.
      
      Conditionally reschedule in the btrfs_clone() loop, to give some time back
      to other processes.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b613cc9
    • Denis Efremov's avatar
      btrfs: use kvcalloc for allocation in btrfs_ioctl_send() · bae12df9
      Denis Efremov authored
      Replace kvzalloc() call with kvcalloc() that also checks the size
      internally. There's a standalone overflow check in the function so we
      can return invalid parameter combination.  Use array_size() helper to
      compute the memory size for clone_sources_tmp.
      
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarDenis Efremov <efremov@linux.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bae12df9
    • Denis Efremov's avatar
      btrfs: use kvzalloc() to allocate clone_roots in btrfs_ioctl_send() · 8eb2fd00
      Denis Efremov authored
      btrfs_ioctl_send() used open-coded kvzalloc implementation earlier.
      The code was accidentally replaced with kzalloc() call [1]. Restore
      the original code by using kvzalloc() to allocate sctx->clone_roots.
      
      [1] https://patchwork.kernel.org/patch/9757891/#20529627
      
      Fixes: 818e010b ("btrfs: replace opencoded kvzalloc with the helper")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarDenis Efremov <efremov@linux.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8eb2fd00
    • Nikolay Borisov's avatar
      btrfs: remove inode argument from btrfs_start_ordered_extent · c0a43603
      Nikolay Borisov authored
      The passed in ordered_extent struct is always well-formed and contains
      the inode making the explicit argument redundant.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c0a43603
    • Nikolay Borisov's avatar
      btrfs: remove inode argument from add_pending_csums · 510f85ed
      Nikolay Borisov authored
      It's used to reference the csum root which can be done from the trans
      handle as well. Simplify the signature and while at it also remove the
      noinline attribute as the function uses only at most 16 bytes of stack
      space.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      510f85ed
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
      btrfs: open code extent_read_full_page to its sole caller · 0f208812
      Nikolay Borisov authored
      This makes reading the code a tad easier by decreasing the level of
      indirection by one.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0f208812
    • Nikolay Borisov's avatar
      btrfs: sink mirror_num argument in __do_readpage · fd513000
      Nikolay Borisov authored
      It's always set to 0 by the 2 callers so move it inside __do_readpage.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd513000
    • Nikolay Borisov's avatar
      btrfs: sink read_flags argument into extent_read_full_page · 6f15af60
      Nikolay Borisov authored
      It's always set to 0 by its sole caller - btrfs_readpage. Simply remove
      it.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6f15af60
    • Nikolay Borisov's avatar
      btrfs: sink mirror_num argument in extent_read_full_page · 003c286a
      Nikolay Borisov authored
      It's always set to 0 from the sole caller - btrfs_readpage.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      003c286a
    • Nikolay Borisov's avatar
      btrfs: promote extent_read_full_page to btrfs_readpage · c1be9c1a
      Nikolay Borisov authored
      Now that btrfs_readpage is the only caller of extent_read_full_page the
      latter can be open coded in the former. Use the occassion to rename
      __extent_read_full_page to extent_read_full_page. To facillitate this
      change submit_one_bio has to be exported as well.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c1be9c1a
    • Nikolay Borisov's avatar
      btrfs: remove mirror_num argument from extent_read_full_page · 72cffee4
      Nikolay Borisov authored
      It's called only from btrfs_readpage which always passes 0 so just sink
      the argument into extent_read_full_page.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      72cffee4
    • Nikolay Borisov's avatar
      btrfs: remove btrfs_get_extent indirection from __do_readpage · 1a5ee1e6
      Nikolay Borisov authored
      Now that this function is only responsible for reading data pages it's
      no longer necessary to pass get_extent_t parameter across several
      layers of functions. This patch removes this parameter from multiple
      functions: __get_extent_map/__do_readpage/__extent_read_full_page/
      extent_read_full_page and simply calls btrfs_get_extent directly in
      __get_extent_map.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1a5ee1e6