1. 25 Feb, 2019 40 commits
    • Btrfs: fix fsync after succession of renames of different files · 6b5fc433
      Filipe Manana authored
      After a succession of rename operations of different files and fsyncing
      one of them, such that each file gets a new name that corresponds to an
      old name of another file, we can end up with a log that will cause a
      failure when we attempt to replay it at mount time (an EEXIST error).
      We currently have correct behaviour when such succession of renames
      involves only two files, but if there are more files involved, we end up
      not logging all the inodes that are needed, therefore resulting in a
      failure when attempting to replay the log.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir
        $ touch /mnt/testdir/fname1
        $ touch /mnt/testdir/fname2
      
        $ sync
      
        $ mv /mnt/testdir/fname1 /mnt/testdir/fname3
        $ mv /mnt/testdir/fname2 /mnt/testdir/fname4
        $ ln /mnt/testdir/fname3 /mnt/testdir/fname2
      
        $ touch /mnt/testdir/fname1
        $ xfs_io -c "fsync" /mnt/testdir/fname1
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        mount: mount /dev/sdb on /mnt failed: File exists
      
      So fix this by checking all inode dependencies when logging an inode. That
      is, if one logged inode A has a new name that matches the old name of some
      other inode B, check if inode B has a new name that matches the old name
      of some other inode C, and so on. This fix is implemented not by doing any
      recursive function calls but by using an iterative method using a linked
      list that is used in a first-in-first-out fashion.
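
The iterative walk described above can be sketched as follows. This is a minimal user-space model, not the kernel implementation: `struct toy_inode`, `toy_takes_old_name_of()` and `toy_log_inode_and_deps()` are hypothetical names, and each inode is assumed to carry at most one old name and two new names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_NAMES 2

/* Hypothetical stand-in for an inode and the names it had and has. */
struct toy_inode {
    int ino;
    const char *old_name;                 /* name in the last committed state, or NULL */
    const char *new_names[TOY_MAX_NAMES]; /* names in the current transaction */
    int logged;
    struct toy_inode *next;               /* link for the FIFO list */
};

/* Nonzero if one of a's new names equals b's old name. */
static int toy_takes_old_name_of(const struct toy_inode *a,
                                 const struct toy_inode *b)
{
    if (!b->old_name)
        return 0;
    for (int i = 0; i < TOY_MAX_NAMES; i++)
        if (a->new_names[i] && strcmp(a->new_names[i], b->old_name) == 0)
            return 1;
    return 0;
}

/*
 * Log inodes[start] and, without recursion, every inode it transitively
 * depends on. Inode numbers are written to order[] (room for n entries)
 * in the order they are taken from the FIFO list. Returns the count.
 */
size_t toy_log_inode_and_deps(struct toy_inode *inodes, size_t n,
                              size_t start, int *order)
{
    struct toy_inode *head, *tail;
    size_t count = 0;

    assert(start < n);
    head = tail = &inodes[start];
    head->logged = 1;
    head->next = NULL;

    while (head) {
        struct toy_inode *cur = head;

        order[count++] = cur->ino;
        for (size_t i = 0; i < n; i++) {
            struct toy_inode *other = &inodes[i];

            if (other->logged || !toy_takes_old_name_of(cur, other))
                continue;
            other->logged = 1;      /* append to the tail: first in, first out */
            other->next = NULL;
            tail->next = other;
            tail = other;
        }
        head = cur->next;
    }
    return count;
}
```

With the three inodes from the example (the new fname1, the inode renamed from fname1 to fname3 and linked as fname2, and the inode renamed from fname2 to fname4), starting at the new fname1 pulls in both other inodes.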
      
      A test case for fstests follows soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: honor path->skip_locking in backref code · 38e3eebf
      Josef Bacik authored
      Qgroups will do the old roots lookup at delayed ref time, which could be
      while walking down the extent root while running a delayed ref.  This
      should be fine, except we specifically lock eb's in the backref walking
      code irrespective of path->skip_locking, which deadlocks the system.
      Fix up the backref code to honor path->skip_locking; nobody will be
      modifying the commit_root while we're searching, so it's completely safe
      to do.
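
As a rough illustration of what honoring the flag means, the sketch below takes the read lock only when the path does not ask to skip locking. `struct toy_path`, `struct toy_eb` and the helper names are hypothetical stand-ins, not the kernel structures or functions.

```c
#include <assert.h>

struct toy_eb { int read_locks; };      /* stand-in for an extent buffer */
struct toy_path { int skip_locking; };  /* stand-in for the search path */

static void toy_tree_read_lock(struct toy_eb *eb)
{
    eb->read_locks++;
}

static void toy_tree_read_unlock(struct toy_eb *eb)
{
    assert(eb->read_locks > 0);
    eb->read_locks--;
}

/*
 * Inspect an extent buffer during a backref walk. Returns how many read
 * locks were held on eb while it was being inspected.
 */
int toy_inspect_eb(const struct toy_path *path, struct toy_eb *eb)
{
    int held;

    if (!path->skip_locking)
        toy_tree_read_lock(eb);
    held = eb->read_locks;          /* the actual inspection would go here */
    if (!path->skip_locking)
        toy_tree_read_unlock(eb);
    return held;
}
```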
      
      Since fb235dc0 ("btrfs: qgroup: Move half of the qgroup accounting time
      out of commit trans"), the kernel may lock up with quota enabled.
      
      One such backref walk is triggered by snapshot dropping along with a
      write operation in the source subvolume.  The hang can be reliably
      reproduced:
      
        btrfs-cleaner   D    0  4062      2 0x80000000
        Call Trace:
         schedule+0x32/0x90
         btrfs_tree_read_lock+0x93/0x130 [btrfs]
         find_parent_nodes+0x29b/0x1170 [btrfs]
         btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
         btrfs_find_all_roots+0x57/0x70 [btrfs]
         btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
         btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
         btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
         do_walk_down+0x541/0x5e3 [btrfs]
         walk_down_tree+0xab/0xe7 [btrfs]
         btrfs_drop_snapshot+0x356/0x71a [btrfs]
         btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
         cleaner_kthread+0x12b/0x160 [btrfs]
         kthread+0x112/0x130
         ret_from_fork+0x27/0x50
      
      When dropping snapshots with qgroup enabled, we will trigger a backref
      walk.

      However, a backref walk at that point is dangerous: if one of the
      parent nodes gets WRITE locked by another thread, we could cause a
      deadlock.
      
      For example:
      
                 FS 260     FS 261 (Dropped)
                  node A        node B
                 /      \      /      \
             node C      node D      node E
            /   \         /  \        /     \
        leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
      
      The lock sequence would be:
      
            Thread A (cleaner)             |       Thread B (other writer)
      -----------------------------------------------------------------------
      write_lock(B)                        |
      write_lock(D)                        |
      ^^^ called by walk_down_tree()       |
                                           |       write_lock(A)
                                           |       write_lock(D) << Stall
      read_lock(H) << for backref walk     |
      read_lock(D) << lock owner is        |
                      the same thread A    |
                      so read lock is OK   |
      read_lock(A) << Stall                |
      
      So thread A holds write lock D, and needs read lock A before it can unlock.
      Meanwhile thread B holds write lock A, and needs lock D before it can unlock.
      
      This will cause a deadlock.
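
The cycle in the lock sequence above can be checked mechanically with a wait-for graph. The sketch below is a toy model under the assumption that each thread waits on at most one lock owner; `toy_has_deadlock()` and `TOY_NO_THREAD` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NO_THREAD (-1)

/*
 * waits_for[t] is the index of the thread that owns the lock thread t is
 * blocked on, or TOY_NO_THREAD if t is not blocked. Returns 1 if some
 * thread transitively waits on itself.
 */
int toy_has_deadlock(const int *waits_for, size_t nthreads)
{
    for (size_t start = 0; start < nthreads; start++) {
        int cur = waits_for[start];

        /* A cycle through start must close within nthreads hops. */
        for (size_t hops = 0; hops < nthreads && cur != TOY_NO_THREAD; hops++) {
            assert(cur >= 0 && (size_t)cur < nthreads);
            if ((size_t)cur == start)
                return 1;
            cur = waits_for[cur];
        }
    }
    return 0;
}
```

With thread A as index 0 and thread B as index 1, A waits for node A's lock (owned by B) and B waits for node D's lock (owned by A), so the table is `{1, 0}` and the check reports a cycle.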
      
      This is not limited to the snapshot dropping case.  The backref walk,
      even if it only happens on commit trees, breaks the normal top-down
      locking order, which makes it deadlock prone.
      
      Fixes: fb235dc0 ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
      CC: stable@vger.kernel.org # 4.14+
      Reported-and-tested-by: David Sterba <dsterba@suse.com>
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      [ rebase to latest branch and fix lock assert bug in btrfs/007 ]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ copy logs and deadlock analysis from Qu's patch ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Make qgroup async transaction commit more aggressive · f5fef459
      Qu Wenruo authored
      [BUG]
      Btrfs qgroup will still hit EDQUOT in the following case:
      
        $ dev=/dev/test/test
        $ mnt=/mnt/btrfs
        $ umount $mnt &> /dev/null
        $ umount $dev &> /dev/null
      
        $ mkfs.btrfs -f $dev
        $ mount $dev $mnt -o nospace_cache
      
        $ btrfs subv create $mnt/subv
        $ btrfs quota enable $mnt
        $ btrfs quota rescan -w $mnt
        $ btrfs qgroup limit -e 1G $mnt/subv
      
        $ fallocate -l 900M $mnt/subv/padding
        $ sync
      
        $ rm $mnt/subv/padding
      
        # Hit EDQUOT
        $ xfs_io -f -c "pwrite 0 512M" $mnt/subv/real_file
      
      [CAUSE]
      Since commit a514d638 ("btrfs: qgroup: Commit transaction in advance
      to reduce early EDQUOT"), btrfs no longer forces a transaction commit to
      reclaim more quota space.

      Instead, we just check the pertrans metadata reservation against some
      threshold and try to do an asynchronous transaction commit.

      However, in the above case the pertrans metadata reservation is pretty
      small, thus it will never trigger an asynchronous transaction commit.
      
      [FIX]
      Instead of only accounting pertrans metadata reservation, we calculate
      how much free space we have, and if there isn't much free space left,
      commit transaction asynchronously to try to free some space.
      
      This may slow down the fs when we have less than 32M free qgroup space,
      but should reduce a lot of false EDQUOT, so the cost should be
      acceptable.
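
A minimal sketch of the free-space check, assuming the 32M figure mentioned above is the threshold and that free space is the limit minus accounted and reserved bytes. The function and macro names are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* 32M, taken from the changelog text; the macro name is an assumption. */
#define TOY_QGROUP_FREE_THRESHOLD (32ULL * 1024 * 1024)

/*
 * Returns 1 if the qgroup is close enough to its limit that an
 * asynchronous transaction commit should be attempted, 0 otherwise.
 */
int toy_qgroup_should_async_commit(uint64_t limit, uint64_t used,
                                   uint64_t reserved)
{
    uint64_t taken, free_bytes;

    assert(used <= UINT64_MAX - reserved);   /* toy model: no overflow */
    taken = used + reserved;
    free_bytes = taken >= limit ? 0 : limit - taken;
    return free_bytes < TOY_QGROUP_FREE_THRESHOLD;
}
```

In the reproducer, the deleted 900M padding file still counts against the 1G limit until a commit happens, so the check starts firing once roughly 92M of new data has been reserved.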
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record · 1418bae1
      Qu Wenruo authored
      [BUG]
      Btrfs/139 will fail with a high probability if the testing machine (VM)
      has only 2G RAM.

      The symptom is that the final write succeeds when it should fail with
      EDQUOT, and the fs ends up exceeding the quota limit by 16K.
      
      The simplified reproducer will be: (needs a 2G ram VM)
      
        $ mkfs.btrfs -f $dev
        $ mount $dev $mnt
      
        $ btrfs subv create $mnt/subv
        $ btrfs quota enable $mnt
        $ btrfs quota rescan -w $mnt
        $ btrfs qgroup limit -e 1G $mnt/subv
      
        $ for i in $(seq -w  1 8); do
        	xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
        	echo "file $i written" > /dev/kmsg
          done
        $ sync
        $ btrfs qgroup show -pcre --raw $mnt
      
      The last pwrite will not trigger EDQUOT and final 'qgroup show' will
      show something like:
      
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5             16384        16384         none         none ---     ---
        0/256      1073758208   1073758208         none   1073741824 ---     ---
      
      And 1073758208 is larger than 1073741824.
      
      [CAUSE]
      It's a bug in btrfs qgroup data reserved space management.
      
      For quota limit, we must ensure that:
        reserved (data + metadata) + rfer/excl <= limit
      
      Since rfer/excl is only updated at transaction commit time, reserved
      space needs special care.
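
The invariant can be written down as a small check. This is only a sketch of the arithmetic; the function names are hypothetical and the kernel tracks these values in its own structures.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if reserved (data + metadata) + accounted bytes fit the limit. */
int toy_qgroup_within_limit(uint64_t reserved_data, uint64_t reserved_meta,
                            uint64_t accounted, uint64_t limit)
{
    assert(reserved_data <= UINT64_MAX - reserved_meta);
    assert(reserved_data + reserved_meta <= UINT64_MAX - accounted);
    return reserved_data + reserved_meta + accounted <= limit;
}

/* Number of bytes by which the accounted usage exceeds the limit. */
uint64_t toy_qgroup_excess(uint64_t accounted, uint64_t limit)
{
    return accounted > limit ? accounted - limit : 0;
}
```

Applied to the 'qgroup show' output above, an excl value of 1073758208 against a max_excl of 1073741824 gives an excess of 16384 bytes, which is the 16K mentioned in the bug description.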
      
      One important part of reserved space is data, and for a new data extent
      written to disk, we still need to hold the reserved space until the
      rfer/excl numbers get updated.
      
      Originally when an ordered extent finishes, we migrate the reserved
      qgroup data space from the extent_io tree to the delayed ref head of the
      data extent, expecting that the delayed ref will only be cleaned up at
      transaction commit time.

      However, on a machine with little RAM, memory pressure can cause dirty
      pages to be flushed back to disk without committing a transaction.
      
      The related events will be something like:
      
        file 1 written
        btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
        btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
        btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
        btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
        btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
        cleanup_ref_head: num_bytes=54947840
        cleanup_ref_head: num_bytes=5636096
        cleanup_ref_head: num_bytes=569344
        cleanup_ref_head: num_bytes=57344
        cleanup_ref_head: num_bytes=8192
        ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
        file 2 written
        ...
        file 8 written
        cleanup_ref_head: num_bytes=8192
        ...
        btrfs_commit_transaction  <<< the only transaction committed during
      				the test
      
      When file 2 is written, we have already freed 128M reserved qgroup data
      space for ino 258. Thus later writes won't trigger EDQUOT.

      This allows us to write more data beyond the qgroup limit.

      In my 2G RAM VM, it could reach about 1.2G before hitting EDQUOT.
      
      [FIX]
      By moving reserved qgroup data space from btrfs_delayed_ref_head to
      btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
      space won't be freed halfway before the transaction commit, thus fixing
      the problem.
      
      Fixes: f64d5ca8 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove unused nocow worker pointer · 0ea82076
      David Sterba authored
      The member btrfs_fs_info::scrub_nocow_workers is unused since the nocow
      optimization was removed from scrub in 9bebe665 ("btrfs: scrub:
      Remove unused copy_nocow_pages and its callchain").
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: add assertions for worker pointers · c8352942
      David Sterba authored
      The scrub worker pointers are not NULL iff the scrub is running, so
      reset them back once the last reference is dropped. Add assertions to
      the initial phase of scrub to verify that.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: convert scrub_workers_refcnt to refcount_t · ff09c4ca
      Anand Jain authored
      Use the refcount_t for fs_info::scrub_workers_refcnt instead of int so
      we get the extra checks. All reference changes are still done under
      scrub_lock.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: add scrub_lock lockdep check in scrub_workers_get · eb4318e5
      Anand Jain authored
      scrub_workers_refcnt is protected by scrub_lock, add lockdep_assert_held()
      in scrub_workers_get().
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Suggested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: fix circular locking dependency warning · 1cec3f27
      Anand Jain authored
      This fixes a longstanding lockdep warning triggered by
      fstests/btrfs/011.
      
      The circular locking dependency check reports warning[1] because
      btrfs_scrub_dev() calls the stack #0 below with fs_info::scrub_lock
      held. The test case leading to this warning:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /btrfs
        $ btrfs scrub start -B /btrfs
      
      In fact we have fs_info::scrub_workers_refcnt to track if the init and destroy
      of the scrub workers are needed. So once we have incremented and decremented
      the fs_info::scrub_workers_refcnt value in the thread, it's ok to drop the
      scrub_lock, and then actually do the btrfs_destroy_workqueue() part. So this
      patch drops the scrub_lock before calling btrfs_destroy_workqueue().
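
The pattern can be sketched in user space as below: the reference count is changed under the lock, and the workqueue pointer is detached under the lock but destroyed only after the lock is released. All `toy_` names are hypothetical, and the lock is modelled as a plain flag so the sketch can record whether the destroy ran with it held.

```c
#include <assert.h>
#include <stddef.h>

struct toy_workqueue { int destroyed; };

struct toy_fs_info {
    int scrub_lock_held;                  /* toy stand-in for scrub_lock */
    unsigned int scrub_workers_refcnt;
    struct toy_workqueue *scrub_workers;
    int destroyed_under_lock;             /* set if the unsafe ordering occurs */
};

static void toy_lock(struct toy_fs_info *fs)
{
    assert(!fs->scrub_lock_held);
    fs->scrub_lock_held = 1;
}

static void toy_unlock(struct toy_fs_info *fs)
{
    assert(fs->scrub_lock_held);
    fs->scrub_lock_held = 0;
}

static void toy_destroy_workqueue(struct toy_fs_info *fs,
                                  struct toy_workqueue *wq)
{
    if (!wq)
        return;
    if (fs->scrub_lock_held)
        fs->destroyed_under_lock = 1;
    wq->destroyed = 1;
}

/* Drop one reference; destroy the workers outside the lock on the last put. */
void toy_scrub_workers_put(struct toy_fs_info *fs)
{
    struct toy_workqueue *to_destroy = NULL;

    toy_lock(fs);
    assert(fs->scrub_workers_refcnt > 0);
    if (--fs->scrub_workers_refcnt == 0) {
        to_destroy = fs->scrub_workers;   /* detach while still locked */
        fs->scrub_workers = NULL;
    }
    toy_unlock(fs);

    toy_destroy_workqueue(fs, to_destroy);
}
```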
      
        [359.258534] ======================================================
        [359.260305] WARNING: possible circular locking dependency detected
        [359.261938] 5.0.0-rc6-default #461 Not tainted
        [359.263135] ------------------------------------------------------
        [359.264672] btrfs/20975 is trying to acquire lock:
        [359.265927] 00000000d4d32bea ((wq_completion)"%s-%s""btrfs", name){+.+.}, at: flush_workqueue+0x87/0x540
        [359.268416]
        [359.268416] but task is already holding lock:
        [359.270061] 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
        [359.272418]
        [359.272418] which lock already depends on the new lock.
        [359.272418]
        [359.274692]
        [359.274692] the existing dependency chain (in reverse order) is:
        [359.276671]
        [359.276671] -> #3 (&fs_info->scrub_lock){+.+.}:
        [359.278187]        __mutex_lock+0x86/0x9c0
        [359.279086]        btrfs_scrub_pause+0x31/0x100 [btrfs]
        [359.280421]        btrfs_commit_transaction+0x1e4/0x9e0 [btrfs]
        [359.281931]        close_ctree+0x30b/0x350 [btrfs]
        [359.283208]        generic_shutdown_super+0x64/0x100
        [359.284516]        kill_anon_super+0x14/0x30
        [359.285658]        btrfs_kill_super+0x12/0xa0 [btrfs]
        [359.286964]        deactivate_locked_super+0x29/0x60
        [359.288242]        cleanup_mnt+0x3b/0x70
        [359.289310]        task_work_run+0x98/0xc0
        [359.290428]        exit_to_usermode_loop+0x83/0x90
        [359.291445]        do_syscall_64+0x15b/0x180
        [359.292598]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.294011]
        [359.294011] -> #2 (sb_internal#2){.+.+}:
        [359.295432]        __sb_start_write+0x113/0x1d0
        [359.296394]        start_transaction+0x369/0x500 [btrfs]
        [359.297471]        btrfs_finish_ordered_io+0x2aa/0x7c0 [btrfs]
        [359.298629]        normal_work_helper+0xcd/0x530 [btrfs]
        [359.299698]        process_one_work+0x246/0x610
        [359.300898]        worker_thread+0x3c/0x390
        [359.302020]        kthread+0x116/0x130
        [359.303053]        ret_from_fork+0x24/0x30
        [359.304152]
        [359.304152] -> #1 ((work_completion)(&work->normal_work)){+.+.}:
        [359.306100]        process_one_work+0x21f/0x610
        [359.307302]        worker_thread+0x3c/0x390
        [359.308465]        kthread+0x116/0x130
        [359.309357]        ret_from_fork+0x24/0x30
        [359.310229]
        [359.310229] -> #0 ((wq_completion)"%s-%s""btrfs", name){+.+.}:
        [359.311812]        lock_acquire+0x90/0x180
        [359.312929]        flush_workqueue+0xaa/0x540
        [359.313845]        drain_workqueue+0xa1/0x180
        [359.314761]        destroy_workqueue+0x17/0x240
        [359.315754]        btrfs_destroy_workqueue+0x57/0x200 [btrfs]
        [359.317245]        scrub_workers_put+0x2c/0x60 [btrfs]
        [359.318585]        btrfs_scrub_dev+0x336/0x590 [btrfs]
        [359.319944]        btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
        [359.321622]        btrfs_ioctl+0x28a4/0x2e40 [btrfs]
        [359.322908]        do_vfs_ioctl+0xa2/0x6d0
        [359.324021]        ksys_ioctl+0x3a/0x70
        [359.325066]        __x64_sys_ioctl+0x16/0x20
        [359.326236]        do_syscall_64+0x54/0x180
        [359.327379]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.328772]
        [359.328772] other info that might help us debug this:
        [359.328772]
        [359.330990] Chain exists of:
        [359.330990]   (wq_completion)"%s-%s""btrfs", name --> sb_internal#2 --> &fs_info->scrub_lock
        [359.330990]
        [359.334376]  Possible unsafe locking scenario:
        [359.334376]
        [359.336020]        CPU0                    CPU1
        [359.337070]        ----                    ----
        [359.337821]   lock(&fs_info->scrub_lock);
        [359.338506]                                lock(sb_internal#2);
        [359.339506]                                lock(&fs_info->scrub_lock);
        [359.341461]   lock((wq_completion)"%s-%s""btrfs", name);
        [359.342437]
        [359.342437]  *** DEADLOCK ***
        [359.342437]
        [359.343745] 1 lock held by btrfs/20975:
        [359.344788]  #0: 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
        [359.346778]
        [359.346778] stack backtrace:
        [359.347897] CPU: 0 PID: 20975 Comm: btrfs Not tainted 5.0.0-rc6-default #461
        [359.348983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
        [359.350501] Call Trace:
        [359.350931]  dump_stack+0x67/0x90
        [359.351676]  print_circular_bug.isra.37.cold.56+0x15c/0x195
        [359.353569]  check_prev_add.constprop.44+0x4f9/0x750
        [359.354849]  ? check_prev_add.constprop.44+0x286/0x750
        [359.356505]  __lock_acquire+0xb84/0xf10
        [359.357505]  lock_acquire+0x90/0x180
        [359.358271]  ? flush_workqueue+0x87/0x540
        [359.359098]  flush_workqueue+0xaa/0x540
        [359.359912]  ? flush_workqueue+0x87/0x540
        [359.360740]  ? drain_workqueue+0x1e/0x180
        [359.361565]  ? drain_workqueue+0xa1/0x180
        [359.362391]  drain_workqueue+0xa1/0x180
        [359.363193]  destroy_workqueue+0x17/0x240
        [359.364539]  btrfs_destroy_workqueue+0x57/0x200 [btrfs]
        [359.365673]  scrub_workers_put+0x2c/0x60 [btrfs]
        [359.366618]  btrfs_scrub_dev+0x336/0x590 [btrfs]
        [359.367594]  ? start_transaction+0xa1/0x500 [btrfs]
        [359.368679]  btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
        [359.369545]  btrfs_ioctl+0x28a4/0x2e40 [btrfs]
        [359.370186]  ? __lock_acquire+0x263/0xf10
        [359.370777]  ? kvm_clock_read+0x14/0x30
        [359.371392]  ? kvm_sched_clock_read+0x5/0x10
        [359.372248]  ? sched_clock+0x5/0x10
        [359.372786]  ? sched_clock_cpu+0xc/0xc0
        [359.373662]  ? do_vfs_ioctl+0xa2/0x6d0
        [359.374552]  do_vfs_ioctl+0xa2/0x6d0
        [359.375378]  ? do_sigaction+0xff/0x250
        [359.376233]  ksys_ioctl+0x3a/0x70
        [359.376954]  __x64_sys_ioctl+0x16/0x20
        [359.377772]  do_syscall_64+0x54/0x180
        [359.378841]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.380422] RIP: 0033:0x7f5429296a97
      
      Backporting to older kernels: scrub_nocow_workers must be freed the same
      way as the others.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      [ update changelog ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix comment its device list mutex not volume lock · 7faad6e2
      Anand Jain authored
      We have killed the volume mutex (commit dccdb07b "btrfs: kill
      btrfs_fs_info::volume_mutex"). This trivial one seems to have
      escaped.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: extent_io: Kill the forward declaration of flush_write_bio · bb58eb9e
      Qu Wenruo authored
      There is no need to forward declare flush_write_bio(), as it only
      depends on submit_one_bio().  Both of them are pretty small, just move
      them to kill the forward declaration.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Fix grossly misleading argument names in extent io search · 352646c7
      Nikolay Borisov authored
      The variables and function parameters of __etree_search which pertain to
      prev/next are grossly misnamed. Namely, prev_ret holds the next state
      and not the previous. Similarly, next_ret actually holds the previous
      extent state relating to the offset we are interested in. Fix this by
      renaming the variables as well as switching the arguments order. No
      functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove EXTENT_FIRST_DELALLOC bit · ba8f5206
      Nikolay Borisov authored
      With the refactoring introduced in 8b62f87b ("Btrfs: rework
      outstanding_extents") this flag became unused. Remove it and renumber
      the following flags accordingly. No functional changes.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use WARN_ON in a canonical form btrfs_remove_block_group · 9a0ec83d
      Nikolay Borisov authored
      There is no point in using a construct like 'if (!condition)
      WARN_ON(1)'. Use WARN_ON(!condition) directly. No functional changes.
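
The equivalence of the two forms can be shown with a user-space stand-in for the macro. `TOY_WARN_ON` is a hypothetical substitute that only counts warnings; like the kernel macro, it evaluates to the truth value of its argument.

```c
#include <assert.h>

static int toy_warnings;

static int toy_warn_on(int cond)
{
    if (cond)
        toy_warnings++;
    return cond;
}

#define TOY_WARN_ON(cond) toy_warn_on(!!(cond))

/* Old form: 'if (!condition) WARN_ON(1)'. Returns warnings emitted. */
int toy_check_old_form(int condition)
{
    int before = toy_warnings;

    if (!condition)
        TOY_WARN_ON(1);
    return toy_warnings - before;
}

/* Canonical form: 'WARN_ON(!condition)'. Returns warnings emitted. */
int toy_check_canonical(int condition)
{
    int before = toy_warnings;

    TOY_WARN_ON(!condition);
    return toy_warnings - before;
}

/* Both forms warn in exactly the same cases. */
void toy_demo_equivalence(void)
{
    assert(toy_check_old_form(0) == toy_check_canonical(0));
    assert(toy_check_old_form(1) == toy_check_canonical(1));
}
```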
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reserve extra space during evict · 260e7702
      Josef Bacik authored
      We could generate a lot of delayed refs in evict but never have any left
      over space from our block rsv to make up for that fact.  So reserve some
      extra space and give it to the transaction so it can be used to refill
      the delayed refs rsv every loop through the truncate path.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: be more explicit about allowed flush states · 8a1bbe1d
      Josef Bacik authored
      For FLUSH_LIMIT flushers we really can only allocate chunks and flush
      delayed inode items; everything else is problematic.  I added a bunch of
      new states and it led to weirdness in the FLUSH_LIMIT case because I
      forgot about how it worked.  So instead explicitly declare the states
      that are ok for flushing with FLUSH_LIMIT and use that for our state
      machine.  Then as we add new things that are safe we can just add them
      to this list.
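
A sketch of the explicit list, assuming the safe states are the two delayed inode item flushes and chunk allocation as the text above says. The enum values and function names are hypothetical and carry a `TOY_` prefix to mark that they are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

enum toy_flush_state {
    TOY_FLUSH_DELAYED_ITEMS_NR,
    TOY_FLUSH_DELAYED_ITEMS,
    TOY_FLUSH_DELALLOC,
    TOY_FLUSH_DELALLOC_WAIT,
    TOY_ALLOC_CHUNK,
    TOY_COMMIT_TRANS,
};

/* The only states a FLUSH_LIMIT flusher may run; extend as more become safe. */
static const enum toy_flush_state toy_limit_states[] = {
    TOY_FLUSH_DELAYED_ITEMS_NR,
    TOY_FLUSH_DELAYED_ITEMS,
    TOY_ALLOC_CHUNK,
};

#define TOY_NR_LIMIT_STATES \
    (sizeof(toy_limit_states) / sizeof(toy_limit_states[0]))

/* Nonzero if state is on the explicit FLUSH_LIMIT list. */
int toy_state_allowed_for_limit(enum toy_flush_state state)
{
    for (size_t i = 0; i < TOY_NR_LIMIT_STATES; i++)
        if (toy_limit_states[i] == state)
            return 1;
    return 0;
}

/* Copy the FLUSH_LIMIT plan, in order, into out[]. Returns entries written. */
size_t toy_limit_plan(enum toy_flush_state *out, size_t cap)
{
    size_t n = cap < TOY_NR_LIMIT_STATES ? cap : TOY_NR_LIMIT_STATES;

    assert(out != NULL || cap == 0);
    for (size_t i = 0; i < n; i++)
        out[i] = toy_limit_states[i];
    return n;
}
```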
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: loop in inode_rsv_refill · 5df11363
      Josef Bacik authored
      With severe fragmentation we can end up with our inode rsv size being
      huge during writeout, which would cause us to need to make very large
      metadata reservations.
      
      However we may not actually need that much once writeout is complete,
      because of the over-reservation for the worst case.
      
      So instead try to make our reservation, and if we couldn't make it
      re-calculate our new reservation size and try again.  If our reservation
      size doesn't change between tries then we know we are actually out of
      space and can error. Flushing that could have been running in parallel
      did not make any space.
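
The retry loop can be sketched as follows. This is a toy model: `struct toy_rsv`, `struct toy_space` and `toy_try_reserve()` are hypothetical, and a failed attempt is assumed to trigger flushing that may shrink the reservation size as writeout completes.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct toy_rsv { uint64_t size; uint64_t reserved; };

struct toy_space {
    uint64_t free_bytes;
    uint64_t shrink_on_flush;  /* how much a failed attempt shrinks rsv->size */
};

/* Bytes still missing from the reservation. */
static uint64_t toy_calc_refill_bytes(const struct toy_rsv *rsv)
{
    return rsv->size > rsv->reserved ? rsv->size - rsv->reserved : 0;
}

/* Reserve bytes if they fit; otherwise "flush", which may shrink the rsv. */
static int toy_try_reserve(struct toy_space *sp, struct toy_rsv *rsv,
                           uint64_t bytes)
{
    uint64_t shrink;

    if (bytes <= sp->free_bytes) {
        sp->free_bytes -= bytes;
        rsv->reserved += bytes;
        return 0;
    }
    shrink = sp->shrink_on_flush < rsv->size ? sp->shrink_on_flush : rsv->size;
    rsv->size -= shrink;
    return -ENOSPC;
}

/* Retry until the reservation succeeds or its size stops changing. */
int toy_inode_rsv_refill(struct toy_space *sp, struct toy_rsv *rsv)
{
    uint64_t num_bytes, last;

    assert(sp && rsv);
    num_bytes = toy_calc_refill_bytes(rsv);
    while (num_bytes) {
        if (toy_try_reserve(sp, rsv, num_bytes) == 0)
            return 0;
        last = num_bytes;
        num_bytes = toy_calc_refill_bytes(rsv);
        if (num_bytes == last)
            return -ENOSPC;   /* nothing changed: genuinely out of space */
    }
    return 0;
}
```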
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      [ rename to calc_refill_bytes, update comment and changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't enospc all tickets on flush failure · f91587e4
      Josef Bacik authored
      With the introduction of the per-inode block_rsv it became possible to
      have really really large reservation requests made because of data
      fragmentation.  Since the ticket stuff assumed that we'd always have
      relatively small reservation requests it just killed all tickets if we
      were unable to satisfy the current request.
      
      However, this is generally not the case anymore.  So fix this logic to
      instead see if we had a ticket that we were able to give some
      reservation to, and if so, continue the flushing loop again.
      
      Likewise we make the tickets use the space_info_add_old_bytes() method
      of returning what reservation they did receive in hopes that it could
      satisfy reservations down the line.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't use global reserve for chunk allocation · 450114fc
      Josef Bacik authored
      We've done this forever because of the voodoo around knowing how much
      space we have.  However, we have better ways of doing this now, and on
      normal file systems we'll easily have a global reserve of 512MiB, and
      since metadata chunks are usually 1GiB that means we'll allocate
      metadata chunks more readily.  Instead use the actual used amount when
      determining if we need to allocate a chunk or not.
      
      This has a side effect for mixed block group fs'es where we are no
      longer allocating enough chunks for the data/metadata requirements.  To
      deal with this add an ALLOC_CHUNK_FORCE step to the flushing state
      machine.  This will only get used if we've already made a full loop
      through the flushing machinery and tried committing the transaction.
      
      If we have then we can try and force a chunk allocation since we likely
      need it to make progress.  This resolves issues I was seeing with
      the mixed bg tests in xfstests without the new flushing state.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      [ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: dump block_rsv details when dumping space info · b78e5616
      Josef Bacik authored
      For enospc_debug having the block rsvs is super helpful to see if we've
      done something wrong.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: check if there are free block groups for commit · d89dbefb
      Josef Bacik authored
      may_commit_transaction will skip committing the transaction if we don't
      have enough pinned space or if we're trying to find space for a SYSTEM
      chunk.  However, if we have pending free block groups in this transaction
      we still want to commit as we may be able to allocate a chunk to make
      our reservation.  So instead of just returning ENOSPC, check if we have
      free block groups pending, and if so commit the transaction to allow us
      to use that free space.
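
The decision described above can be summarized as a small truth table. The sketch below is an assumption about how the conditions combine, based only on this changelog; `toy_may_commit()` and its parameters are hypothetical and do not mirror the real function's signature.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Returns 0 if committing the transaction is worthwhile, -ENOSPC if not.
 * pending_free_bgs is the number of block groups waiting to be freed in
 * the running transaction.
 */
int toy_may_commit(uint64_t pinned, uint64_t needed, int for_system_chunk,
                   unsigned int pending_free_bgs)
{
    assert(needed > 0);

    /* Freed block groups may let us allocate a chunk after the commit. */
    if (pending_free_bgs > 0)
        return 0;
    if (for_system_chunk)
        return -ENOSPC;
    if (pinned < needed)
        return -ENOSPC;
    return 0;
}
```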
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add zstd compression level support · 3f93aef5
      Dennis Zhou authored
      Zstd compression requires different amounts of memory for each level of
      compression. The prior patches implemented indirection to allow for each
      compression type to manage their workspaces independently. This patch
      uses this indirection to implement compression level support for zstd.
      
      To manage the additional memory required, each compression level has its
      own queue of workspaces. A global LRU is used to help with reclaim.
      Reclaim is done via a timer which provides a mechanism to decrease
      memory utilization by keeping only workspaces around that are sized
      appropriately. Forward progress is guaranteed by a preallocated max
      workspace hidden from the LRU.
      
      When getting a workspace, it uses a bitmap to identify the levels that
      are populated and scans up. If it finds a workspace larger than the one
      requested, it uses it, but does not update the last_used time or the
      corresponding place in the LRU. If we hit memory pressure, we sleep on
      the max level workspace. We continue to rescan in case we can use a
      smaller workspace, but eventually should be able to obtain the max level
      workspace or allocate one again should memory pressure subside.
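
      The "scan up" step can be illustrated with a small userspace sketch. The
      bitmap layout (bit level - 1 set when an idle workspace of that level
      exists), the constant and the function name are assumptions made for
      this example, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_LEVEL 15

/*
 * Bit (level - 1) of active_map is set when at least one idle workspace
 * sized for that level exists. Returns the smallest populated level that
 * is >= the requested level, or 0 if no suitable workspace is idle.
 */
static unsigned int sketch_find_level(uint32_t active_map, unsigned int level)
{
	assert(level >= 1 && level <= SKETCH_MAX_LEVEL);

	for (unsigned int l = level; l <= SKETCH_MAX_LEVEL; l++) {
		if (active_map & (UINT32_C(1) << (l - 1)))
			return l;
	}
	return 0;
}
```

      A return value of 0 corresponds to the case where the caller would have
      to allocate a new workspace or wait for the preallocated max one.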
      
      The memory requirement for decompression is the same as for level 1, so
      decompression can use any available workspace.
      
      The number of workspaces is bounded above by the workqueue's limit,
      which is currently 2 (percpu limit). The reclaim timer is used to
      free inactive/improperly sized workspaces and is set to 307s to avoid
      colliding with transaction commit (every 30s).
      
      Repeating the experiment from v2 [1], the Silesia corpus was copied to a
      btrfs filesystem 10 times and then read back after dropping the caches.
      The btrfs filesystem was on an SSD.
      
      Level   Ratio   Compression (MB/s)  Decompression (MB/s)  Memory (KB)
      1       2.658        438.47                910.51            780
      2       2.744        364.86                886.55           1004
      3       2.801        336.33                828.41           1260
      4       2.858        286.71                886.55           1260
      5       2.916        212.77                556.84           1388
      6       2.363        119.82                990.85           1516
      7       3.000        154.06                849.30           1516
      8       3.011        159.54                875.03           1772
      9       3.025        100.51                940.15           1772
      10      3.033        118.97                616.26           1772
      11      3.036         94.19                802.11           1772
      12      3.037         73.45                931.49           1772
      13      3.041         55.17                835.26           2284
      14      3.087         44.70                716.78           2547
      15      3.126         37.30                878.84           2547
      
      [1] https://lore.kernel.org/linux-btrfs/20181031181108.289340-1-terrelln@fb.com/
      
      Cc: Nick Terrell <terrelln@fb.com>
      Cc: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Dennis Zhou's avatar
      btrfs: make zstd memory requirements monotonic · d3c6ab75
      Dennis Zhou authored
      It is possible based on the level configurations that a higher level
      workspace uses less memory than a lower level workspace. In order to
      reuse workspaces, this must be made a monotonic relationship. This
      precomputes the required memory for each level and enforces the
      monotonicity between level and memory required. This is also done
      in upstream zstd in [1].
      
      [1] https://github.com/facebook/zstd/commit/a68b76afefec6876f8e8a538155109a5aeac0143
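
      Enforcing monotonicity amounts to a running maximum over the per-level
      sizes. A minimal sketch, assuming the sizes have already been computed
      into an array indexed by level - 1 (the function name is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rewrites mem[] in place so that mem[i] >= mem[i - 1] for all i, i.e. a
 * workspace sized for a higher level is never smaller than one sized for
 * a lower level.
 */
static void sketch_make_monotonic(size_t *mem, size_t nr_levels)
{
	assert(mem != NULL || nr_levels == 0);

	size_t max_size = 0;

	for (size_t i = 0; i < nr_levels; i++) {
		if (mem[i] > max_size)
			max_size = mem[i];
		mem[i] = max_size;
	}
}
```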
      
      Cc: Nick Terrell <terrelln@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Dennis Zhou's avatar
      btrfs: zstd use the passed through level instead of default · e0dc87af
      Dennis Zhou authored
      Zstd currently only supports the default level of compression. This
      patch switches to using the level passed in for btrfs zstd
      configuration.
      
      Zstd workspaces now keep track of the requested level as this can differ
      from the size of the workspace.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Dennis Zhou's avatar
      btrfs: change set_level() to bound the level passed in · d0ab62ce
      Dennis Zhou authored
      Currently, the only user of set_level() is zlib which sets an internal
      workspace parameter. As level is now plumbed into get_workspace(), this
      can be handled there rather than separately.
      
      This repurposes set_level() to bound the level passed in so it can be
      used when setting the mount's compression level as well as verifying
      the level before getting a workspace. The other benefit is this divides
      the meaning of compress(0) and get_workspace(0). The former means we
      want to use the default compression level of the compression type. The
      latter means we can use any workspace available.
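
      Bounding a requested level could look roughly like this. The function
      name and the way the default and maximum are passed in are assumptions
      made for this sketch; each compression type would supply its own values.

```c
#include <assert.h>

/*
 * Level 0 means "use the default level of this compression type"; any
 * other value is clamped to the maximum level the type supports.
 */
static unsigned int sketch_bound_level(unsigned int level,
				       unsigned int default_level,
				       unsigned int max_level)
{
	assert(default_level >= 1 && default_level <= max_level);

	if (level == 0)
		return default_level;
	return level < max_level ? level : max_level;
}
```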
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Dennis Zhou's avatar
      btrfs: plumb level through the compression interface · 7bf49943
      Dennis Zhou authored
      Zlib compression supports multiple levels, but doesn't require changes
      in how a workspace itself is created and managed. Zstd introduces a
      different memory requirement such that higher levels of compression
      require more memory.
      
      This requires changes in how the alloc()/get() methods work for zstd.
      This patch plumbs compression level through the interface as a parameter
      in preparation for zstd compression levels.  This gives the compression
      types the opportunity to create/manage workspaces based on the compression level.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Dennis Zhou's avatar
      btrfs: move to function pointers for get/put workspaces · 92ee5530
      Dennis Zhou authored
      The previous patch added generic helpers for get_workspace() and
      put_workspace(). Now, we can migrate ownership of the workspace_manager
      to be in the compression type code as the compression code itself
      doesn't care beyond being able to get a workspace. The init/cleanup and
      get/put methods are abstracted so each compression algorithm can decide
      how they want to manage their workspaces.
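
      The idea of per-type get/put methods can be sketched in userspace C as a
      table of function pointers indexed by compression type. All names below
      are hypothetical and the workspaces are reduced to a single static
      object per type; the kernel structures are more involved.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_type { SKETCH_ZLIB, SKETCH_ZSTD, SKETCH_NR_TYPES };

struct sketch_workspace {
	enum sketch_type type;
	int in_use;
};

/* Each compression type provides its own way to hand out workspaces. */
struct sketch_ops {
	struct sketch_workspace *(*get_workspace)(void);
	void (*put_workspace)(struct sketch_workspace *ws);
};

static struct sketch_workspace sketch_zlib_ws = { SKETCH_ZLIB, 0 };
static struct sketch_workspace sketch_zstd_ws = { SKETCH_ZSTD, 0 };

static struct sketch_workspace *sketch_zlib_get(void)
{
	sketch_zlib_ws.in_use = 1;
	return &sketch_zlib_ws;
}

static struct sketch_workspace *sketch_zstd_get(void)
{
	sketch_zstd_ws.in_use = 1;
	return &sketch_zstd_ws;
}

static void sketch_generic_put(struct sketch_workspace *ws)
{
	ws->in_use = 0;
}

static const struct sketch_ops sketch_type_ops[SKETCH_NR_TYPES] = {
	[SKETCH_ZLIB] = { sketch_zlib_get, sketch_generic_put },
	[SKETCH_ZSTD] = { sketch_zstd_get, sketch_generic_put },
};

/* The compression code only calls these two and never looks inside. */
static struct sketch_workspace *sketch_get_workspace(enum sketch_type type)
{
	assert(type < SKETCH_NR_TYPES);
	return sketch_type_ops[type].get_workspace();
}

static void sketch_put_workspace(struct sketch_workspace *ws)
{
	assert(ws != NULL);
	sketch_type_ops[ws->type].put_workspace(ws);
}
```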
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Dennis Zhou's avatar
      btrfs: add compression interface in (get/put)_workspace · 929f4baf
      Dennis Zhou authored
      There are two levels of workspace management. First, alloc()/free(),
      which are responsible for actually creating and destroying workspaces.
      Second, at a higher level, get()/put(), which is the compression code
      asking for a workspace from a workspace_manager.
      
      The compression code shouldn't really care how it gets a workspace, but
      that it got a workspace. This adds get_workspace() and put_workspace()
      to be the higher level interface which is responsible for indexing into
      the appropriate compression type. It also introduces
      btrfs_put_workspace() and btrfs_get_workspace() to be the generic
      implementations of the higher interface.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Dennis Zhou's avatar
      btrfs: add helper methods for workspace manager init and cleanup · 1666edab
      Dennis Zhou authored
      Workspace manager init and cleanup code is open coded inside a for loop
      over the compression types. This forces each compression type to rely on
      the same workspace manager implementation. This patch creates helper
      methods that will be the generic implementation for btrfs workspace
      management.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Dennis Zhou's avatar
      btrfs: unify compression ops with workspace_manager · 10b94a51
      Dennis Zhou authored
      Make the workspace_manager own the interface operations rather than
      managing index-paired arrays for the workspace_manager and compression
      operations.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Dennis Zhou's avatar
      btrfs: manage heuristic workspace as index 0 · ca4ac360
      Dennis Zhou authored
      While the heuristic workspaces aren't really compression workspaces,
      they use the same interface for managing them. So rather than branching,
      let's just handle them once again as the index 0 compression type.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Dennis Zhou's avatar
      btrfs: rename workspaces_list to workspace_manager · acce85de
      Dennis Zhou authored
      This is in preparation for zstd compression levels. As each level will
      require a different workspace size, workspaces_list is no longer a
      fitting name.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Dennis Zhou's avatar
      btrfs: add helpers for compression type and level · 1972708a
      Dennis Zhou authored
      It is very easy to miss places that rely on a certain bitshifting for
      decoding the type_level overloading. Add helpers to do this instead.
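
      Such helpers might look like the sketch below. The bit layout (type in
      the low four bits, level in the next four) and the helper names are
      assumptions for illustration; the point is that callers never open-code
      the shifts and masks.

```c
#include <assert.h>

/* Assumed layout: bits 0-3 hold the type, bits 4-7 hold the level. */
static inline unsigned int sketch_compress_type(unsigned int type_level)
{
	return type_level & 0xF;
}

static inline unsigned int sketch_compress_level(unsigned int type_level)
{
	return (type_level & 0xF0) >> 4;
}

/* Inverse of the two decoders above. */
static inline unsigned int sketch_pack_type_level(unsigned int type,
						  unsigned int level)
{
	assert(type <= 0xF && level <= 0xF);
	return (level << 4) | type;
}
```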
      
      Cc: Omar Sandoval <osandov@osandov.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Anand Jain's avatar
      btrfs: introduce new ioctl to unregister a btrfs device · 228a73ab
      Anand Jain authored
      Add support for a new command that can be used eg. as
      
        $ btrfs device scan --forget [dev]
      (the final name may change though)
      
      to undo the effects of 'btrfs device scan [dev]'. For this purpose
      this patch proposes to use ioctl #5 as it was empty and is next to the
      SCAN ioctl.
      
      The new ioctl BTRFS_IOC_FORGET_DEV works only on the control device
      (/dev/btrfs-control) and unregisters one or all devices that are not
      mounted.
      
      The argument is struct btrfs_ioctl_vol_args, ::name specifies the device
      path. To unregister all devices, the path is an empty string.
      
      Again, the devices are removed only if they aren't part of a mounted
      filesystem.
      
      This new ioctl provides:
      
      - release of unwanted btrfs_fs_devices and btrfs_devices structures
        from memory if the device is not going to be mounted
      
      - ability to mount filesystem in degraded mode, when one device is
        corrupted like in split brain raid1
      
      - running test cases which would require reloading the kernel module
        but this is not possible eg. due to mounted filesystem or built-in
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Josef Bacik's avatar
      btrfs: replace cleaner_delayed_iput_mutex with a waitqueue · 034f784d
      Josef Bacik authored
      The throttle path doesn't take cleaner_delayed_iput_mutex, which means
      we could think we're done flushing iputs in the data space reservation
      path when we could have a throttler doing an iput.  There's no real
      reason to serialize the delayed iput flushing, so instead of taking the
      cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
      replace it with an atomic counter and a waitqueue.  This removes the
      short (or long depending on how big the inode is) window where we think
      there are no more pending iputs when there really are some.
      
      The waiting is killable as it could be indirectly called from user
      operations like fallocate or zero-range. Such call sites should handle
      the error but otherwise it's not necessary. Eg. flush_space just needs
      to attempt to make space by waiting on iputs.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      [ add killable comment and changelog parts ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Qu Wenruo's avatar
      btrfs: Output ENOSPC debug info in inc_block_group_ro · 3ece54e5
      Qu Wenruo authored
      Since inc_block_group_ro() can return -ENOSPC, outputting debug info
      when the enospc_debug mount option is set helps to debug false ENOSPC
      reports from balance.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Qu Wenruo's avatar
      btrfs: qgroup: Remove duplicated trace points for qgroup_rsv_add/release · c8f72b98
      Qu Wenruo authored
      Inside qgroup_rsv_add/release(), we have trace events
      trace_qgroup_update_reserve() to catch reserved space update.
      
      However we still have two manual trace_qgroup_update_reserve() calls
      just outside these functions.  Remove these duplicated calls.
      
      Fixes: 64ee4e75 ("btrfs: qgroup: Update trace events to use new separate rsv types")
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Anders Roxell's avatar
      btrfs: let the assertion expression compile in all configs · 2eec5f00
      Anders Roxell authored
      A compiler warning (in a patch in development) pointed to a variable
      that was used only inside an ASSERT:
      
        u64 root_objectid = root->root_key.objectid;
        ASSERT(root_objectid == ...);
      
        fs/btrfs/relocation.c: In function ‘insert_dirty_subv’:
        fs/btrfs/relocation.c:2138:6: warning: unused variable ‘root_objectid’ [-Wunused-variable]
          u64 root_objectid = root->root_key.objectid;
      	^~~~~~~~~~~~~
      
      When CONFIG_BTRFS_ASSERT isn't enabled, variable root_objectid isn't used.
      
      Rework the assertion helper by adding a runtime check instead of the
      '#ifdef CONFIG_BTRFS_ASSERT #else ...', so the compiler sees the
      condition being passed into an inline function after preprocessing.
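
      The approach can be sketched in userspace C as below. SKETCH_ASSERT,
      sketch_assfail() and SKETCH_ASSERT_ENABLED are hypothetical stand-ins
      (the last one plays the role of the kernel config option). Because the
      expression is always evaluated and handed to an inline function, a
      variable used only inside the assertion no longer triggers
      -Wunused-variable when assertions are disabled.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel config option; off by default. */
#ifndef SKETCH_ASSERT_ENABLED
#define SKETCH_ASSERT_ENABLED 0
#endif

static inline void sketch_assfail(const char *expr, const char *file, int line)
{
	/* Runtime check on a compile-time constant; the dead branch is dropped. */
	if (SKETCH_ASSERT_ENABLED) {
		fprintf(stderr, "assertion failed: %s, file: %s, line: %d\n",
			expr, file, line);
		abort();
	}
}

/* The expression is compiled in every configuration. */
#define SKETCH_ASSERT(expr) \
	((expr) ? (void)0 : sketch_assfail(#expr, __FILE__, __LINE__))

static unsigned long long sketch_check_root(unsigned long long objectid)
{
	/* Used only inside the assertion, yet never reported as unused. */
	unsigned long long root_objectid = objectid;

	SKETCH_ASSERT(root_objectid != 0);
	return objectid;
}
```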
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • David Sterba's avatar
      btrfs: merge btrfs_set_lock_blocking_rw with its caller · 766ece54
      David Sterba authored
      The last caller that does not have a fixed value of lock is
      btrfs_set_path_blocking, which actually does the same conditional switch
      on the lock type, so we can merge the branches together and remove the
      helper.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • David Sterba's avatar
      btrfs: simplify waiting loop in btrfs_tree_lock · 970e74d9
      David Sterba authored
      Currently, the number of readers and writers is checked and, in case
      there are any, we wait and redo the locks. There's some duplication
      before the branches go back to the again label, eg. calling wait_event on
      blocking_readers twice.
      
      The sequence is transformed to
      
      loop:
      * wait for readers
      * wait for writers
      * write_lock
      * check readers, unlock and wait for readers, loop
      * check writers, unlock and wait for writers, loop
      
      The new sequence is not exactly the same due to the simplification, for
      readers it's slightly faster. For the writers, original code does
      
      * wait for writers
      * (loop) wait for readers
      *        wait for writers -- again
      
      while the new goes directly to the reader check. This should behave the
      same on a contended lock with multiple writers and readers, but can
      reduce number of times we're waiting on something.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>