1. 13 Mar, 2019 3 commits
    • Filipe Manana's avatar
      Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes · 609e804d
      Filipe Manana authored
      When we are mixing buffered writes with direct IO writes against the same
      file and snapshotting is happening concurrently, we can end up with a
      corrupt file content in the snapshot. Example:
      
      1) Inode/file is empty.
      
      2) Snapshotting starts.
      
      2) Buffered write at offset 0 length 256Kb. This updates the i_size of the
         inode to 256Kb, disk_i_size remains zero. This happens after the task
         doing the snapshot flushes all existing delalloc.
      
      3) DIO write at offset 256Kb length 768Kb. Once the ordered extent
         completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
         updates the inode item in the fs tree with a size of 1Mb (which is
         the value of disk_i_size).
      
      4) The dealloc for the range [0, 256Kb[ did not start yet.
      
      5) The transaction used in the DIO ordered extent completion, which updated
         the inode item, is committed by the snapshotting task.
      
      6) Snapshot creation completes.
      
      7) Dealloc for the range [0, 256Kb[ is flushed.
      
      After that when reading the file from the snapshot we always get zeroes for
      the range [0, 256Kb[, the file has a size of 1Mb and the data written by
      the direct IO write is found. From an application's point of view this is
      a corruption, since in the source subvolume it could never read a version
      of the file that included the data from the direct IO write without the
      data from the buffered write included as well. In the snapshot's tree,
      file extent items are missing for the range [0, 256Kb[.
      
      The issue, obviously, does not happen when using the -o flushoncommit
      mount option.
      
      Fix this by flushing delalloc for all the roots that are about to be
      snapshotted when committing a transaction. This guarantees total ordering
      when updating the disk_i_size of an inode since the flush for dealloc is
      done when a transaction is in the TRANS_STATE_COMMIT_START state and wait
      is done once no more external writers exist. This is similar to what we
      do when using the flushoncommit mount option, but we do it only if the
      transaction has snapshots to create and only for the roots of the
      subvolumes to be snapshotted. The bulk of the dealloc is flushed in the
      snapshot creation ioctl, so the flush work we do inside the transaction
      is minimized.
      
      This issue, involving buffered and direct IO writes with snapshotting, is
      often triggered by fstest btrfs/078, and got reported by fsck when not
      using the NO_HOLES features, for example:
      
        $ cat results/btrfs/078.full
        (...)
        _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
        *** fsck.btrfs output ***
        [1/7] checking root items
        [2/7] checking extents
        [3/7] checking free space cache
        [4/7] checking fs roots
        root 258 inode 264 errors 100, file extent discount
        Found file extent holes:
              start: 524288, len: 65536
        ERROR: errors found in fs roots
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      609e804d
    • Josef Bacik's avatar
      btrfs: remove WARN_ON in log_dir_items · 2cc83342
      Josef Bacik authored
      When Filipe added the recursive directory logging stuff in
      2f2ff0ee ("Btrfs: fix metadata inconsistencies after directory
      fsync") he specifically didn't take the directory i_mutex for the
      children directories that we need to log because of lockdep.  This is
      generally fine, but can lead to this WARN_ON() tripping if we happen to
      run delayed deletion's in between our first search and our second search
      of dir_item/dir_indexes for this directory.  We expect this to happen,
      so the WARN_ON() isn't necessary.  Drop the WARN_ON() and add a comment
      so we know why this case can happen.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2cc83342
    • Filipe Manana's avatar
      Btrfs: fix incorrect file size after shrinking truncate and fsync · bf504110
      Filipe Manana authored
      If we do a shrinking truncate against an inode which is already present
      in the respective log tree and then rename it, as part of logging the new
      name we end up logging an inode item that reflects the old size of the
      file (the one which we previously logged) and not the new smaller size.
      The decision to preserve the size previously logged was added by commit
      1a4bcf47 ("Btrfs: fix fsync data loss after adding hard link to
      inode") in order to avoid data loss after replaying the log. However that
      decision is only needed for the case the logged inode size is smaller then
      the current size of the inode, as explained in that commit's change log.
      If the current size of the inode is smaller then the previously logged
      size, we know a shrinking truncate happened and therefore need to use
      that smaller size.
      
      Example to trigger the problem:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
        $ xfs_io -c "fsync" /mnt/foo
        $ xfs_io -c "truncate 3000" /mnt/foo
      
        $ mv /mnt/foo /mnt/bar
        $ xfs_io -c "fsync" /mnt/bar
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/bar
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        0008000
      
      Once we rename the file, we log its name (and inode item), and because
      the inode was already logged before in the current transaction, we log it
      with a size of 8000 bytes because that is the size we previously logged
      (with the first fsync). As part of the rename, besides logging the inode,
      we do also sync the log, which is done since commit d4682ba0
      ("Btrfs: sync log after logging new name"), so the next fsync against our
      inode is effectively a no-op, since no new changes happened since the
      rename operation. Even if did not sync the log during the rename
      operation, the same problem (fize size of 8000 bytes instead of 3000
      bytes) would be visible after replaying the log if the log ended up
      getting synced to disk through some other means, such as for example by
      fsyncing some other modified file. In the example above the fsync after
      the rename operation is there just because not every filesystem may
      guarantee logging/journalling the inode (and syncing the log/journal)
      during the rename operation, for example it is needed for f2fs, but not
      for ext4 and xfs.
      
      Fix this scenario by, when logging a new name (which is triggered by
      rename and link operations), using the current size of the inode instead
      of the previously logged inode size.
      
      A test case for fstests follows soon.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: default avatarSeulbae Kim <seulbae@gatech.edu>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf504110
  2. 27 Feb, 2019 6 commits
    • Dennis Zhou's avatar
      btrfs: zstd: ensure reclaim timer is properly cleaned up · d3865159
      Dennis Zhou authored
      The timer function, zstd_reclaim_timer_fn(), reschedules itself under
      certain conditions. When cleaning up, take the lock and remove all
      workspaces. This prevents the timer from rearming itself. Lastly, switch
      to del_timer_sync() to ensure that the timer function can't trigger as
      we're unloading.
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d3865159
    • David Sterba's avatar
      btrfs: move ulist allocation out of transaction in quota enable · 7503b83d
      David Sterba authored
      The allocation happens with GFP_KERNEL after a transaction has been
      started, this can potentially cause deadlock if reclaim tries to get the
      memory by flushing filesystem data.
      
      The fs_info::qgroup_ulist is not used during transaction start when
      quotas are not enabled. The status bit BTRFS_FS_QUOTA_ENABLED is set
      later in btrfs_quota_enable so it's safe to move it before the
      transaction start.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7503b83d
    • Josef Bacik's avatar
      btrfs: save drop_progress if we drop refs at all · aea6f028
      Josef Bacik authored
      Previously we only updated the drop_progress key if we were in the
      DROP_REFERENCE stage of snapshot deletion.  This is because the
      UPDATE_BACKREF stage checks the flags of the blocks it's converting to
      FULL_BACKREF, so if we go over a block we processed before it doesn't
      matter, we just don't do anything.
      
      The problem is in do_walk_down() we will go ahead and drop the roots
      reference to any blocks that we know we won't need to walk into.
      
      Given subvolume A and snapshot B.  The root of B points to all of the
      nodes that belong to A, so all of those nodes have a refcnt > 1.  If B
      did not modify those blocks it'll hit this condition in do_walk_down
      
      if (!wc->update_ref ||
          generation <= root->root_key.offset)
      	goto skip;
      
      and in "goto skip" we simply do a btrfs_free_extent() for that bytenr
      that we point at.
      
      Now assume we modified some data in B, and then took a snapshot of B and
      call it C.  C points to all the nodes in B, making every node the root
      of B points to have a refcnt > 1.  This assumes the root level is 2 or
      higher.
      
      We delete snapshot B, which does the above work in do_walk_down,
      free'ing our ref for nodes we share with A that we didn't modify.  Now
      we hit a node we _did_ modify, thus we own.  We need to walk down into
      this node and we set wc->stage == UPDATE_BACKREF.  We walk down to level
      0 which we also own because we modified data.  We can't walk any further
      down and thus now need to walk up and start the next part of the
      deletion.  Now walk_up_proc is supposed to put us back into
      DROP_REFERENCE, but there's an exception to this
      
      if (level < wc->shared_level)
      	goto out;
      
      we are at level == 0, and our shared_level == 1.  We skip out of this
      one and go up to level 1.  Since path->slots[1] < nritems we
      path->slots[1]++ and break out of walk_up_tree to stop our transaction
      and loop back around.  Now in btrfs_drop_snapshot we have this snippet
      
      if (wc->stage == DROP_REFERENCE) {
      	level = wc->level;
      	btrfs_node_key(path->nodes[level],
      		       &root_item->drop_progress,
      		       path->slots[level]);
      	root_item->drop_level = level;
      }
      
      our stage == UPDATE_BACKREF still, so we don't update the drop_progress
      key.  This is a problem because we would have done btrfs_free_extent()
      for the nodes leading up to our current position.  If we crash or
      unmount here and go to remount we'll start over where we were before and
      try to free our ref for blocks we've already freed, and thus abort()
      out.
      
      Fix this by keeping track of the last place we dropped a reference for
      our block in do_walk_down.  Then if wc->stage == UPDATE_BACKREF we know
      we'll start over from a place we meant to, and otherwise things continue
      to work as they did before.
      
      I have a complicated reproducer for this problem, without this patch
      we'll fail to fsck the fs when replaying the log writes log.  With this
      patch we can replay the whole log without any fsck or mount failures.
      
      The steps to reproduce this easily are sort of tricky, I had to add a
      couple of debug patches to the kernel in order to make it easy,
      basically I just needed to make sure we did actually commit the
      transaction every time we finished a walk_down_tree/walk_up_tree combo.
      
      The reproducer:
      
      1) Creates a base subvolume.
      2) Creates 100k files in the subvolume.
      3) Snapshots the base subvolume (snap1).
      4) Touches files 5000-6000 in snap1.
      5) Snapshots snap1 (snap2).
      6) Deletes snap1.
      
      I do this with dm-log-writes, and then replay to every FUA in the log
      and fsck the fs.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ copy reproducer steps ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aea6f028
    • Josef Bacik's avatar
      btrfs: check for refs on snapshot delete resume · 78c52d9e
      Josef Bacik authored
      There's a bug in snapshot deletion where we won't update the
      drop_progress key if we're in the UPDATE_BACKREF stage.  This is a
      problem because we could drop refs for blocks we know don't belong to
      ours.  If we crash or umount at the right time we could experience
      messages such as the following when snapshot deletion resumes
      
       BTRFS error (device dm-3): unable to find ref byte nr 66797568 parent 0 root 258  owner 1 offset 0
       ------------[ cut here ]------------
       WARNING: CPU: 3 PID: 16052 at fs/btrfs/extent-tree.c:7108 __btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
       CPU: 3 PID: 16052 Comm: umount Tainted: G        W  OE     5.0.0-rc4+ #147
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
       RIP: 0010:__btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
       RSP: 0018:ffffc90005cd7b18 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
       RDX: ffff88842fade680 RSI: ffff88842fad6b18 RDI: ffff88842fad6b18
       RBP: ffffc90005cd7bc8 R08: 0000000000000000 R09: 0000000000000001
       R10: 0000000000000001 R11: ffffffff822696b8 R12: 0000000003fb4000
       R13: 0000000000000001 R14: 0000000000000102 R15: ffff88819c9d67e0
       FS:  00007f08bb138fc0(0000) GS:ffff88842fac0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f8f5d861ea0 CR3: 00000003e99fe000 CR4: 00000000000006e0
       Call Trace:
       ? _raw_spin_unlock+0x27/0x40
       ? btrfs_merge_delayed_refs+0x356/0x3e0 [btrfs]
       __btrfs_run_delayed_refs+0x75a/0x13c0 [btrfs]
       ? join_transaction+0x2b/0x460 [btrfs]
       btrfs_run_delayed_refs+0xf3/0x1c0 [btrfs]
       btrfs_commit_transaction+0x52/0xa50 [btrfs]
       ? start_transaction+0xa6/0x510 [btrfs]
       btrfs_sync_fs+0x79/0x1c0 [btrfs]
       sync_filesystem+0x70/0x90
       generic_shutdown_super+0x27/0x120
       kill_anon_super+0x12/0x30
       btrfs_kill_super+0x16/0xa0 [btrfs]
       deactivate_locked_super+0x43/0x70
       deactivate_super+0x40/0x60
       cleanup_mnt+0x3f/0x80
       __cleanup_mnt+0x12/0x20
       task_work_run+0x8b/0xc0
       exit_to_usermode_loop+0xce/0xd0
       do_syscall_64+0x20b/0x210
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      To fix this simply mark dead roots we read from disk as DEAD and then
      set the walk_control->restarted flag so we know we have a restarted
      deletion.  From here whenever we try to drop refs for blocks we check to
      verify our ref is set on them, and if it is not we skip it.  Once we
      find a ref that is set we unset walk_control->restarted since the tree
      should be in a normal state from then on, and any problems we run into
      from there are different issues.  I tested this with an existing broken
      fs and my reproducer that creates a broken fs and it fixed both file
      systems.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      78c52d9e
    • Filipe Manana's avatar
      Btrfs: fix deadlock between clone/dedupe and rename · 4ea748e1
      Filipe Manana authored
      Reflinking (clone/dedupe) and rename are operations that operate on two
      inodes and therefore need to lock them in the same order to avoid ABBA
      deadlocks. It happens that Btrfs' reflink implementation always locked
      them in a different order from VFS's lock_two_nondirectories() helper,
      which is used by the rename code in VFS, resulting in ABBA type deadlocks.
      
      Btrfs' locking order:
      
        static void btrfs_double_inode_lock(struct inode *inode1, struct inode *inode2)
        {
               if (inode1 < inode2)
                      swap(inode1, inode2);
      
               inode_lock_nested(inode1, I_MUTEX_PARENT);
               inode_lock_nested(inode2, I_MUTEX_CHILD);
        }
      
      VFS's locking order:
      
        void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
        {
              if (inode1 > inode2)
                      swap(inode1, inode2);
      
              if (inode1 && !S_ISDIR(inode1->i_mode))
                      inode_lock(inode1);
              if (inode2 && !S_ISDIR(inode2->i_mode) && inode2 != inode1)
                      inode_lock_nested(inode2, I_MUTEX_NONDIR2);
      }
      
      Fix this by killing the btrfs helper function that does the double inode
      locking and replace it with VFS's helper lock_two_nondirectories().
      Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: 416161db ("btrfs: offline dedupe")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4ea748e1
    • Filipe Manana's avatar
      Btrfs: fix corruption reading shared and compressed extents after hole punching · 8e928218
      Filipe Manana authored
      In the past we had data corruption when reading compressed extents that
      are shared within the same file and they are consecutive, this got fixed
      by commit 005efedf ("Btrfs: fix read corruption of compressed and
      shared extents") and by commit 808f80b4 ("Btrfs: update fix for read
      corruption of compressed and shared extents"). However there was a case
      that was missing in those fixes, which is when the shared and compressed
      extents are referenced with a non-zero offset. The following shell script
      creates a reproducer for this issue:
      
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdc &> /dev/null
        mount -o compress /dev/sdc /mnt/sdc
      
        # Create a file with 3 consecutive compressed extents, each has an
        # uncompressed size of 128Kb and a compressed size of 4Kb.
        for ((i = 1; i <= 3; i++)); do
            head -c 4096 /dev/zero
            for ((j = 1; j <= 31; j++)); do
                head -c 4096 /dev/zero | tr '\0' "\377"
            done
        done > /mnt/sdc/foobar
        sync
      
        echo "Digest after file creation:   $(md5sum /mnt/sdc/foobar)"
      
        # Clone the first extent into offsets 128K and 256K.
        xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar
        xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar
        sync
      
        echo "Digest after cloning:         $(md5sum /mnt/sdc/foobar)"
      
        # Punch holes into the regions that are already full of zeroes.
        xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar
        xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar
        xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar
        sync
      
        echo "Digest after hole punching:   $(md5sum /mnt/sdc/foobar)"
      
        echo "Dropping page cache..."
        sysctl -q vm.drop_caches=1
        echo "Digest after hole punching:   $(md5sum /mnt/sdc/foobar)"
      
        umount /dev/sdc
      
      When running the script we get the following output:
      
        Digest after file creation:   5a0888d80d7ab1fd31c229f83a3bbcc8  /mnt/sdc/foobar
        linked 131072/131072 bytes at offset 131072
        128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec)
        linked 131072/131072 bytes at offset 262144
        128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec)
        Digest after cloning:         5a0888d80d7ab1fd31c229f83a3bbcc8  /mnt/sdc/foobar
        Digest after hole punching:   5a0888d80d7ab1fd31c229f83a3bbcc8  /mnt/sdc/foobar
        Dropping page cache...
        Digest after hole punching:   fba694ae8664ed0c2e9ff8937e7f1484  /mnt/sdc/foobar
      
      This happens because after reading all the pages of the extent in the
      range from 128K to 256K for example, we read the hole at offset 256K
      and then when reading the page at offset 260K we don't submit the
      existing bio, which is responsible for filling all the page in the
      range 128K to 256K only, therefore adding the pages from range 260K
      to 384K to the existing bio and submitting it after iterating over the
      entire range. Once the bio completes, the uncompressed data fills only
      the pages in the range 128K to 256K because there's no more data read
      from disk, leaving the pages in the range 260K to 384K unfilled. It is
      just a slightly different variant of what was solved by commit
      005efedf ("Btrfs: fix read corruption of compressed and shared
      extents").
      
      Fix this by forcing a bio submit, during readpages(), whenever we find a
      compressed extent map for a page that is different from the extent map
      for the previous page or has a different starting offset (in case it's
      the same compressed extent), instead of the extent map's original start
      offset.
      
      A test case for fstests follows soon.
      Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: 808f80b4 ("Btrfs: update fix for read corruption of compressed and shared extents")
      Fixes: 005efedf ("Btrfs: fix read corruption of compressed and shared extents")
      Cc: stable@vger.kernel.org # 4.3+
      Tested-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8e928218
  3. 25 Feb, 2019 31 commits
    • YueHaibing's avatar
      btrfs: Remove unnecessary casts in btrfs_read_root_item · f65e25e3
      YueHaibing authored
      There is a messy cast here:
      	min_t(int, len, (int)sizeof(*item)));
      
      min_t() should normally cast to unsigned.  It's not possible for "len"
      to be negative, but if it were then we definitely wouldn't want to pass
      negatives to read_extent_buffer().  Also there is an extra cast.
      
      This patch shouldn't affect runtime, it's just a clean up.
      Reviewed-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f65e25e3
    • Filipe Manana's avatar
      Btrfs: remove assertion when searching for a key in a node/leaf · 253002f2
      Filipe Manana authored
      At ctree.c:key_search(), the assertion that verifies the first key on a
      child extent buffer corresponds to the key at a specific slot in the
      parent has a disadvantage: we effectively hit a BUG_ON() which requires
      rebooting the machine later. It also does not tell any information about
      which extent buffer is affected, from which root, the expected and found
      keys, etc.
      
      However as of commit 581c1760 ("btrfs: Validate child tree block's
      level and first key"), that assertion is not needed since at the time we
      read an extent buffer from disk we validate that its first key matches the
      key, at the respective slot, in the parent extent buffer. Therefore just
      remove the assertion at key_search().
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      253002f2
    • Filipe Manana's avatar
      Btrfs: add missing error handling after doing leaf/node binary search · cbca7d59
      Filipe Manana authored
      The function map_private_extent_buffer() can return an -EINVAL error, and
      it is called by generic_bin_search() which will return back the error. The
      btrfs_bin_search() function in turn calls generic_bin_search() and the
      key_search() function calls btrfs_bin_search(), so both can return the
      -EINVAL error coming from the map_private_extent_buffer() function. Some
      callers of these functions were ignoring that these functions can return
      an error, so fix them to deal with error return values.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cbca7d59
    • Dan Carpenter's avatar
      btrfs: drop the lock on error in btrfs_dev_replace_cancel · 669e859b
      Dan Carpenter authored
      We should drop the lock on this error path.  This has been found by a
      static tool.
      
      The lock needs to be released, it's there to protect access to the
      dev_replace members and is not supposed to be left locked. The value of
      state that's being switched would need to be artifically changed to an
      invalid value so the default: branch is taken.
      
      Fixes: d189dd70 ("btrfs: fix use-after-free due to race between replace start and cancel")
      CC: stable@vger.kernel.org # 5.0+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      669e859b
    • Johannes Thumshirn's avatar
      btrfs: ensure that a DUP or RAID1 block group has exactly two stripes · 349ae63f
      Johannes Thumshirn authored
      We recently had a customer issue with a corrupted filesystem. When
      trying to mount this image btrfs panicked with a division by zero in
      calc_stripe_length().
      
      The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length()
      takes this value and divides it by the number of copies the RAID profile
      is expected to have to calculate the amount of data stripes. As a DUP
      profile is expected to have 2 copies this division resulted in 1/2 = 0.
      Later then the 'data_stripes' variable is used as a divisor in the
      stripe length calculation which results in a division by 0 and thus a
      kernel panic.
      
      When encountering a filesystem with a DUP block group and a
      'num_stripes' value unequal to 2, refuse mounting as the image is
      corrupted and will lead to unexpected behaviour.
      
      Code inspection showed a RAID1 block group has the same issues.
      
      Fixes: e06cd3dd ("Btrfs: add validadtion checks for chunk loading")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      349ae63f
    • Dan Robertson's avatar
      btrfs: init csum_list before possible free · e49be14b
      Dan Robertson authored
      The scrub_ctx csum_list member must be initialized before scrub_free_ctx
      is called. If the csum_list is not initialized beforehand, the
      list_empty call in scrub_free_csums will result in a null deref if the
      allocation fails in the for loop.
      
      Fixes: a2de733c ("btrfs: scrub")
      CC: stable@vger.kernel.org # 3.0+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDan Robertson <dan@dlrobertson.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e49be14b
    • Filipe Manana's avatar
      Btrfs: remove no longer needed range length checks for deduplication · 57a50e25
      Filipe Manana authored
      Comparing the content of the pages in the range to deduplicate is now
      done in generic_remap_checks called by the generic helper
      generic_remap_file_range_prep(), which takes care of ensuring we do not
      compare/deduplicate undefined data beyond a file's EOF (range from EOF
      to the next block boundary). So remove these checks which are now
      redundant.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      57a50e25
    • Filipe Manana's avatar
      Btrfs: fix fsync after succession of renames and unlink/rmdir · a3baaf0d
      Filipe Manana authored
      After a succession of renames operations of different files and unlinking
      one of them, if we fsync one of the renamed files we can end up with a
      log that will either fail to replay at mount time or result in a filesystem
      that is in an inconsistent state. One example scenario:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir
        $ touch /mnt/testdir/fname1
        $ touch /mnt/testdir/fname2
      
        $ sync
      
        $ mv /mnt/testdir/fname1 /mnt/testdir/fname3
        $ rm -f /mnt/testdir/fname2
        $ ln /mnt/testdir/fname3 /mnt/testdir/fname2
      
        $ touch /mnt/testdir/fname1
        $ xfs_io -c "fsync" /mnt/testdir/fname1
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ umount /mnt
        $ btrfs check /dev/sdb
        [1/7] checking root items
        [2/7] checking extents
        [3/7] checking free space cache
        [4/7] checking fs roots
        root 5 inode 259 errors 2, no orphan item
        ERROR: errors found in fs roots
        Opening filesystem to check...
        Checking filesystem on /dev/sdc
        UUID: 20e4abb8-5a19-4492-8bb4-6084125c2d0d
        found 393216 bytes used, error(s) found
        total csum bytes: 0
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 122986
        file data blocks allocated: 262144
         referenced 262144
      
      On a kernel without the first patch in this series, titled
      "[PATCH] Btrfs: fix fsync after succession of renames of different files",
      we get instead an error when mounting the filesystem due to failure of
      replaying the log:
      
        $ mount /dev/sdb /mnt
        mount: mount /dev/sdb on /mnt failed: File exists
      
      Fix this by logging the parent directory of an inode whenever we find an
      inode that no longer exists (was unlinked in the current transaction),
      during the procedure which finds inodes that have old names that collide
      with new names of other inodes.
      
      A test case for fstests follows soon.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a3baaf0d
    • Filipe Manana's avatar
      Btrfs: fix fsync after succession of renames of different files · 6b5fc433
      Filipe Manana authored
      After a succession of rename operations of different files and fsyncing
      one of them, such that each file gets a new name that corresponds to an
      old name of another file, we can end up with a log that will cause a
      failure when attempted to replay at mount time (an EEXIST error).
      We currently have correct behaviour when such succession of renames
      involves only two files, but if there are more files involved, we end up
      not logging all the inodes that are needed, therefore resulting in a
      failure when attempting to replay the log.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir
        $ touch /mnt/testdir/fname1
        $ touch /mnt/testdir/fname2
      
        $ sync
      
        $ mv /mnt/testdir/fname1 /mnt/testdir/fname3
        $ mv /mnt/testdir/fname2 /mnt/testdir/fname4
        $ ln /mnt/testdir/fname3 /mnt/testdir/fname2
      
        $ touch /mnt/testdir/fname1
        $ xfs_io -c "fsync" /mnt/testdir/fname1
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        mount: mount /dev/sdb on /mnt failed: File exists
      
      So fix this by checking all inode dependencies when logging an inode. That
      is, if one logged inode A has a new name that matches the old name of some
      other inode B, check if inode B has a new name that matches the old name
      of some other inode C, and so on. This fix is implemented not by doing any
      recursive function calls but by using an iterative method using a linked
      list that is used in a first-in-first-out fashion.
      
      A test case for fstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b5fc433
    • Josef Bacik's avatar
      btrfs: honor path->skip_locking in backref code · 38e3eebf
      Josef Bacik authored
      Qgroups will do the old roots lookup at delayed ref time, which could be
      while walking down the extent root while running a delayed ref.  This
      should be fine, except we specifically lock eb's in the backref walking
      code irrespective of path->skip_locking, which deadlocks the system.
      Fix up the backref code to honor path->skip_locking, nobody will be
      modifying the commit_root when we're searching so it's completely safe
      to do.
      
      This happens since fb235dc0 ("btrfs: qgroup: Move half of the qgroup
      accounting time out of commit trans"), kernel may lockup with quota
      enabled.
      
      There is one backref trace triggered by snapshot dropping along with
      write operation in the source subvolume.  The example can be reliably
      reproduced:
      
        btrfs-cleaner   D    0  4062      2 0x80000000
        Call Trace:
         schedule+0x32/0x90
         btrfs_tree_read_lock+0x93/0x130 [btrfs]
         find_parent_nodes+0x29b/0x1170 [btrfs]
         btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
         btrfs_find_all_roots+0x57/0x70 [btrfs]
         btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
         btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
         btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
         do_walk_down+0x541/0x5e3 [btrfs]
         walk_down_tree+0xab/0xe7 [btrfs]
         btrfs_drop_snapshot+0x356/0x71a [btrfs]
         btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
         cleaner_kthread+0x12b/0x160 [btrfs]
         kthread+0x112/0x130
         ret_from_fork+0x27/0x50
      
      When dropping snapshots with qgroup enabled, we will trigger backref
      walk.
      
      However such backref walk at that timing is pretty dangerous, as if one
      of the parent nodes get WRITE locked by other thread, we could cause a
      dead lock.
      
      For example:
      
                 FS 260     FS 261 (Dropped)
                  node A        node B
                 /      \      /      \
             node C      node D      node E
            /   \         /  \        /     \
        leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
      
      The lock sequence would be:
      
            Thread A (cleaner)             |       Thread B (other writer)
      -----------------------------------------------------------------------
      write_lock(B)                        |
      write_lock(D)                        |
      ^^^ called by walk_down_tree()       |
                                           |       write_lock(A)
                                           |       write_lock(D) << Stall
      read_lock(H) << for backref walk     |
      read_lock(D) << lock owner is        |
                      the same thread A    |
                      so read lock is OK   |
      read_lock(A) << Stall                |
      
      So thread A hold write lock D, and needs read lock A to unlock.
      While thread B holds write lock A, while needs lock D to unlock.
      
      This will cause a deadlock.
      
      This is not only limited to snapshot dropping case.  As the backref
      walk, even only happens on commit trees, is breaking the normal top-down
      locking order, makes it deadlock prone.
      
      Fixes: fb235dc0 ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
      CC: stable@vger.kernel.org # 4.14+
      Reported-and-tested-by: default avatarDavid Sterba <dsterba@suse.com>
      Reported-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      [ rebase to latest branch and fix lock assert bug in btrfs/007 ]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      [ copy logs and deadlock analysis from Qu's patch ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      38e3eebf
    • Qu Wenruo's avatar
      btrfs: qgroup: Make qgroup async transaction commit more aggressive · f5fef459
      Qu Wenruo authored
      [BUG]
      Btrfs qgroup will still hit EDQUOT under the following case:
      
        $ dev=/dev/test/test
        $ mnt=/mnt/btrfs
        $ umount $mnt &> /dev/null
        $ umount $dev &> /dev/null
      
        $ mkfs.btrfs -f $dev
        $ mount $dev $mnt -o nospace_cache
      
        $ btrfs subv create $mnt/subv
        $ btrfs quota enable $mnt
        $ btrfs quota rescan -w $mnt
        $ btrfs qgroup limit -e 1G $mnt/subv
      
        $ fallocate -l 900M $mnt/subv/padding
        $ sync
      
        $ rm $mnt/subv/padding
      
        # Hit EDQUOT
        $ xfs_io -f -c "pwrite 0 512M" $mnt/subv/real_file
      
      [CAUSE]
      Since commit a514d638 ("btrfs: qgroup: Commit transaction in advance
      to reduce early EDQUOT"), btrfs is not forced to commit transaction to
      reclaim more quota space.
      
      Instead, we just check pertrans metadata reservation against some
      threshold and try to do asynchronously transaction commit.
      
      However in above case, the pertrans metadata reservation is pretty small
      thus it will never trigger asynchronous transaction commit.
      
      [FIX]
      Instead of only accounting pertrans metadata reservation, we calculate
      how much free space we have, and if there isn't much free space left,
      commit transaction asynchronously to try to free some space.
      
      This may slow down the fs when we have less than 32M free qgroup space,
      but should reduce a lot of false EDQUOT, so the cost should be
      acceptable.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f5fef459
    • Qu Wenruo's avatar
      btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to... · 1418bae1
      Qu Wenruo authored
      btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record
      
      [BUG]
      Btrfs/139 will fail with a high probability if the testing machine (VM)
      has only 2G RAM.
      
      Resulting the final write success while it should fail due to EDQUOT,
      and the fs will have quota exceeding the limit by 16K.
      
      The simplified reproducer will be: (needs a 2G ram VM)
      
        $ mkfs.btrfs -f $dev
        $ mount $dev $mnt
      
        $ btrfs subv create $mnt/subv
        $ btrfs quota enable $mnt
        $ btrfs quota rescan -w $mnt
        $ btrfs qgroup limit -e 1G $mnt/subv
      
        $ for i in $(seq -w  1 8); do
        	xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
        	echo "file $i written" > /dev/kmsg
          done
        $ sync
        $ btrfs qgroup show -pcre --raw $mnt
      
      The last pwrite will not trigger EDQUOT and final 'qgroup show' will
      show something like:
      
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5             16384        16384         none         none ---     ---
        0/256      1073758208   1073758208         none   1073741824 ---     ---
      
      And 1073758208 is larger than
        > 1073741824.
      
      [CAUSE]
      It's a bug in btrfs qgroup data reserved space management.
      
      For quota limit, we must ensure that:
        reserved (data + metadata) + rfer/excl <= limit
      
      Since rfer/excl is only updated at transaction commmit time, reserved
      space needs to be taken special care.
      
      One important part of reserved space is data, and for a new data extent
      written to disk, we still need to take the reserved space until
      rfer/excl numbers get updated.
      
      Originally when an ordered extent finishes, we migrate the reserved
      qgroup data space from extent_io tree to delayed ref head of the data
      extent, expecting delayed ref will only be cleaned up at commit
      transaction time.
      
      However for small RAM machine, due to memory pressure dirty pages can be
      flushed back to disk without committing a transaction.
      
      The related events will be something like:
      
        file 1 written
        btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
        btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
        btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
        btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
        btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
        cleanup_ref_head: num_bytes=54947840
        cleanup_ref_head: num_bytes=5636096
        cleanup_ref_head: num_bytes=569344
        cleanup_ref_head: num_bytes=57344
        cleanup_ref_head: num_bytes=8192
        ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
        file 2 written
        ...
        file 8 written
        cleanup_ref_head: num_bytes=8192
        ...
        btrfs_commit_transaction  <<< the only transaction committed during
      				the test
      
      When file 2 is written, we have already freed 128M reserved qgroup data
      space for ino 258. Thus later write won't trigger EDQUOT.
      
      This allows us to write more data beyond qgroup limit.
      
      In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.
      
      [FIX]
      By moving reserved qgroup data space from btrfs_delayed_ref_head to
      btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
      space won't be freed half way before commit transaction, thus fix the
      problem.
      
      Fixes: f64d5ca8 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1418bae1
    • David Sterba's avatar
      btrfs: scrub: remove unused nocow worker pointer · 0ea82076
      David Sterba authored
      The member btrfs_fs_info::scrub_nocow_workers is unused since the nocow
      optimization was removed from scrub in 9bebe665 ("btrfs: scrub:
      Remove unused copy_nocow_pages and its callchain").
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ea82076
    • David Sterba's avatar
      btrfs: scrub: add assertions for worker pointers · c8352942
      David Sterba authored
      The scrub worker pointers are not NULL iff the scrub is running, so
      reset them back once the last reference is dropped. Add assertions to
      the initial phase of scrub to verify that.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c8352942
    • Anand Jain's avatar
      btrfs: scrub: convert scrub_workers_refcnt to refcount_t · ff09c4ca
      Anand Jain authored
      Use the refcount_t for fs_info::scrub_workers_refcnt instead of int so
      we get the extra checks. All reference changes are still done under
      scrub_lock.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ff09c4ca
    • Anand Jain's avatar
      btrfs: scrub: add scrub_lock lockdep check in scrub_workers_get · eb4318e5
      Anand Jain authored
      scrub_workers_refcnt is protected by scrub_lock, add lockdep_assert_held()
      in scrub_workers_get().
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Suggested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eb4318e5
    • Anand Jain's avatar
      btrfs: scrub: fix circular locking dependency warning · 1cec3f27
      Anand Jain authored
      This fixes a longstanding lockdep warning triggered by
      fstests/btrfs/011.
      
      Circular locking dependency check reports warning[1], that's because the
      btrfs_scrub_dev() calls the stack #0 below with, the fs_info::scrub_lock
      held. The test case leading to this warning:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /btrfs
        $ btrfs scrub start -B /btrfs
      
      In fact we have fs_info::scrub_workers_refcnt to track if the init and destroy
      of the scrub workers are needed. So once we have incremented and decremented
      the fs_info::scrub_workers_refcnt value in the thread, its ok to drop the
      scrub_lock, and then actually do the btrfs_destroy_workqueue() part. So this
      patch drops the scrub_lock before calling btrfs_destroy_workqueue().
      
        [359.258534] ======================================================
        [359.260305] WARNING: possible circular locking dependency detected
        [359.261938] 5.0.0-rc6-default #461 Not tainted
        [359.263135] ------------------------------------------------------
        [359.264672] btrfs/20975 is trying to acquire lock:
        [359.265927] 00000000d4d32bea ((wq_completion)"%s-%s""btrfs", name){+.+.}, at: flush_workqueue+0x87/0x540
        [359.268416]
        [359.268416] but task is already holding lock:
        [359.270061] 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
        [359.272418]
        [359.272418] which lock already depends on the new lock.
        [359.272418]
        [359.274692]
        [359.274692] the existing dependency chain (in reverse order) is:
        [359.276671]
        [359.276671] -> #3 (&fs_info->scrub_lock){+.+.}:
        [359.278187]        __mutex_lock+0x86/0x9c0
        [359.279086]        btrfs_scrub_pause+0x31/0x100 [btrfs]
        [359.280421]        btrfs_commit_transaction+0x1e4/0x9e0 [btrfs]
        [359.281931]        close_ctree+0x30b/0x350 [btrfs]
        [359.283208]        generic_shutdown_super+0x64/0x100
        [359.284516]        kill_anon_super+0x14/0x30
        [359.285658]        btrfs_kill_super+0x12/0xa0 [btrfs]
        [359.286964]        deactivate_locked_super+0x29/0x60
        [359.288242]        cleanup_mnt+0x3b/0x70
        [359.289310]        task_work_run+0x98/0xc0
        [359.290428]        exit_to_usermode_loop+0x83/0x90
        [359.291445]        do_syscall_64+0x15b/0x180
        [359.292598]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.294011]
        [359.294011] -> #2 (sb_internal#2){.+.+}:
        [359.295432]        __sb_start_write+0x113/0x1d0
        [359.296394]        start_transaction+0x369/0x500 [btrfs]
        [359.297471]        btrfs_finish_ordered_io+0x2aa/0x7c0 [btrfs]
        [359.298629]        normal_work_helper+0xcd/0x530 [btrfs]
        [359.299698]        process_one_work+0x246/0x610
        [359.300898]        worker_thread+0x3c/0x390
        [359.302020]        kthread+0x116/0x130
        [359.303053]        ret_from_fork+0x24/0x30
        [359.304152]
        [359.304152] -> #1 ((work_completion)(&work->normal_work)){+.+.}:
        [359.306100]        process_one_work+0x21f/0x610
        [359.307302]        worker_thread+0x3c/0x390
        [359.308465]        kthread+0x116/0x130
        [359.309357]        ret_from_fork+0x24/0x30
        [359.310229]
        [359.310229] -> #0 ((wq_completion)"%s-%s""btrfs", name){+.+.}:
        [359.311812]        lock_acquire+0x90/0x180
        [359.312929]        flush_workqueue+0xaa/0x540
        [359.313845]        drain_workqueue+0xa1/0x180
        [359.314761]        destroy_workqueue+0x17/0x240
        [359.315754]        btrfs_destroy_workqueue+0x57/0x200 [btrfs]
        [359.317245]        scrub_workers_put+0x2c/0x60 [btrfs]
        [359.318585]        btrfs_scrub_dev+0x336/0x590 [btrfs]
        [359.319944]        btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
        [359.321622]        btrfs_ioctl+0x28a4/0x2e40 [btrfs]
        [359.322908]        do_vfs_ioctl+0xa2/0x6d0
        [359.324021]        ksys_ioctl+0x3a/0x70
        [359.325066]        __x64_sys_ioctl+0x16/0x20
        [359.326236]        do_syscall_64+0x54/0x180
        [359.327379]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.328772]
        [359.328772] other info that might help us debug this:
        [359.328772]
        [359.330990] Chain exists of:
        [359.330990]   (wq_completion)"%s-%s""btrfs", name --> sb_internal#2 --> &fs_info->scrub_lock
        [359.330990]
        [359.334376]  Possible unsafe locking scenario:
        [359.334376]
        [359.336020]        CPU0                    CPU1
        [359.337070]        ----                    ----
        [359.337821]   lock(&fs_info->scrub_lock);
        [359.338506]                                lock(sb_internal#2);
        [359.339506]                                lock(&fs_info->scrub_lock);
        [359.341461]   lock((wq_completion)"%s-%s""btrfs", name);
        [359.342437]
        [359.342437]  *** DEADLOCK ***
        [359.342437]
        [359.343745] 1 lock held by btrfs/20975:
        [359.344788]  #0: 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
        [359.346778]
        [359.346778] stack backtrace:
        [359.347897] CPU: 0 PID: 20975 Comm: btrfs Not tainted 5.0.0-rc6-default #461
        [359.348983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
        [359.350501] Call Trace:
        [359.350931]  dump_stack+0x67/0x90
        [359.351676]  print_circular_bug.isra.37.cold.56+0x15c/0x195
        [359.353569]  check_prev_add.constprop.44+0x4f9/0x750
        [359.354849]  ? check_prev_add.constprop.44+0x286/0x750
        [359.356505]  __lock_acquire+0xb84/0xf10
        [359.357505]  lock_acquire+0x90/0x180
        [359.358271]  ? flush_workqueue+0x87/0x540
        [359.359098]  flush_workqueue+0xaa/0x540
        [359.359912]  ? flush_workqueue+0x87/0x540
        [359.360740]  ? drain_workqueue+0x1e/0x180
        [359.361565]  ? drain_workqueue+0xa1/0x180
        [359.362391]  drain_workqueue+0xa1/0x180
        [359.363193]  destroy_workqueue+0x17/0x240
        [359.364539]  btrfs_destroy_workqueue+0x57/0x200 [btrfs]
        [359.365673]  scrub_workers_put+0x2c/0x60 [btrfs]
        [359.366618]  btrfs_scrub_dev+0x336/0x590 [btrfs]
        [359.367594]  ? start_transaction+0xa1/0x500 [btrfs]
        [359.368679]  btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
        [359.369545]  btrfs_ioctl+0x28a4/0x2e40 [btrfs]
        [359.370186]  ? __lock_acquire+0x263/0xf10
        [359.370777]  ? kvm_clock_read+0x14/0x30
        [359.371392]  ? kvm_sched_clock_read+0x5/0x10
        [359.372248]  ? sched_clock+0x5/0x10
        [359.372786]  ? sched_clock_cpu+0xc/0xc0
        [359.373662]  ? do_vfs_ioctl+0xa2/0x6d0
        [359.374552]  do_vfs_ioctl+0xa2/0x6d0
        [359.375378]  ? do_sigaction+0xff/0x250
        [359.376233]  ksys_ioctl+0x3a/0x70
        [359.376954]  __x64_sys_ioctl+0x16/0x20
        [359.377772]  do_syscall_64+0x54/0x180
        [359.378841]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.380422] RIP: 0033:0x7f5429296a97
      
      Backporting to older kernels: scrub_nocow_workers must be freed the same
      way as the others.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      [ update changelog ]
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1cec3f27
    • Anand Jain's avatar
      btrfs: fix comment its device list mutex not volume lock · 7faad6e2
      Anand Jain authored
      We have killed volume mutex (commit: dccdb07b
      btrfs: kill btrfs_fs_info::volume_mutex). This a trival one seems to have
      escaped.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7faad6e2
    • Qu Wenruo's avatar
      btrfs: extent_io: Kill the forward declaration of flush_write_bio · bb58eb9e
      Qu Wenruo authored
      There is no need to forward declare flush_write_bio(), as it only
      depends on submit_one_bio().  Both of them are pretty small, just move
      them to kill the forward declaration.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb58eb9e
    • Nikolay Borisov's avatar
      btrfs: Fix grossly misleading argument names in extent io search · 352646c7
      Nikolay Borisov authored
      The variables and function parameters of __etree_search which pertain to
      prev/next are grossly misnamed. Namely, prev_ret holds the next state
      and not the previous. Similarly, next_ret actually holds the previous
      extent state relating to the offset we are interested in. Fix this by
      renaming the variables as well as switching the arguments order. No
      functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      352646c7
    • Nikolay Borisov's avatar
      btrfs: Remove EXTENT_FIRST_DELALLOC bit · ba8f5206
      Nikolay Borisov authored
      With the refactoring introduced in 8b62f87b ("Btrfs: reworki
      outstanding_extents") this flag became unused. Remove it and renumber
      the following flags accordingly. No functional changes.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ba8f5206
    • Nikolay Borisov's avatar
      btrfs: use WARN_ON in a canonical form btrfs_remove_block_group · 9a0ec83d
      Nikolay Borisov authored
      There is no point in using a construct like 'if (!condition)
      WARN_ON(1)'. Use WARN_ON(!condition) directly. No functional changes.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9a0ec83d
    • Josef Bacik's avatar
      btrfs: reserve extra space during evict · 260e7702
      Josef Bacik authored
      We could generate a lot of delayed refs in evict but never have any left
      over space from our block rsv to make up for that fact.  So reserve some
      extra space and give it to the transaction so it can be used to refill
      the delayed refs rsv every loop through the truncate path.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      260e7702
    • Josef Bacik's avatar
      btrfs: be more explicit about allowed flush states · 8a1bbe1d
      Josef Bacik authored
      For FLUSH_LIMIT flushers we really can only allocate chunks and flush
      delayed inode items, everything else is problematic.  I added a bunch of
      new states and it lead to weirdness in the FLUSH_LIMIT case because I
      forgot about how it worked.  So instead explicitly declare the states
      that are ok for flushing with FLUSH_LIMIT and use that for our state
      machine.  Then as we add new things that are safe we can just add them
      to this list.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8a1bbe1d
    • Josef Bacik's avatar
      btrfs: loop in inode_rsv_refill · 5df11363
      Josef Bacik authored
      With severe fragmentation we can end up with our inode rsv size being
      huge during writeout, which would cause us to need to make very large
      metadata reservations.
      
      However we may not actually need that much once writeout is complete,
      because of the over-reservation for the worst case.
      
      So instead try to make our reservation, and if we couldn't make it
      re-calculate our new reservation size and try again.  If our reservation
      size doesn't change between tries then we know we are actually out of
      space and can error. Flushing that could have been running in parallel
      did not make any space.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ rename to calc_refill_bytes, update comment and changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5df11363
    • Josef Bacik's avatar
      btrfs: don't enospc all tickets on flush failure · f91587e4
      Josef Bacik authored
      With the introduction of the per-inode block_rsv it became possible to
      have really really large reservation requests made because of data
      fragmentation.  Since the ticket stuff assumed that we'd always have
      relatively small reservation requests it just killed all tickets if we
      were unable to satisfy the current request.
      
      However, this is generally not the case anymore.  So fix this logic to
      instead see if we had a ticket that we were able to give some
      reservation to, and if we were continue the flushing loop again.
      
      Likewise we make the tickets use the space_info_add_old_bytes() method
      of returning what reservation they did receive in hopes that it could
      satisfy reservations down the line.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f91587e4
    • Josef Bacik's avatar
      btrfs: don't use global reserve for chunk allocation · 450114fc
      Josef Bacik authored
      We've done this forever because of the voodoo around knowing how much
      space we have.  However, we have better ways of doing this now, and on
      normal file systems we'll easily have a global reserve of 512MiB, and
      since metadata chunks are usually 1GiB that means we'll allocate
      metadata chunks more readily.  Instead use the actual used amount when
      determining if we need to allocate a chunk or not.
      
      This has a side effect for mixed block group fs'es where we are no
      longer allocating enough chunks for the data/metadata requirements.  To
      deal with this add a ALLOC_CHUNK_FORCE step to the flushing state
      machine.  This will only get used if we've already made a full loop
      through the flushing machinery and tried committing the transaction.
      
      If we have then we can try and force a chunk allocation since we likely
      need it to make progress.  This resolves issues I was seeing with
      the mixed bg tests in xfstests without the new flushing state.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      450114fc
    • Josef Bacik's avatar
      btrfs: dump block_rsv details when dumping space info · b78e5616
      Josef Bacik authored
      For enospc_debug having the block rsvs is super helpful to see if we've
      done something wrong.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b78e5616
    • Josef Bacik's avatar
      btrfs: check if there are free block groups for commit · d89dbefb
      Josef Bacik authored
      may_commit_transaction will skip committing the transaction if we don't
      have enough pinned space or if we're trying to find space for a SYSTEM
      chunk.  However, if we have pending free block groups in this transaction
      we still want to commit as we may be able to allocate a chunk to make
      our reservation.  So instead of just returning ENOSPC, check if we have
      free block groups pending, and if so commit the transaction to allow us
      to use that free space.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d89dbefb
    • Dennis Zhou's avatar
      btrfs: add zstd compression level support · 3f93aef5
      Dennis Zhou authored
      Zstd compression requires different amounts of memory for each level of
      compression. The prior patches implemented indirection to allow for each
      compression type to manage their workspaces independently. This patch
      uses this indirection to implement compression level support for zstd.
      
      To manage the additional memory require, each compression level has its
      own queue of workspaces. A global LRU is used to help with reclaim.
      Reclaim is done via a timer which provides a mechanism to decrease
      memory utilization by keeping only workspaces around that are sized
      appropriately. Forward progress is guaranteed by a preallocated max
      workspace hidden from the LRU.
      
      When getting a workspace, it uses a bitmap to identify the levels that
      are populated and scans up. If it finds a workspace that is greater than
      it, it uses it, but does not update the last_used time and the
      corresponding place in the LRU. If we hit memory pressure, we sleep on
      the max level workspace. We continue to rescan in case we can use a
      smaller workspace, but eventually should be able to obtain the max level
      workspace or allocate one again should memory pressure subside.
      
      The memory requirement for decompression is the same as level 1, and
      therefore can use any of available workspace.
      
      The number of workspaces is bound by an upper limit of the workqueue's
      limit which currently is 2 (percpu limit). The reclaim timer is used to
      free inactive/improperly sized workspaces and is set to 307s to avoid
      colliding with transaction commit (every 30s).
      
      Repeating the experiment from v2 [1], the Silesia corpus was copied to a
      btrfs filesystem 10 times and then read back after dropping the caches.
      The btrfs filesystem was on an SSD.
      
      Level   Ratio   Compression (MB/s)  Decompression (MB/s)  Memory (KB)
      1       2.658        438.47                910.51            780
      2       2.744        364.86                886.55           1004
      3       2.801        336.33                828.41           1260
      4       2.858        286.71                886.55           1260
      5       2.916        212.77                556.84           1388
      6       2.363        119.82                990.85           1516
      7       3.000        154.06                849.30           1516
      8       3.011        159.54                875.03           1772
      9       3.025        100.51                940.15           1772
      10      3.033        118.97                616.26           1772
      11      3.036         94.19                802.11           1772
      12      3.037         73.45                931.49           1772
      13      3.041         55.17                835.26           2284
      14      3.087         44.70                716.78           2547
      15      3.126         37.30                878.84           2547
      
      [1] https://lore.kernel.org/linux-btrfs/20181031181108.289340-1-terrelln@fb.com/
      
      Cc: Nick Terrell <terrelln@fb.com>
      Cc: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3f93aef5
    • Dennis Zhou's avatar
      btrfs: make zstd memory requirements monotonic · d3c6ab75
      Dennis Zhou authored
      It is possible based on the level configurations that a higher level
      workspace uses less memory than a lower level workspace. In order to
      reuse workspaces, this must be made a monotonic relationship. This
      precomputes the required memory for each level and enforces the
      monotonicity between level and memory required. This is also done
      in upstream zstd in [1].
      
      [1] https://github.com/facebook/zstd/commit/a68b76afefec6876f8e8a538155109a5aeac0143
      
      Cc: Nick Terrell <terrelln@fb.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d3c6ab75