1. 02 Jul, 2019 36 commits
    • btrfs: Simplify update of space_info in __reserve_metadata_bytes() · 9b4851bc
      Goldwyn Rodrigues authored
      We don't need an if-else-if chain where we can use a simple OR since
      both conditions are performing the same action. The short-circuit for OR
      will ensure that if the first condition is true, can_overcommit() is not
      called.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
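
A minimal sketch of the simplification described in this commit message, not the kernel code. The names `can_overcommit_model`, `overcommit_calls`, `reserve_chained` and `reserve_or` are hypothetical. It shows that merging two branches that perform the same action into one `||` condition keeps the short-circuit, so the second check is skipped when the first already holds.

```c
#include <stdbool.h>

/* Hypothetical call counter, used only to observe short-circuiting. */
static int overcommit_calls;

/* Stand-in for an expensive check; NOT the kernel's can_overcommit(). */
static bool can_overcommit_model(unsigned long used, unsigned long bytes,
                                 unsigned long limit)
{
    overcommit_calls++;
    return used + bytes <= limit;
}

/* Before: an if-else-if chain whose branches perform the same action. */
static bool reserve_chained(unsigned long used, unsigned long bytes,
                            unsigned long total, unsigned long limit)
{
    if (used + bytes <= total)
        return true;
    else if (can_overcommit_model(used, bytes, limit))
        return true;
    return false;
}

/* After: one condition; || evaluates the right side only when needed. */
static bool reserve_or(unsigned long used, unsigned long bytes,
                       unsigned long total, unsigned long limit)
{
    return used + bytes <= total ||
           can_overcommit_model(used, bytes, limit);
}
```

Both functions return the same result for the same inputs; only the shape of the condition differs.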
    • btrfs: unexport can_overcommit · 83d731a5
      Josef Bacik authored
      Now that we've moved all of the users to space-info.c, unexport it and
      name it back to can_overcommit.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move reserve_metadata_bytes and supporting code to space-info.c · 0d9764f6
      Josef Bacik authored
      This moves all of the metadata reservation code into space-info.c.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move dump_space_info to space-info.c · 5da6afeb
      Josef Bacik authored
      We'll need this exported so we can use it in all the various ways we need
      to use it.  This is prep work to move reserve_metadata_bytes.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: export block_rsv_use_bytes · c2a67a76
      Josef Bacik authored
      We are going to need this to move the metadata reservation stuff to
      space-info.c.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move btrfs_space_info_add_*_bytes to space-info.c · b338b013
      Josef Bacik authored
      Now that we've moved all the pre-requisite stuff, move these two
      functions.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move the space info update macro to space-info.h · bb96c4e5
      Josef Bacik authored
      Also rename it to btrfs_space_info_update_* so it's clear what we're
      updating.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move and export can_overcommit · 41783ef2
      Josef Bacik authored
      This is the first piece of moving the space reservation code to
      space-info.c
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move the space_info handling code to space-info.c · 280c2908
      Josef Bacik authored
      These are the basic init and lookup functions and some helper functions,
      fairly straightforward before the bad stuff starts.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: export space_info_add_*_bytes · d44b72aa
      Josef Bacik authored
      Prep work for consolidating all of the space_info code into one file.
      We need to export these so multiple files can use them.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename do_chunk_alloc to btrfs_chunk_alloc · fc471cb0
      Josef Bacik authored
      Really we just need the enum, but as we break more things up it'll help
      to have this external to extent-tree.c.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move space_info to space-info.h · 8719aaae
      Josef Bacik authored
      Migrate the struct definition and the one helper that's in ctree.h into
      space-info.h
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: lift bio_set_dev from bio allocation helpers · e749af44
      David Sterba authored
      The block device is passed around for the sole purpose of setting it in
      new bios. Move the assignment one level up. This is a preparatory patch for
      further bdev cleanups.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use raid_attr for minimum stripe count in btrfs_calc_avail_data_space · e1ea2bee
      David Sterba authored
      Minimum stripe count matches the minimum devices required for a given
      profile. The open coded assignments match the raid_attr table.
      
      What's changed here is the meaning for RAID5/6. Previously their
      min_stripes would be 1, while now it's devs_min. This however should be
      the same as before because it's not possible to create a filesystem on
      fewer devices than the raid_attr table allows.
      
      There's no adjustment regarding the parity stripes (like
      calc_data_stripes does), because we're interested in overall space that
      would fit on the devices.
      
      Missing devices make no difference for the whole calculation, we have
      the size stored in the structures.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use raid_attr to adjust minimal stripe size in btrfs_calc_avail_data_space · 4f080f57
      David Sterba authored
      The special case for DUP can be replaced by a lookup in the attribute
      table, where dev_stripes is the right coefficient.
      Signed-off-by: David Sterba <dsterba@suse.com>
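
The table lookup that the two raid_attr commits describe can be sketched as follows. This is an illustration only: the type, the enum names and the numeric values are assumptions made for this sketch, so check the kernel's own attribute table before relying on any of them. The point is that open-coded per-profile special cases become two field reads.

```c
#include <stddef.h>

/* Hypothetical profile indexes for this sketch. */
enum raid_index_model {
    R_SINGLE, R_DUP, R_RAID0, R_RAID1, R_RAID10, R_RAID5, R_RAID6, R_NR
};

struct raid_attr_model {
    int devs_min;    /* minimum number of devices for the profile */
    int dev_stripes; /* stripes placed on each device */
};

/* Assumed values, for illustration only. */
static const struct raid_attr_model raid_attr_model[R_NR] = {
    [R_SINGLE] = { .devs_min = 1, .dev_stripes = 1 },
    [R_DUP]    = { .devs_min = 1, .dev_stripes = 2 },
    [R_RAID0]  = { .devs_min = 2, .dev_stripes = 1 },
    [R_RAID1]  = { .devs_min = 2, .dev_stripes = 1 },
    [R_RAID10] = { .devs_min = 4, .dev_stripes = 1 },
    [R_RAID5]  = { .devs_min = 2, .dev_stripes = 1 },
    [R_RAID6]  = { .devs_min = 3, .dev_stripes = 1 },
};

/* Minimum stripe count is simply the minimum device count. */
static int min_stripes_for(enum raid_index_model idx)
{
    return raid_attr_model[idx].devs_min;
}

/* DUP needs twice the space per device; the table encodes that factor. */
static unsigned long long min_stripe_size_for(enum raid_index_model idx,
                                              unsigned long long base)
{
    return base * (unsigned long long)raid_attr_model[idx].dev_stripes;
}
```

With this layout, adding a profile means adding one table row rather than touching every caller.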
    • btrfs: drop default value assignments in enums · f262fa8d
      David Sterba authored
      A few more instances where we don't need to specify the values as long as
      they are the same as those the enum assigns automatically. All of the enums
      are in-memory only and nothing relies on the exact values.
      Signed-off-by: David Sterba <dsterba@suse.com>
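
A small illustration of why such assignments are redundant, using hypothetical enum names. In C the first enumerator defaults to 0 and each following one is the previous value plus 1, so spelling out those same numbers adds nothing.

```c
/* Explicit values that merely repeat what the compiler would assign. */
enum state_explicit_model {
    EX_IDLE = 0,
    EX_RUNNING = 1,
    EX_DONE = 2,
};

/* Same values, assigned automatically. */
enum state_implicit_model {
    IM_IDLE,
    IM_RUNNING,
    IM_DONE,
};
```

This only holds while nothing outside the program depends on the numbers, which is why the commit message stresses that the enums are in-memory only.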
    • btrfs: use common helpers for extent IO state insertion messages · 2792237d
      David Sterba authored
      Print the error messages using the helpers that also print the
      filesystem identification.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: run delayed iput at unlink time · 63611e73
      Josef Bacik authored
      We have been seeing issues in production where a cleaner script will end
      up unlinking a bunch of files that have pending iputs.  This means they
      will get their final iput's run at btrfs-cleaner time and thus are not
      throttled, which impacts the workload.
      
      Since we are unlinking these files we can just drop the delayed iput at
      unlink time.  We are already holding a reference to the inode so this
      will not be the final iput and thus is completely safe to do at this
      point.  Doing this means we are more likely to be doing the final iput
      at unlink time, and thus will get the IO charged to the caller and get
      throttled appropriately without affecting the main workload.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: add missing inode version, ctime and mtime updates when punching hole · 17900668
      Filipe Manana authored
      If the range for which we are punching a hole covers only part of a page,
      we end up updating the inode item but we skip the update of the inode's
      iversion, mtime and ctime. Fix that by ensuring we update those properties
      of the inode.
      
      A patch for fstests test case generic/059 that tests this has been sent
      along with this fix.
      
      Fixes: 2aaa6655 ("Btrfs: add hole punching")
      Fixes: e8c1c76e ("Btrfs: add missing inode update when punching hole")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix fsync not persisting dentry deletions due to inode evictions · 803f0f64
      Filipe Manana authored
      In order to avoid searches on a log tree when unlinking an inode, we check
      if the inode being unlinked was logged in the current transaction, as well
      as the inode of its parent directory. When any of the inodes are logged,
      we proceed to delete directory items and inode reference items from the
      log, to ensure that a subsequent fsync of only the inode being unlinked,
      or only of the parent directory, does not result in the entry still
      existing after a power failure.
      
      That check however is not reliable when one of the inodes involved (the
      one being unlinked or its parent directory's inode) is evicted, since the
      logged_trans field is transient, that is, it is not stored on disk, so it
      is lost when the inode is evicted and is reset to zero when the inode is
      loaded into memory again. As a consequence the checks currently being done
      by btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() always
      return true if the inode was evicted before, regardless of the inode
      having been logged or not before (and in the current transaction). This
      results in the dentry being unlinked still existing after a log replay
      if after the unlink operation only one of the inodes involved is fsync'ed.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Keep an open file descriptor on our directory while we evict inodes.
        # We just want to evict the file's inode, the directory's inode must not
        # be evicted.
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time to background process to chdir to our test
        # directory.
        $ sleep 0.5
      
        # Trigger eviction of the file's inode.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # Unlink our file and fsync the parent directory. After a power failure
        # we don't expect to see the file anymore, since we fsync'ed the parent
        # directory.
        $ rm -f /mnt/dir/foo
        $ xfs_io -c fsync /mnt/dir
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ ls /mnt/dir
        foo
        $
         --> file still there, unlink not persisted despite explicit fsync on dir
      
      Fix this by checking if the inode has the full_sync bit set in its runtime
      flags as well, since that bit is set every time an inode is loaded from
      disk, or for other less common cases such as after a shrinking truncate
      or failure to allocate extent maps for holes, and gets cleared after the
      first fsync. Also consider the inode as possibly logged only if it was
      last modified in the current transaction (besides having the full_sync
      flag set).
      
      Fixes: 3a5f1d45 ("Btrfs: Optimize btree walking while logging inodes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
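
The check described in the fix can be modelled in userspace as below. This is a simplified sketch, not the kernel implementation: `struct inode_model`, its fields and both helper names are hypothetical, and the sketch assumes that the transaction that last modified an inode is still known after a reload. It contrasts a check based only on the transient `logged_trans` value with one that also treats a freshly loaded, recently modified inode as possibly logged.

```c
#include <stdbool.h>
#include <stdint.h>

struct inode_model {
    uint64_t logged_trans;  /* in-memory only; zero after a reload */
    uint64_t last_trans;    /* transaction that last modified the inode */
    bool needs_full_sync;   /* set on load, cleared by the first fsync */
};

/* Simulate eviction followed by loading the inode from disk again. */
static void evict_and_reload(struct inode_model *inode)
{
    inode->logged_trans = 0;
    inode->needs_full_sync = true;
}

/* Old-style check: relies only on the transient field. */
static bool logged_old(const struct inode_model *inode, uint64_t transid)
{
    return inode->logged_trans == transid;
}

/* New-style check: a reloaded inode modified in this transaction
 * is treated as possibly logged. */
static bool maybe_logged_new(const struct inode_model *inode,
                             uint64_t transid)
{
    if (inode->logged_trans == transid)
        return true;
    return inode->last_trans == transid && inode->needs_full_sync;
}
```

The second condition is deliberately conservative: it may report an inode as possibly logged when it was not, which costs an extra log search but never leaves a stale entry behind.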
    • btrfs: Use btrfs_get_io_geometry appropriately · 89b798ad
      Nikolay Borisov authored
      Presently btrfs_map_block is used not only to do everything necessary to
      map a bio to the underlying allocation profile but it's also used to
      identify how much data could be written based on btrfs' stripe logic
      without actually submitting anything. This is achieved by passing NULL
      for 'bbio_ret' parameter.
      
      This patch refactors all callers that require just the mapping length
      by switching them to using btrfs_io_geometry instead of calling
      btrfs_map_block with a special NULL value for 'bbio_ret'. No functional
      change.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Introduce btrfs_io_geometry infrastructure · 5f141126
      Nikolay Borisov authored
      Add a structure that holds various parameters for IO calculations and a
      helper that fills the values. This will help further refactoring and
      reduction of functions that in some way open-coded the calculations.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
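
To make the idea of a geometry structure plus a helper that fills it more concrete, here is a heavily simplified sketch. The struct layout, field names and function name are hypothetical and are not the kernel's definitions; it also ignores profile-specific rules. It answers only the question of how many bytes can be handled before the current stripe or the chunk ends, without building or submitting any I/O.

```c
#include <stdint.h>

/* Hypothetical, reduced set of parameters for this sketch. */
struct io_geometry_model {
    uint64_t stripe_nr;     /* index of the stripe containing the offset */
    uint64_t stripe_offset; /* offset inside that stripe */
    uint64_t len;           /* bytes until the stripe or the chunk ends */
};

/* Fill geom for an offset inside a chunk; returns 0 or -1 on bad input. */
static int get_io_geometry_model(uint64_t chunk_len, uint64_t stripe_len,
                                 uint64_t offset,
                                 struct io_geometry_model *geom)
{
    uint64_t to_stripe_end;
    uint64_t to_chunk_end;

    if (stripe_len == 0 || offset >= chunk_len)
        return -1;

    geom->stripe_nr = offset / stripe_len;
    geom->stripe_offset = offset % stripe_len;

    to_stripe_end = stripe_len - geom->stripe_offset;
    to_chunk_end = chunk_len - offset;
    geom->len = to_stripe_end < to_chunk_end ? to_stripe_end : to_chunk_end;
    return 0;
}
```

A caller that needs only the mapping length reads `geom->len`, instead of invoking the full mapping routine with a special NULL argument.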
    • btrfs: improve messages when updating feature flags · c9d713d5
      David Sterba authored
      Currently the messages printed after setting an incompat feature are
      cryptic; we can easily make them better as the textual description is
      passed to the helpers. Old:
      
        setting 128 feature flag
      
      updated:
      
        setting incompat feature flag for RAID56 (0x80)
      Signed-off-by: David Sterba <dsterba@suse.com>
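
A sketch of how the updated message format could be produced. The helper name and its parameters are hypothetical; only the output text follows the example in the commit message. Note that the old message printed the flag in decimal (128) while the new one prints it in hexadecimal (0x80).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats "setting <type> feature flag for <name> (0x..)". */
static int format_feature_msg(char *buf, size_t size, const char *type,
                              const char *name, unsigned long long flag)
{
    return snprintf(buf, size, "setting %s feature flag for %s (0x%llx)",
                    type, name, flag);
}
```

Calling it with "incompat", "RAID56" and 128 fills the buffer with the updated message quoted in the commit.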
    • btrfs: shut up bogus -Wmaybe-uninitialized warning · 6c64460c
      Arnd Bergmann authored
      gcc sometimes can't determine whether a variable has been initialized
      when both the initialization and the use are conditional:
      
      fs/btrfs/props.c: In function 'inherit_props':
      fs/btrfs/props.c:389:4: error: 'num_bytes' may be used uninitialized in this function [-Werror=maybe-uninitialized]
          btrfs_block_rsv_release(fs_info, trans->block_rsv,
      
      This code is fine. Unfortunately, I cannot think of a good way to
      rephrase it in a way that makes gcc understand this, so I add a bogus
      initialization the way one should not.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ gcc 8 and 9 don't emit the warning ]
      Signed-off-by: David Sterba <dsterba@suse.com>
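
The pattern that confuses the compiler looks roughly like the sketch below. All names here are hypothetical and the function is not the kernel's inherit_props(). The variable is written and read under the same condition, so it is never used uninitialized, yet some compiler versions cannot prove that and warn unless a dummy initial value is supplied.

```c
#include <stdbool.h>

/* Hypothetical accounting sink so the effect can be observed. */
static unsigned long released_total;

static void release_bytes_model(unsigned long bytes)
{
    released_total += bytes;
}

static void inherit_model(bool need_reserve)
{
    /* The initial value is never read; it only silences the warning. */
    unsigned long num_bytes = 0;

    if (need_reserve)
        num_bytes = 4096;               /* conditional initialization */

    /* ... work that does not touch num_bytes ... */

    if (need_reserve)
        release_bytes_model(num_bytes); /* conditional use, same condition */
}
```

The cost of the workaround is that a real uninitialized use introduced later would no longer be flagged, which is why the commit message calls it the way one should not.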
    • Btrfs: prevent send failures and crashes due to concurrent relocation · 9e967495
      Filipe Manana authored
      Send always operates on read-only trees and always expected that while it
      is in progress, nothing changes in those trees. Due to that expectation
      and the fact that send is a read-only operation, it operates on commit
      roots and does not hold transaction handles. However relocation can COW
      nodes and leafs from read-only trees, which can cause unexpected failures
      and crashes (hitting BUG_ONs): while send is using a node/leaf, it gets
      COWed, the transaction used to COW it is committed, a new transaction
      starts, the extent previously used for that node/leaf gets allocated,
      possibly for another tree, and the respective extent buffer's content
      changes while send is still using it. When this happens send normally
      fails with EIO being returned to user space and messages like the
      following are found in dmesg/syslog:
      
        [ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
        [ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984
      
      Other times, less often, we hit a BUG_ON() because an extent buffer that
      send is using used to be a node and, while send was still using it, it
      got COWed and reused as a leaf, producing the following trace:
      
       [ 3478.466280] ------------[ cut here ]------------
       [ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
       [ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
       [ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
       [ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
       (...)
       [ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
       [ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
       [ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
       [ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
       [ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
       [ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       [ 3478.475840] FS:  00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
       [ 3478.476489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
       [ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [ 3478.479003] Call Trace:
       [ 3478.479600]  ? do_raw_spin_unlock+0x49/0xc0
       [ 3478.480202]  tree_advance+0x173/0x1d0 [btrfs]
       [ 3478.480810]  btrfs_compare_trees+0x30c/0x690 [btrfs]
       [ 3478.481388]  ? process_extent+0x1280/0x1280 [btrfs]
       [ 3478.481954]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
       [ 3478.482510]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
       [ 3478.483062]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
       [ 3478.483581]  ? rq_clock_task+0x2e/0x60
       [ 3478.484086]  ? wake_up_new_task+0x1f3/0x370
       [ 3478.484582]  ? do_vfs_ioctl+0xa2/0x6f0
       [ 3478.485075]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
       [ 3478.485552]  do_vfs_ioctl+0xa2/0x6f0
       [ 3478.486016]  ? __fget+0x113/0x200
       [ 3478.486467]  ksys_ioctl+0x70/0x80
       [ 3478.486911]  __x64_sys_ioctl+0x16/0x20
       [ 3478.487337]  do_syscall_64+0x60/0x1b0
       [ 3478.487751]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
       (...)
       [ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
       [ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
       [ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
       [ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
       [ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
       (...)
       [ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---
      
      Another possibility, much less likely to happen, is that send will not
      fail but the contents of the stream it produces may not be correct.
      
      To avoid this, do not allow send and relocation (balance) to run in
      parallel. In the long term the goal is to allow for both to be able to
      run concurrently without any problems, but that will take a significant
      effort in development and testing.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
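
The mutual exclusion between send and relocation can be modelled as follows. This is a sketch under stated assumptions: the struct, the field names, the function names and the error code are all hypothetical, and a real implementation would check and update these fields under a lock. It shows the rule only: a send may not start while balance runs, and balance may not start while any send runs.

```c
#include <errno.h>
#include <stdbool.h>

struct fs_model {
    int send_in_progress;  /* number of running send operations */
    bool balance_running;  /* true while relocation is active */
};

/* Refuse to start a send while relocation is running. */
static int send_start(struct fs_model *fs)
{
    if (fs->balance_running)
        return -EAGAIN;
    fs->send_in_progress++;
    return 0;
}

static void send_end(struct fs_model *fs)
{
    fs->send_in_progress--;
}

/* Refuse to start relocation while any send is running. */
static int balance_start(struct fs_model *fs)
{
    if (fs->send_in_progress > 0)
        return -EAGAIN;
    fs->balance_running = true;
    return 0;
}

static void balance_end(struct fs_model *fs)
{
    fs->balance_running = false;
}
```

A counter is used for send because several sends can run at once, while a single flag is enough for balance in this model.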
    • btrfs: document BTRFS_MAX_MIRRORS · 71a9c488
      David Sterba authored
      The real meaning of that constant is not clear from the context due to
      the target device inclusion.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use mask for RAID56 profiles · a07e8a46
      David Sterba authored
      We don't need to enumerate the profiles, use the mask for consistency.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add mask for all RAID1 types · c7369b3f
      David Sterba authored
      Preparatory patch for additional RAID1 profiles with more copies. The
      mask will contain 3-copy and 4-copy, most of the checks for plain RAID1
      work the same for the other profiles.
      Signed-off-by: David Sterba <dsterba@suse.com>
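
The two mask commits follow the same idea, sketched here with hypothetical macro names. The bit positions are assumptions made for illustration, and the 3-copy and 4-copy bits are placeholders for profiles that this commit only prepares for. Testing against a mask replaces enumerating each profile at every call site.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical profile bits; positions are assumed for this sketch. */
#define MODEL_BG_RAID0    (1ULL << 3)
#define MODEL_BG_RAID1    (1ULL << 4)
#define MODEL_BG_RAID5    (1ULL << 7)
#define MODEL_BG_RAID6    (1ULL << 8)
#define MODEL_BG_RAID1C3  (1ULL << 9)   /* placeholder: 3-copy RAID1 */
#define MODEL_BG_RAID1C4  (1ULL << 10)  /* placeholder: 4-copy RAID1 */

#define MODEL_RAID56_MASK (MODEL_BG_RAID5 | MODEL_BG_RAID6)
#define MODEL_RAID1_MASK  (MODEL_BG_RAID1 | MODEL_BG_RAID1C3 | \
                           MODEL_BG_RAID1C4)

/* One mask test instead of "flags & RAID5 || flags & RAID6". */
static bool is_raid56(uint64_t flags)
{
    return (flags & MODEL_RAID56_MASK) != 0;
}

/* Checks written for plain RAID1 extend to the other RAID1 types. */
static bool is_raid1_type(uint64_t flags)
{
    return (flags & MODEL_RAID1_MASK) != 0;
}
```

When a new profile is added, only the mask definition changes and every caller picks it up.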
    • btrfs: qgroup: Don't hold qgroup_ioctl_lock in btrfs_qgroup_inherit() · e88439de
      Qu Wenruo authored
      [BUG]
      Lockdep will report the following circular locking dependency:
      
        WARNING: possible circular locking dependency detected
        5.2.0-rc2-custom #24 Tainted: G           O
        ------------------------------------------------------
        btrfs/8631 is trying to acquire lock:
        000000002536438c (&fs_info->qgroup_ioctl_lock#2){+.+.}, at: btrfs_qgroup_inherit+0x40/0x620 [btrfs]
      
        but task is already holding lock:
        000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (&fs_info->tree_log_mutex){+.+.}:
               __mutex_lock+0x76/0x940
               mutex_lock_nested+0x1b/0x20
               btrfs_commit_transaction+0x475/0xa00 [btrfs]
               btrfs_commit_super+0x71/0x80 [btrfs]
               close_ctree+0x2bd/0x320 [btrfs]
               btrfs_put_super+0x15/0x20 [btrfs]
               generic_shutdown_super+0x72/0x110
               kill_anon_super+0x18/0x30
               btrfs_kill_super+0x16/0xa0 [btrfs]
               deactivate_locked_super+0x3a/0x80
               deactivate_super+0x51/0x60
               cleanup_mnt+0x3f/0x80
               __cleanup_mnt+0x12/0x20
               task_work_run+0x94/0xb0
               exit_to_usermode_loop+0xd8/0xe0
               do_syscall_64+0x210/0x240
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #1 (&fs_info->reloc_mutex){+.+.}:
               __mutex_lock+0x76/0x940
               mutex_lock_nested+0x1b/0x20
               btrfs_commit_transaction+0x40d/0xa00 [btrfs]
               btrfs_quota_enable+0x2da/0x730 [btrfs]
               btrfs_ioctl+0x2691/0x2b40 [btrfs]
               do_vfs_ioctl+0xa9/0x6d0
               ksys_ioctl+0x67/0x90
               __x64_sys_ioctl+0x1a/0x20
               do_syscall_64+0x65/0x240
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #0 (&fs_info->qgroup_ioctl_lock#2){+.+.}:
               lock_acquire+0xa7/0x190
               __mutex_lock+0x76/0x940
               mutex_lock_nested+0x1b/0x20
               btrfs_qgroup_inherit+0x40/0x620 [btrfs]
               create_pending_snapshot+0x9d7/0xe60 [btrfs]
               create_pending_snapshots+0x94/0xb0 [btrfs]
               btrfs_commit_transaction+0x415/0xa00 [btrfs]
               btrfs_mksubvol+0x496/0x4e0 [btrfs]
               btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
               btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
               btrfs_ioctl+0xa90/0x2b40 [btrfs]
               do_vfs_ioctl+0xa9/0x6d0
               ksys_ioctl+0x67/0x90
               __x64_sys_ioctl+0x1a/0x20
               do_syscall_64+0x65/0x240
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        other info that might help us debug this:
      
        Chain exists of:
          &fs_info->qgroup_ioctl_lock#2 --> &fs_info->reloc_mutex --> &fs_info->tree_log_mutex
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&fs_info->tree_log_mutex);
                                       lock(&fs_info->reloc_mutex);
                                       lock(&fs_info->tree_log_mutex);
          lock(&fs_info->qgroup_ioctl_lock#2);
      
         *** DEADLOCK ***
      
        6 locks held by btrfs/8631:
         #0: 00000000ed8f23f6 (sb_writers#12){.+.+}, at: mnt_want_write_file+0x28/0x60
         #1: 000000009fb1597a (&type->i_mutex_dir_key#10/1){+.+.}, at: btrfs_mksubvol+0x70/0x4e0 [btrfs]
         #2: 0000000088c5ad88 (&fs_info->subvol_sem){++++}, at: btrfs_mksubvol+0x128/0x4e0 [btrfs]
         #3: 000000009606fc3e (sb_internal#2){.+.+}, at: start_transaction+0x37a/0x520 [btrfs]
         #4: 00000000f82bbdf5 (&fs_info->reloc_mutex){+.+.}, at: btrfs_commit_transaction+0x40d/0xa00 [btrfs]
         #5: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]
      
      [CAUSE]
      Due to the delayed subvolume creation, we need to call
      btrfs_qgroup_inherit() inside the commit transaction code, with a lot of
      other mutexes held.
      This long lock chain can lead to the above problem.
      
      [FIX]
      On the other hand, we don't really need to hold qgroup_ioctl_lock if
      we're in the context of create_pending_snapshot(),
      as in that context we're the only one able to modify qgroups.

      All other qgroup functions which need qgroup_ioctl_lock are either
      holding a transaction handle, or will start a new transaction:
        Functions that start a new transaction:
        * btrfs_quota_enable()
        * btrfs_quota_disable()
        Functions that hold a transaction handle:
        * btrfs_add_qgroup_relation()
        * btrfs_del_qgroup_relation()
        * btrfs_create_qgroup()
        * btrfs_remove_qgroup()
        * btrfs_limit_qgroup()
        * btrfs_qgroup_inherit() call inside create_subvol()
      
      So we have a higher level protection provided by transaction, thus we
      don't need to always hold qgroup_ioctl_lock in btrfs_qgroup_inherit().
      
      Only the btrfs_qgroup_inherit() call in create_subvol() needs to hold
      qgroup_ioctl_lock, while the btrfs_qgroup_inherit() call in
      create_pending_snapshot() is already protected by transaction.
      
      So the fix is to detect the context by checking
      trans->transaction->state.
      If we're at TRANS_STATE_COMMIT_DOING, then we're in the commit transaction
      context and there is no need to take the mutex.
      Reported-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
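
The context detection described in the [FIX] section can be sketched like this. It is a model with hypothetical types and names, not the kernel function: the lock is represented by plain counters so the behaviour can be observed without threads. The lock is skipped only when the transaction is already in its commit phase, where the caller is the sole modifier.

```c
#include <stdbool.h>

/* Hypothetical transaction states for this sketch. */
enum trans_state_model {
    TS_RUNNING,
    TS_COMMIT_START,
    TS_COMMIT_DOING,
    TS_COMPLETED,
};

/* Counter-based stand-in for a mutex. */
struct lock_model {
    int depth;        /* 1 while held, 0 otherwise */
    int acquisitions; /* how many times the lock was taken */
};

static void qgroup_inherit_model(enum trans_state_model state,
                                 struct lock_model *ioctl_lock)
{
    /* In commit context the transaction already serializes modifiers. */
    bool committing = (state == TS_COMMIT_DOING);

    if (!committing) {
        ioctl_lock->depth++;
        ioctl_lock->acquisitions++;
    }

    /* ... inherit qgroup settings here ... */

    if (!committing)
        ioctl_lock->depth--;
}
```

Deciding once at the top and reusing the same flag for lock and unlock keeps the two paths balanced.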
    • btrfs: correctly validate compression type · aa53e3bf
      Johannes Thumshirn authored
      Nikolay reported the following KASAN splat when running btrfs/048:
      
      [ 1843.470920] ==================================================================
      [ 1843.471971] BUG: KASAN: slab-out-of-bounds in strncmp+0x66/0xb0
      [ 1843.472775] Read of size 1 at addr ffff888111e369e2 by task btrfs/3979
      
      [ 1843.473904] CPU: 3 PID: 3979 Comm: btrfs Not tainted 5.2.0-rc3-default #536
      [ 1843.475009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [ 1843.476322] Call Trace:
      [ 1843.476674]  dump_stack+0x7c/0xbb
      [ 1843.477132]  ? strncmp+0x66/0xb0
      [ 1843.477587]  print_address_description+0x114/0x320
      [ 1843.478256]  ? strncmp+0x66/0xb0
      [ 1843.478740]  ? strncmp+0x66/0xb0
      [ 1843.479185]  __kasan_report+0x14e/0x192
      [ 1843.479759]  ? strncmp+0x66/0xb0
      [ 1843.480209]  kasan_report+0xe/0x20
      [ 1843.480679]  strncmp+0x66/0xb0
      [ 1843.481105]  prop_compression_validate+0x24/0x70
      [ 1843.481798]  btrfs_xattr_handler_set_prop+0x65/0x160
      [ 1843.482509]  __vfs_setxattr+0x71/0x90
      [ 1843.483012]  __vfs_setxattr_noperm+0x84/0x130
      [ 1843.483606]  vfs_setxattr+0xac/0xb0
      [ 1843.484085]  setxattr+0x18c/0x230
      [ 1843.484546]  ? vfs_setxattr+0xb0/0xb0
      [ 1843.485048]  ? __mod_node_page_state+0x1f/0xa0
      [ 1843.485672]  ? _raw_spin_unlock+0x24/0x40
      [ 1843.486233]  ? __handle_mm_fault+0x988/0x1290
      [ 1843.486823]  ? lock_acquire+0xb4/0x1e0
      [ 1843.487330]  ? lock_acquire+0xb4/0x1e0
      [ 1843.487842]  ? mnt_want_write_file+0x3c/0x80
      [ 1843.488442]  ? debug_lockdep_rcu_enabled+0x22/0x40
      [ 1843.489089]  ? rcu_sync_lockdep_assert+0xe/0x70
      [ 1843.489707]  ? __sb_start_write+0x158/0x200
      [ 1843.490278]  ? mnt_want_write_file+0x3c/0x80
      [ 1843.490855]  ? __mnt_want_write+0x98/0xe0
      [ 1843.491397]  __x64_sys_fsetxattr+0xba/0xe0
      [ 1843.492201]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [ 1843.493201]  do_syscall_64+0x6c/0x230
      [ 1843.493988]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 1843.495041] RIP: 0033:0x7fa7a8a7707a
      [ 1843.495819] Code: 48 8b 0d 21 de 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 be 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ee dd 2b 00 f7 d8 64 89 01 48
      [ 1843.499203] RSP: 002b:00007ffcb73bca38 EFLAGS: 00000202 ORIG_RAX: 00000000000000be
      [ 1843.500210] RAX: ffffffffffffffda RBX: 00007ffcb73bda9d RCX: 00007fa7a8a7707a
      [ 1843.501170] RDX: 00007ffcb73bda9d RSI: 00000000006dc050 RDI: 0000000000000003
      [ 1843.502152] RBP: 00000000006dc050 R08: 0000000000000000 R09: 0000000000000000
      [ 1843.503109] R10: 0000000000000002 R11: 0000000000000202 R12: 00007ffcb73bda91
      [ 1843.504055] R13: 0000000000000003 R14: 00007ffcb73bda82 R15: ffffffffffffffff
      
      [ 1843.505268] Allocated by task 3979:
      [ 1843.505771]  save_stack+0x19/0x80
      [ 1843.506211]  __kasan_kmalloc.constprop.5+0xa0/0xd0
      [ 1843.506836]  setxattr+0xeb/0x230
      [ 1843.507264]  __x64_sys_fsetxattr+0xba/0xe0
      [ 1843.507886]  do_syscall_64+0x6c/0x230
      [ 1843.508429]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [ 1843.509558] Freed by task 0:
      [ 1843.510188] (stack is not available)
      
      [ 1843.511309] The buggy address belongs to the object at ffff888111e369e0
                      which belongs to the cache kmalloc-8 of size 8
      [ 1843.514095] The buggy address is located 2 bytes inside of
                      8-byte region [ffff888111e369e0, ffff888111e369e8)
      [ 1843.516524] The buggy address belongs to the page:
      [ 1843.517561] page:ffff88813f478d80 refcount:1 mapcount:0 mapping:ffff88811940c300 index:0xffff888111e373b8 compound_mapcount: 0
      [ 1843.519993] flags: 0x4404000010200(slab|head)
      [ 1843.520951] raw: 0004404000010200 ffff88813f48b008 ffff888119403d50 ffff88811940c300
      [ 1843.522616] raw: ffff888111e373b8 000000000016000f 00000001ffffffff 0000000000000000
      [ 1843.524281] page dumped because: kasan: bad access detected
      
      [ 1843.525936] Memory state around the buggy address:
      [ 1843.526975]  ffff888111e36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.528479]  ffff888111e36900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.530138] >ffff888111e36980: fc fc fc fc fc fc fc fc fc fc fc fc 02 fc fc fc
      [ 1843.531877]                                                        ^
      [ 1843.533287]  ffff888111e36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.534874]  ffff888111e36a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.536468] ==================================================================
      
      This is caused by supplying a compression value that is too short ('lz')
      in the test case and comparing it to 'lzo' with strncmp() and a length
      of 3. strncmp() reads past the end of 'lz' when looking for the 'o' and
      thus causes an out-of-bounds read.
      
      Introduce a new check 'btrfs_compress_is_valid_type()' which not only
      checks the user-supplied value against the known compression types, but
      also rejects values that are too short.
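
      The length check described above can be sketched in userspace C as
      follows. This is a minimal sketch, not the kernel implementation: the
      function name compress_type_is_valid() and the compress_types table are
      assumptions made for illustration, and the value is treated as a buffer
      of 'len' bytes without a terminating NUL, as an xattr value would be.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Names of the btrfs compression types (illustrative table). */
static const char *const compress_types[] = { "zlib", "lzo", "zstd" };

/*
 * Hypothetical userspace stand-in for btrfs_compress_is_valid_type().
 * 'value' holds exactly 'len' bytes and is NOT assumed to be
 * NUL-terminated.
 */
static bool compress_type_is_valid(const char *value, size_t len)
{
	size_t i;

	assert(value != NULL);

	for (i = 0; i < sizeof(compress_types) / sizeof(compress_types[0]); i++) {
		size_t type_len = strlen(compress_types[i]);

		/* Too short to hold this name: never read past 'len' bytes. */
		if (len < type_len)
			continue;

		/* Safe: at least 'type_len' bytes of 'value' are readable. */
		if (strncmp(value, compress_types[i], type_len) == 0)
			return true;
	}
	return false;
}
```

      With this check, a 2-byte value 'lz' is skipped for every type name
      before strncmp() is reached, so no byte beyond the buffer is read.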
      Reported-by: Nikolay Borisov <nborisov@suse.com>
      Fixes: 272e5326 ("btrfs: prop: fix vanished compression property after failed set")
      CC: stable@vger.kernel.org # 5.1+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      aa53e3bf
    • Filipe Manana's avatar
      Btrfs: fix data loss after inode eviction, renaming it, and fsync it · d1d832a0
      Filipe Manana authored
      When we log an inode, regardless of logging it completely or only that it
      exists, we always update it as logged (logged_trans and last_log_commit
      fields of the inode are updated). This is generally fine and avoids future
      attempts to log it from having to do repeated work that brings no value.
      
      However, if we write data to a file, then evict its inode after all the
      delalloc was flushed (and ordered extents completed), rename the file and
      fsync it, we end up not logging the new extents, since the rename may
      result in logging that the inode exists in case the parent directory was
      logged before. The following reproducer shows and explains how this can
      happen:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ touch /mnt/dir/bar
      
        # Do a direct IO write instead of a buffered write because with a
        # buffered write we would need to make sure delalloc gets flushed and
        # completes before we do the inode eviction later, and we can not do
        # that from user space with calls to things such as sync(2) since that
        # results in a transaction commit as well.
        $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar
      
        # Keep the directory dir in use while we evict inodes. We want our file
        # bar's inode to be evicted but we don't want our directory's inode to
        # be evicted (if it were evicted too, we would not be able to reproduce
        # the issue since the first fsync below, of file foo, would result in a
        # transaction commit).
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time for the background process to chdir.
        $ sleep 0.1
      
        # Evict all inodes, except the inode for the directory dir because it is
        # currently in use by our background process.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # fsync file foo, which ends up persisting information about the parent
        # directory because it is a new inode.
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Rename bar. This results in logging that this inode exists (inode
        # item, names, xattrs) because the parent directory is in the log.
        $ mv /mnt/dir/bar /mnt/dir/baz
      
        # Now fsync baz, which ends up doing absolutely nothing because of the
        # rename operation which logged that the inode exists only.
        $ xfs_io -c fsync /mnt/dir/baz
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/dir/baz
        0000000
      
          --> Empty file, data we wrote is missing.
      
      Fix this by not updating last_sub_trans of an inode when we are logging
      only that it exists and the inode was not yet logged since it was loaded
      from disk (full_sync bit set). This is enough to make
      btrfs_inode_in_log() return false for this scenario and make us log the
      inode. The logged_trans of the inode is still always set since that
      alone is used to track if names need to be deleted as part of unlink
      operations.
      
      Fixes: 257c62e1 ("Btrfs: avoid tree log commit when there are no changes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d1d832a0
    • David Sterba's avatar
      btrfs: raid56: clear incompat block group flags after removing the last one · 6d58a55a
      David Sterba authored
      The incompat bit for RAID56 is set either at mount time or automatically
      when the profile is used by balance. The code that clears the bit again
      is missing, which can be unexpected or undesired when an older kernel is
      needed.
      
      This patch will drop the incompat bit after these commands, assuming
      that the RAID5 profile is not used by system or metadata:
      
       $ btrfs balance start -dconvert=raid5 /mnt
       $ btrfs balance start -dconvert=raid1 /mnt
      
      This will print "clearing 128 feature flag" to the system log.
      
      The patch is safe for backporting to older kernels.
      Reported-by: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d58a55a
    • David Sterba's avatar
      btrfs: switch extent_buffer write_locks from atomic to int · 00801ae4
      David Sterba authored
      The write_locks is either 0 or 1 and always updated under the lock,
      so we don't need the atomic_t semantics.
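
      The reasoning can be illustrated with a small userspace sketch: a
      counter that is only ever read or written while a lock is held can be
      a plain int. The names struct model_eb, model_write_lock() and
      model_write_unlock() are hypothetical, and a pthread mutex stands in
      for the kernel lock purely for illustration.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the relevant part of an extent buffer. */
struct model_eb {
	pthread_mutex_t lock;
	int write_locks; /* plain int: only touched while 'lock' is held */
};

static void model_write_lock(struct model_eb *eb)
{
	pthread_mutex_lock(&eb->lock);
	/* The lock serializes all updates, so no atomic op is needed. */
	assert(eb->write_locks == 0);
	eb->write_locks++;
}

static void model_write_unlock(struct model_eb *eb)
{
	/* Still under the lock here, so the plain access is safe. */
	assert(eb->write_locks == 1);
	eb->write_locks--;
	pthread_mutex_unlock(&eb->lock);
}
```

      The counter only ever moves between 0 and 1 here, and every access
      happens between the lock and unlock calls.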
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      00801ae4
    • David Sterba's avatar
      btrfs: switch extent_buffer spinning_writers from atomic to int · f3dc24c5
      David Sterba authored
      The spinning_writers is either 0 or 1 and always updated under the lock,
      so we don't need the atomic_t semantics.
      Signed-off-by: David Sterba <dsterba@suse.com>
      f3dc24c5
    • David Sterba's avatar
      btrfs: switch extent_buffer blocking_writers from atomic to int · 06297d8c
      David Sterba authored
      The blocking_writers is either 0 or 1 and always updated under the lock,
      so we don't need the atomic_t semantics.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      06297d8c
    • David Sterba's avatar
      btrfs: assert delayed ref lock in btrfs_find_delayed_ref_head · 38e9372e
      David Sterba authored
      Turn the comment about required lock into an assertion.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      38e9372e
  2. 01 Jul, 2019 4 commits