1. 24 Sep, 2019 2 commits
    • Dennis Zhou's avatar
      btrfs: adjust dirty_metadata_bytes after writeback failure of extent buffer · eb5b64f1
      Dennis Zhou authored
      Before, if an eb failed to write out, we would end up triggering a
      BUG_ON(). As of f4340622 ("btrfs: extent_io: Move the BUG_ON() in
      flush_write_bio() one level up"), we no longer BUG_ON(), so we should
      make life consistent and add back the unwritten bytes to
      dirty_metadata_bytes.
      
      Fixes: f4340622 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
      CC: stable@vger.kernel.org # 5.2+
      Reviewed-by: Filipe Manana <fdmanana@kernel.org>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: David Sterba <dsterba@suse.com>
      eb5b64f1
    • Filipe Manana's avatar
      Btrfs: fix selftests failure due to uninitialized i_mode in test inodes · 9f7fec0b
      Filipe Manana authored
      Some of the self tests create a test inode, setup some extents and then do
      calls to btrfs_get_extent() to test that the corresponding extent maps
      exist and are correct. However btrfs_get_extent(), since the 5.2 merge
      window, now errors out when it finds a regular or prealloc extent for an
      inode that does not correspond to a regular file (its ->i_mode is not
      S_IFREG). This causes the self tests to fail sometimes, especially when
      KASAN, slub_debug and page poisoning are enabled:
      
        $ modprobe btrfs
        modprobe: ERROR: could not insert 'btrfs': Invalid argument
      
        $ dmesg
        [ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, integrity-checker=on, ref-verify=on
        [ 9414.692655] BTRFS: selftest: sectorsize: 4096  nodesize: 4096
        [ 9414.692658] BTRFS: selftest: running btrfs free space cache tests
        [ 9414.692918] BTRFS: selftest: running extent only tests
        [ 9414.693061] BTRFS: selftest: running bitmap only tests
        [ 9414.693366] BTRFS: selftest: running bitmap and extent tests
        [ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent tests
        [ 9414.697131] BTRFS: selftest: running extent buffer operation tests
        [ 9414.697133] BTRFS: selftest: running btrfs_split_item tests
        [ 9414.697564] BTRFS: selftest: running extent I/O tests
        [ 9414.697583] BTRFS: selftest: running find delalloc tests
        [ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test
        [ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests
        [ 9415.124192] BTRFS: selftest: running inode tests
        [ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests
        [ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test
        [ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent found for non-regular inode 256
        [ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a real extent, got 0
      
      This happens because the test inodes are created without ever initializing
      the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs
      callback btrfs_alloc_inode() initializes the i_mode. Initialization of the
      i_mode is done through the various callbacks used by the VFS to create
      new inodes (regular files, directories, symlinks, tmpfiles, etc), which
      all call btrfs_new_inode() which in turn calls inode_init_owner(), which
      sets the inode's i_mode. Since the tests only use new_inode() to create
      the test inodes, the i_mode was never initialized.
      
      This always happens on a VM I used with kasan, slub_debug and many other
      debug facilities enabled. It also happened to someone who reported this
      on bugzilla (on a 5.3-rc).
      
      Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode().
      
      Fixes: 6bf9e4bd ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference")
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9f7fec0b
  2. 09 Sep, 2019 38 commits
    • Nikolay Borisov's avatar
      btrfs: Relinquish CPUs in btrfs_compare_trees · 6af112b1
      Nikolay Borisov authored
      When doing any form of incremental send the parent and the child trees
      need to be compared via btrfs_compare_trees. This can result in long
      loop chains without ever relinquishing the CPU. This causes the softlockup
      detector to trigger when comparing trees with a lot of items. Example
      report:
      
      watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153]
      CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased)
      Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      pstate: 40000005 (nZcv daif -PAN -UAO)
      pc : __ll_sc_arch_atomic_sub_return+0x14/0x20
      lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs]
      sp : ffff00001273b7e0
      Call trace:
       __ll_sc_arch_atomic_sub_return+0x14/0x20
       release_extent_buffer+0xdc/0x120 [btrfs]
       free_extent_buffer.part.0+0xb0/0x118 [btrfs]
       free_extent_buffer+0x24/0x30 [btrfs]
       btrfs_release_path+0x4c/0xa0 [btrfs]
       btrfs_free_path.part.0+0x20/0x40 [btrfs]
       btrfs_free_path+0x24/0x30 [btrfs]
       get_inode_info+0xa8/0xf8 [btrfs]
       finish_inode_if_needed+0xe0/0x6d8 [btrfs]
       changed_cb+0x9c/0x410 [btrfs]
       btrfs_compare_trees+0x284/0x648 [btrfs]
       send_subvol+0x33c/0x520 [btrfs]
       btrfs_ioctl_send+0x8a0/0xaf0 [btrfs]
       btrfs_ioctl+0x199c/0x2288 [btrfs]
       do_vfs_ioctl+0x4b0/0x820
       ksys_ioctl+0x84/0xb8
       __arm64_sys_ioctl+0x28/0x38
       el0_svc_common.constprop.0+0x7c/0x188
       el0_svc_handler+0x34/0x90
       el0_svc+0x8/0xc
      
      Fix this by adding a call to cond_resched at the beginning of the main
      loop in btrfs_compare_trees.
      
      Fixes: 7069830a ("Btrfs: add btrfs_compare_trees function")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6af112b1
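As a rough userspace analogue of the fix, a long comparison loop can offer the CPU to other tasks at the top of each iteration. The sketch below uses POSIX `sched_yield()` in place of the kernel's `cond_resched()`; `compare_items()` and its array-based "trees" are hypothetical and far simpler than the real tree walk.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Counts the positions at which two equally sized arrays differ.
 * The yield at the top of the loop stands in for cond_resched().
 */
static size_t compare_items(const int *left, const int *right, size_t n)
{
	size_t changed = 0;

	assert(n == 0 || (left != NULL && right != NULL));
	for (size_t i = 0; i < n; i++) {
		/* Give other runnable tasks a chance before doing more work. */
		sched_yield();
		if (left[i] != right[i])
			changed++;
	}
	return changed;
}
```

Yielding on every iteration keeps the sketch short; unlike `sched_yield()`, the kernel's `cond_resched()` only reschedules when it is actually needed.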
    • Nikolay Borisov's avatar
      btrfs: Don't assign retval of btrfs_try_tree_write_lock/btrfs_tree_read_lock_atomic · 65e99c43
      Nikolay Borisov authored
      Those functions are simple boolean predicates, so there is no need to
      assign their return values to interim variables. Use them directly as
      predicates. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      65e99c43
    • Johannes Thumshirn's avatar
      btrfs: create structure to encode checksum type and length · af024ed2
      Johannes Thumshirn authored
      Create a structure to encode the type and length for the known on-disk
      checksums.  This makes it easier to add new checksums later.
      
      The structure and helpers are moved from ctree.h so they don't occupy
      space in all headers including ctree.h. This saves some space in the
      final object.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      af024ed2
    • Johannes Thumshirn's avatar
      btrfs: turn checksum type define into an enum · e35b79a1
      Johannes Thumshirn authored
      Turn the checksum type definition into an enum. This eases later addition
      of new checksums.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e35b79a1
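The idea behind this commit and the structure added in af024ed2 can be sketched as an enum of checksum types indexing a table of per-type properties. All identifiers below are hypothetical and not the kernel's; the only checksum listed is CRC32C with a 4-byte length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; adding a checksum means adding one enumerator. */
enum csum_type {
	CSUM_TYPE_CRC32 = 0,
	CSUM_TYPE_COUNT
};

/* Per-type properties; more fields could be added later. */
struct csum_info {
	uint16_t size;	/* on-disk checksum length in bytes */
};

static const struct csum_info csum_table[CSUM_TYPE_COUNT] = {
	[CSUM_TYPE_CRC32] = { .size = 4 },
};

/* Catch a table that falls out of sync with the enum at compile time. */
static_assert(sizeof(csum_table) / sizeof(csum_table[0]) == CSUM_TYPE_COUNT,
	      "csum_table must have one entry per checksum type");

/* Returns the checksum length, or 0 for an unknown type. */
static uint16_t csum_type_size(unsigned int type)
{
	if (type >= CSUM_TYPE_COUNT)
		return 0;
	return csum_table[type].size;
}
```

With this layout, supporting a new checksum amounts to one new enumerator and one new table entry, which is the kind of extension both commit messages anticipate.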
    • Josef Bacik's avatar
      btrfs: add enospc debug messages for ticket failure · 84fe47a4
      Josef Bacik authored
      When debugging weird enospc problems it's handy to be able to dump the
      space info when we wake up all tickets, and see what the ticket values
      are.  This helped me figure out cases where we were enospc'ing when we
      shouldn't have been.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      84fe47a4
    • Josef Bacik's avatar
      btrfs: do not account global reserve in can_overcommit · 0096420a
      Josef Bacik authored
      We ran into a problem in production where a box with plenty of space was
      getting wedged doing ENOSPC flushing.  These boxes only had 20% of the
      disk allocated, but their metadata space + global reserve was right at
      the size of their metadata chunk.
      
      In this case can_overcommit should be allowing allocations without
      problem, but there's logic in can_overcommit that doesn't allow us to
      overcommit if there's not enough real space to satisfy the global
      reserve.
      
      This is for historical reasons.  Before there were only certain places
      we could allocate chunks.  We could go to commit the transaction and not
      have enough space for our pending delayed refs and such and be unable to
      allocate a new chunk.  This would result in an abort because of ENOSPC.
      This code was added to solve this problem.
      
      However since then we've gained the ability to always be able to
      allocate a chunk.  So we can easily overcommit in these cases without
      risking a transaction abort because of ENOSPC.
      
      Also prior to now the global reserve really would be used because that's
      the space we relied on for delayed refs.  With delayed refs being
      tracked separately we no longer have to worry about running out of
      delayed refs space while committing.  We are much less likely to
      exhaust our global reserve space during transaction commit.
      
      Fix the can_overcommit code to simply see if our current usage + what we
      want is less than our current free space plus whatever slack space we
      have in the disk.  This solves the problem we were seeing in
      production and keeps us from flushing as aggressively as we approach our
      actual metadata size usage.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0096420a
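The simplified check described in this commit can be sketched as a single comparison. This is a minimal sketch under assumptions: `can_overcommit_sketch()` and its parameters are hypothetical, the comparison is assumed to be against the space info's total size plus the unallocated slack, and the global reserve deliberately plays no part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * used:        bytes already accounted in the space info
 * bytes:       size of the new reservation
 * total_bytes: size of the space info (assumed meaning)
 * slack:       unallocated disk space that could still back a new chunk
 */
static bool can_overcommit_sketch(uint64_t used, uint64_t bytes,
				  uint64_t total_bytes, uint64_t slack)
{
	/* The sketch assumes the sums do not overflow. */
	assert(used <= UINT64_MAX - bytes);
	assert(total_bytes <= UINT64_MAX - slack);

	return used + bytes < total_bytes + slack;
}
```

For example, with 95 units used out of 100, a 10-unit request is refused when there is no slack but allowed when 20 units of unallocated space remain.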
    • Josef Bacik's avatar
      btrfs: use btrfs_try_granting_tickets in update_global_rsv · 426551f6
      Josef Bacik authored
      We have some annoying xfstests tests that will create a very small fs,
      fill it up, delete it, and repeat to make sure everything works right.
      This trips btrfs up sometimes because we may commit a transaction to
      free space, but most of the free metadata space was being reserved by
      the global reserve.  So we commit and update the global reserve, but the
      space is simply added to bytes_may_use directly, instead of trying to
      add it to existing tickets.  This results in ENOSPC when we really did
      have space.  Fix this by calling btrfs_try_granting_tickets once we add
      back our excess space to wake any pending tickets.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      426551f6
    • Josef Bacik's avatar
      btrfs: always reserve our entire size for the global reserve · d792b0f1
      Josef Bacik authored
      While messing with the overcommit logic I noticed that sometimes we'd
      ENOSPC out when really we should have run out of space much earlier.  It
      turns out it's because we'll only reserve up to the free amount left in
      the space info for the global reserve, but that doesn't make sense with
      overcommit because we could be well above our actual size.  This results
      in the global reserve not carving out its entire reservation, and thus
      not putting enough pressure on the rest of the infrastructure to do the
      right thing and ENOSPC out at a convenient time.  Fix this by always
      taking our full reservation amount for the global reserve.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d792b0f1
    • Josef Bacik's avatar
      btrfs: change the minimum global reserve size · 3593ce30
      Josef Bacik authored
      It made sense to have the global reserve set at 16M in the past, but
      since it is used less nowadays set the minimum size to the number of
      items we'll need to update the main trees we update during a transaction
      commit, plus some slop area so we can do unlinks if we need to.
      
      In practice this doesn't affect normal file systems, but for xfstests
      where we do things like fill up a fs and then rm * it can fall over in
      weird ways.  This gives us saner behavior at extremely small
      file system sizes.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3593ce30
    • Josef Bacik's avatar
      btrfs: rename btrfs_space_info_add_old_bytes · d05e4649
      Josef Bacik authored
      This name doesn't really fit with how the space reservation stuff works
      now, rename it to btrfs_space_info_free_bytes_may_use so it's clear what
      the function is doing.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d05e4649
    • Josef Bacik's avatar
      btrfs: remove orig_bytes from reserve_ticket · def936e5
      Josef Bacik authored
      Now that we do not do partial filling of tickets simply remove
      orig_bytes, it is no longer needed.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      def936e5
    • Josef Bacik's avatar
      btrfs: fix may_commit_transaction to deal with no partial filling · 00c0135e
      Josef Bacik authored
      Now that we aren't partially filling tickets we may have some slack
      space left in the space_info.  We need to account for this in
      may_commit_transaction, otherwise we may choose to not commit the
      transaction despite it actually having enough space to satisfy our
      ticket.
      
      Calculate the free space we have in the space_info, if any, and subtract
      this from the ticket we have and use that amount to determine if we will
      need to commit to reclaim enough space.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      00c0135e
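The adjustment described in this commit is simple arithmetic: any free space already sitting in the space_info reduces how much a commit has to reclaim for the ticket. A minimal sketch, with a hypothetical function name and parameters:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns how many bytes a transaction commit would still have to
 * reclaim to satisfy a ticket, after crediting the free space
 * (total - used) that is already available.
 */
static uint64_t bytes_needed_from_commit(uint64_t ticket_bytes,
					 uint64_t total, uint64_t used)
{
	/* When overcommitted (used >= total) there is no free space. */
	uint64_t free_bytes = used < total ? total - used : 0;
	uint64_t needed = ticket_bytes > free_bytes ?
			  ticket_bytes - free_bytes : 0;

	assert(needed <= ticket_bytes);
	return needed;
}
```

A 10-unit ticket with 4 units already free only needs 6 more units from the commit, so a commit that can reclaim 6 units is worth doing even though it cannot reclaim 10.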
    • Josef Bacik's avatar
      btrfs: rework wake_all_tickets · 2341ccd1
      Josef Bacik authored
      Now that we no longer partially fill tickets we need to rework
      wake_all_tickets to call btrfs_try_to_wakeup_tickets() in order to see
      if any subsequent tickets are able to be satisfied.  If our tickets_id
      changes we know something happened and we can keep flushing.
      
      Also if we find a ticket that is smaller than the first ticket in our
      queue then we want to retry the flushing loop again in case
      may_commit_transaction() decides we could satisfy the ticket by
      committing the transaction.
      
      Rename this to maybe_fail_all_tickets() while we're at it, to better
      reflect what the function is actually doing.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2341ccd1
    • Josef Bacik's avatar
      btrfs: refactor the ticket wakeup code · 18fa2284
      Josef Bacik authored
      Now that btrfs_space_info_add_old_bytes simply checks if we can make the
      reservation and updates bytes_may_use, there's no reason to have both
      helpers in place.
      
      Factor out the ticket wakeup logic into its own helper, make
      btrfs_space_info_add_old_bytes() update bytes_may_use and then call the
      wakeup helper, and replace all calls to btrfs_space_info_add_new_bytes()
      with the wakeup helper.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      18fa2284
    • Josef Bacik's avatar
      btrfs: stop partially refilling tickets when releasing space · 91182645
      Josef Bacik authored
      btrfs_space_info_add_old_bytes is used when adding the extra space from
      an existing reservation back into the space_info to be used by any
      waiting tickets.  In order to keep us from overcommitting we check to
      make sure that we can still use this space for our reserve ticket, and
      if we cannot we'll simply subtract it from space_info->bytes_may_use.
      
      However this is problematic, because it assumes that only changes to
      bytes_may_use would affect our ability to make reservations.  Any
      changes to bytes_reserved would be missed.  If we were unable to make a
      reservation prior because of reserved space, but that reserved space was
      freed due to unlink or truncate and we were allowed to immediately
      reclaim that metadata space we would still ENOSPC.
      
      Consider the example where we create a file with a bunch of extents,
      using up 2MiB of actual space for the new tree blocks.  Then we try to
      make a reservation of 2MiB but we do not have enough space to make this
      reservation.  The iput() occurs in another thread and we remove this
      space, and since we did not write the blocks we simply do
      space_info->bytes_reserved -= 2MiB.  We would never see this because we
      do not check our space info used, we just try to re-use the freed
      reservations.
      
      To fix this problem, and to greatly simplify the wakeup code, do away
      with this partial refilling nonsense.  Use
      btrfs_space_info_add_old_bytes to subtract the reservation from
      space_info->bytes_may_use, and then check the ticket against the total
      used of the space_info the same way we do with the initial reservation
      attempt.
      
      This keeps the reservation logic consistent and solves the problem of
      early ENOSPC in the case that we free up space in places other than
      bytes_may_use and bytes_pinned.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      91182645
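The "all or nothing" granting described in this commit can be sketched as follows. The types and the function are hypothetical simplifications: the ticket queue is a plain array, `used` stands for the total usage of the space_info from every counter, and overcommit is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ticket {
	uint64_t bytes;	/* size of the pending reservation */
	bool granted;
};

struct space {
	uint64_t total;	/* size of the space info */
	uint64_t used;	/* everything currently accounted, from any counter */
};

/*
 * Walks the queue in FIFO order and grants each ticket in full if it
 * fits against the current total usage. Stops at the first ticket that
 * does not fit, so later tickets never jump the queue.
 * Returns the number of tickets granted by this call.
 */
static size_t try_granting_tickets(struct space *s, struct ticket *queue,
				   size_t n)
{
	size_t granted = 0;

	assert(s != NULL);
	for (size_t i = 0; i < n; i++) {
		if (queue[i].granted)
			continue;
		if (s->used + queue[i].bytes > s->total)
			break;
		s->used += queue[i].bytes;
		queue[i].granted = true;
		granted++;
	}
	return granted;
}
```

Because the check looks at total usage, space freed through any counter is noticed the next time the helper runs, which is the early ENOSPC case this commit sets out to fix.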
    • Josef Bacik's avatar
      btrfs: add space reservation tracepoint for reserved bytes · a43c3835
      Josef Bacik authored
      I noticed when folding the trace_btrfs_space_reservation() tracepoint
      into the btrfs_space_info_update_* helpers that we didn't emit a
      tracepoint when doing btrfs_add_reserved_bytes().  I know this is
      because we were swapping bytes_may_use for bytes_reserved, so in my mind
      there was no reason to have the tracepoint there.  But now there is
      because we always emit the unreserve for the bytes_may_use side, and
      this would have broken if compression was on anyway.  Add a tracepoint
      to cover the bytes_reserved counter so the math still comes out right.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a43c3835
    • Josef Bacik's avatar
      btrfs: roll tracepoint into btrfs_space_info_update helper · f3e75e38
      Josef Bacik authored
      We duplicate this tracepoint everywhere we call these helpers, so update
      the helper to have the tracepoint as well.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f3e75e38
    • Josef Bacik's avatar
      btrfs: do not allow reservations if we have pending tickets · ef1317a1
      Josef Bacik authored
      If we already have tickets on the list we don't want to steal their
      reservations.  This is a preparation patch for upcoming changes,
      technically this shouldn't happen today because of the way we add bytes
      to tickets before adding them to the space_info in most cases.
      
      This does not change the FIFO nature of reserve tickets, it simply
      allows us to enforce it in a different way.  Previously it was enforced
      because any new space would be added to the first ticket on the list,
      which would result in new reservations getting a reserve ticket.  This
      replaces that mechanism by simply checking to see if we have outstanding
      reserve tickets and skipping straight to adding a ticket for our
      reservation.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ef1317a1
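The FIFO rule from this commit can be sketched as a guard on the fast path: a new reservation is only satisfied directly when nobody is already waiting. The enum, function and parameters below are hypothetical, and overcommit is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum reserve_result {
	RESERVE_GRANTED,	/* satisfied immediately */
	RESERVE_QUEUED		/* caller must add a ticket and wait */
};

/*
 * Grants the reservation immediately only when the ticket list is empty
 * and the request fits; otherwise the caller queues behind the
 * existing tickets.
 */
static enum reserve_result try_reserve(uint64_t *used, uint64_t total,
				       size_t pending_tickets, uint64_t bytes)
{
	assert(used != NULL);
	if (pending_tickets == 0 && *used + bytes <= total) {
		*used += bytes;
		return RESERVE_GRANTED;
	}
	return RESERVE_QUEUED;
}
```

Even a request that would fit is queued when tickets are pending, so it cannot take space that older tickets are waiting for.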
    • Omar Sandoval's avatar
      btrfs: stop clearing EXTENT_DIRTY in inode I/O tree · e182163d
      Omar Sandoval authored
      Since commit fee187d9 ("Btrfs: do not set EXTENT_DIRTY along with
      EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we
      can simplify and stop trying to clear it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e182163d
    • Omar Sandoval's avatar
      btrfs: treat RWF_{,D}SYNC writes as sync for CRCs · f50cb7af
      Omar Sandoval authored
      The VFS indicates a synchronous write to ->write_iter() via
      iocb->ki_flags. The IOCB_{,D}SYNC flags may be set based on the file
      (see iocb_flags()) or the RWF_* flags passed to a syscall like
      pwritev2() (see kiocb_set_rw_flags()).
      
      However, in btrfs_file_write_iter(), we're checking if a write is
      synchronous based only on the file; we use this to decide when to bump
      the sync_writers counter and thus do CRCs synchronously. Make sure we do
      this for all synchronous writes as determined by the VFS.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add const ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      f50cb7af
    • Omar Sandoval's avatar
      btrfs: use correct count in btrfs_file_write_iter() · c09767a8
      Omar Sandoval authored
      generic_write_checks() may modify iov_iter_count(), so we must get the
      count after the call, not before. Using the wrong one has a couple of
      consequences:
      
      1. We check a longer range in check_can_nocow() for nowait than we're
         actually writing.
      2. We create extra hole extent maps in btrfs_cont_expand(). As far as I
         can tell, this is harmless, but I might be missing something.
      
      These issues are pretty minor, but let's fix it before something more
      important trips on it.
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c09767a8
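The ordering issue in this commit can be illustrated with a check that may shorten a write. In the sketch below `clamp_write()` is a hypothetical stand-in for the behavior attributed to generic_write_checks(), and `struct write_req` is a hypothetical stand-in for the iterator; the point is only that the count is read after the check has run.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical request: write `count` bytes at offset `pos`. */
struct write_req {
	size_t pos;
	size_t count;
};

/* Stand-in for a check that may shorten the write to fit a size limit. */
static ssize_t clamp_write(struct write_req *req, size_t limit)
{
	assert(req != NULL);
	if (req->pos >= limit)
		return -EFBIG;
	if (req->count > limit - req->pos)
		req->count = limit - req->pos;
	return (ssize_t)req->count;
}

/*
 * Runs the check first and only then reads the count, so the range
 * used by later steps matches what will actually be written.
 */
static ssize_t file_write_sketch(struct write_req *req, size_t limit,
				 size_t *range_checked)
{
	ssize_t ret = clamp_write(req, limit);

	if (ret <= 0)
		return ret;
	*range_checked = req->count;	/* count taken after the check */
	return ret;
}
```

Had the count been captured before the call, a 50-byte request at offset 90 with a limit of 100 would have led later steps to examine 50 bytes even though only 10 are written.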
    • David Sterba's avatar
      btrfs: tie extent buffer and its token together · c82f823c
      David Sterba authored
      Further simplification of the get/set helpers is possible when the token
      is uniquely tied to an extent buffer. A condition and an assignment can
      be avoided.
      
      The initializations are moved closer to the first use when the extent
      buffer is valid. There's one exception in __push_leaf_left where the
      token is reused.
      Signed-off-by: David Sterba <dsterba@suse.com>
      c82f823c
    • David Sterba's avatar
      btrfs: assume valid token for btrfs_set/get_token helpers · 48bc3950
      David Sterba authored
      Now that we can safely assume that the token is always a valid pointer,
      remove the branches that check that.
      Signed-off-by: David Sterba <dsterba@suse.com>
      48bc3950
    • David Sterba's avatar
      btrfs: define separate btrfs_set/get_XX helpers · cb495113
      David Sterba authored
      There are helpers for all type widths defined via macro and optionally
      can use a token which is a cached pointer to avoid repeated mapping of
      the extent buffer.
      
      The token value is known at compile time, when it's valid it's always
      address of a local variable, otherwise it's NULL passed by the
      token-less helpers.
      
      This can be utilized to remove some branching as the helpers are used
      frequently.
      Signed-off-by: David Sterba <dsterba@suse.com>
      cb495113
    • Nikolay Borisov's avatar
      btrfs: Make btrfs_find_name_in_ext_backref return struct btrfs_inode_extref · 6ff49c6a
      Nikolay Borisov authored
      btrfs_find_name_in_ext_backref returns either 0/1 depending on whether it
      found a backref for the given name. If it returns true then the actual
      inode_extref struct is returned in one of its parameters. That's pointless;
      instead refactor the function such that it returns either a pointer
      to the btrfs_inode_extref or NULL if it didn't find anything. This
      streamlines the function calling convention.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6ff49c6a
    • Nikolay Borisov's avatar
      btrfs: Make btrfs_find_name_in_backref return btrfs_inode_ref struct · 9bb8407f
      Nikolay Borisov authored
      btrfs_find_name_in_backref returns either 0/1 depending on whether it
      found a backref for the given name. If it returns true then the actual
      inode_ref struct is returned in one of its parameters. That's pointless,
      instead refactor the function such that it returns either a pointer
      to the btrfs_inode_ref or NULL if it didn't find anything. This
      streamlines the function calling convention.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9bb8407f
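The calling convention adopted by this commit and by 6ff49c6a can be sketched with a generic lookup: instead of returning 0/1 and filling an output parameter, the function returns the matching entry or NULL. `struct name_ref` and `find_name_in_refs()` are hypothetical and unrelated to the on-disk btrfs structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reference entry: a name plus an index. */
struct name_ref {
	const char *name;
	unsigned long index;
};

/*
 * Returns a pointer to the entry whose name matches, or NULL if there
 * is none. The return value carries both "found?" and the result.
 */
static const struct name_ref *find_name_in_refs(const struct name_ref *refs,
						size_t n, const char *name)
{
	assert(name != NULL);
	for (size_t i = 0; i < n; i++) {
		if (strcmp(refs[i].name, name) == 0)
			return &refs[i];
	}
	return NULL;
}
```

A caller then writes a single `if (ref)` test rather than checking a return code and separately reading an output parameter.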
    • David Sterba's avatar
      btrfs: move dev_stats helpers to volumes.c · 1dc990df
      David Sterba authored
      The other dev stats functions are already there and the helpers are not
      used by anything else.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1dc990df
    • David Sterba's avatar
      btrfs: move struct io_ctl to free-space-cache.h · 67b61aef
      David Sterba authored
      The io_ctl structure is used for free space management, and used only by
      the v1 space cache code, but unfortunately the full definition is
      required by block-group.h so it can't be moved to free-space-cache.c
      without additional changes.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      67b61aef
    • David Sterba's avatar
      btrfs: move functions for tree compare to send.c · 18d0f5c6
      David Sterba authored
      Send is the only user of tree_compare, we can move it there along with
      the other helpers and definitions.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      18d0f5c6
    • David Sterba's avatar
      btrfs: rename and export read_node_slot · 4b231ae4
      David Sterba authored
      Preparatory work for code that will be moved out of ctree and uses this
      function.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4b231ae4
    • David Sterba's avatar
      784352fe
    • David Sterba's avatar
      btrfs: move cond_wake_up functions out of ctree · 602cbe91
      David Sterba authored
      The file ctree.h serves as a header for everything and has become quite
      bloated. Split some helpers that are generic and create a new file that
      should be the catch-all for code that's not btrfs-specific.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      602cbe91
    • Anand Jain's avatar
      btrfs: use proper error values on allocation failure in clone_fs_devices · d2979aa2
      Anand Jain authored
      Return the actual error code from clone_fs_devices() on allocation
      failure instead of a hard-coded ENOMEM.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d2979aa2
    • Anand Jain's avatar
      btrfs: proper error handling when invalid device is found in find_next_devid · a06dee4d
      Anand Jain authored
      In a corrupted tree, if search for next devid finds the device with
      devid = -1, then report the error -EUCLEAN back to the parent function
      to fail gracefully.
      
      The tree checker will not catch this in case the devids are created
      using the following script:
      
        umount /btrfs
        dev1=/dev/sdb
        dev2=/dev/sdc
        mkfs.btrfs -fq -dsingle -msingle $dev1
        mount $dev1 /btrfs
      
        _fail()
        {
      	  echo $1
      	  exit 1
        }
      
        while true; do
      	  btrfs dev add -f $dev2 /btrfs || _fail "add failed"
      	  btrfs dev del $dev1 /btrfs || _fail "del failed"
      	  dev_tmp=$dev1
      	  dev1=$dev2
      	  dev2=$dev_tmp
        done
      
      With output:
      
        BTRFS critical (device sdb): corrupt leaf: root=3 block=313739198464 slot=1 devid=1 invalid devid: has=507 expect=[0, 506]
        BTRFS error (device sdb): block=313739198464 write time tree block corruption detected
        BTRFS: error (device sdb) in btrfs_commit_transaction:2268: errno=-5 IO failure (Error while writing out transaction)
        BTRFS warning (device sdb): Skipping commit of aborted transaction.
        BTRFS: error (device sdb) in cleanup_transaction:1827: errno=-5 IO failure
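
      The graceful failure described at the top of this message can be
      sketched in userspace C as follows. This is an illustration under
      stated assumptions, not the kernel function: find_next_devid_sketch()
      and its array-based interface are hypothetical, and the fallback
      value for EUCLEAN is only needed on platforms whose errno.h lacks it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value ("Structure needs cleaning"); fallback only */
#endif

/*
 * Hypothetical sketch: compute the next free devid from a list of
 * existing devids. A devid of (u64)-1 cannot be incremented, so it is
 * treated as corruption and reported to the caller instead of crashing.
 */
static int find_next_devid_sketch(const uint64_t *devids, size_t n,
                                  uint64_t *next)
{
    uint64_t max = 0;

    assert(next != NULL);
    for (size_t i = 0; i < n; i++) {
        if (devids[i] == UINT64_MAX)
            return -EUCLEAN; /* corrupted: let the caller fail gracefully */
        if (devids[i] > max)
            max = devids[i];
    }
    *next = max + 1;
    return 0;
}
```

      The caller checks the return value and aborts the operation on
      -EUCLEAN rather than using an overflowed devid.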
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      [ add script and messages ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      a06dee4d
    • Christophe Leroy's avatar
      btrfs: fix allocation of free space cache v1 bitmap pages · 3acd4850
      Christophe Leroy authored
      Various notifications of type "BUG kmalloc-4096 () : Redzone
      overwritten" have been observed recently in various parts of the kernel.
      After some time, they were found to be related to the use of a BTRFS
      filesystem with SLUB_DEBUG turned on.
      
      [   22.809700] BUG kmalloc-4096 (Tainted: G        W        ): Redzone overwritten
      
      [   22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
      [   22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224
      [   22.811193] 	__slab_alloc.constprop.26+0x44/0x70
      [   22.811345] 	kmem_cache_alloc_trace+0xf0/0x2ec
      [   22.811588] 	__load_free_space_cache+0x588/0x780 [btrfs]
      [   22.811848] 	load_free_space_cache+0xf4/0x1b0 [btrfs]
      [   22.812090] 	cache_block_group+0x1d0/0x3d0 [btrfs]
      [   22.812321] 	find_free_extent+0x680/0x12a4 [btrfs]
      [   22.812549] 	btrfs_reserve_extent+0xec/0x220 [btrfs]
      [   22.812785] 	btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
      [   22.813032] 	__btrfs_cow_block+0x150/0x5d4 [btrfs]
      [   22.813262] 	btrfs_cow_block+0x194/0x298 [btrfs]
      [   22.813484] 	commit_cowonly_roots+0x44/0x294 [btrfs]
      [   22.813718] 	btrfs_commit_transaction+0x63c/0xc0c [btrfs]
      [   22.813973] 	close_ctree+0xf8/0x2a4 [btrfs]
      [   22.814107] 	generic_shutdown_super+0x80/0x110
      [   22.814250] 	kill_anon_super+0x18/0x30
      [   22.814437] 	btrfs_kill_super+0x18/0x90 [btrfs]
      [   22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
      [   22.814841] 	proc_cgroup_show+0xc0/0x248
      [   22.814967] 	proc_single_show+0x54/0x98
      [   22.815086] 	seq_read+0x278/0x45c
      [   22.815190] 	__vfs_read+0x28/0x17c
      [   22.815289] 	vfs_read+0xa8/0x14c
      [   22.815381] 	ksys_read+0x50/0x94
      [   22.815475] 	ret_from_syscall+0x0/0x38
      
      Commit 69d24804 ("btrfs: use copy_page for copying pages instead of
      memcpy") changed the way bitmap blocks are copied. But although bitmaps
      have the size of a page, they were allocated with kzalloc().
      
      Most of the time, kzalloc() allocates aligned blocks of memory, so
      copy_page() can be used. But when some debug options like SLAB_DEBUG are
      activated, kzalloc() may return an unaligned pointer.
      
      On powerpc, memcpy(), copy_page() and other copying functions use the
      'dcbz' instruction, which provides an entire zeroed cacheline to avoid
      a memory read when the intention is to overwrite a full line. Functions
      like memcpy() are written to care about partial cachelines at the start
      and end of the destination, but copy_page() assumes it gets pages. As
      pages are naturally cache aligned, copy_page() doesn't care about
      partial lines. This means that when copy_page() is called with a
      misaligned pointer, a few leading bytes are zeroed.
      
      To fix it, allocate bitmaps through kmem_cache instead of using kzalloc().
      The cache pool is created with PAGE_SIZE alignment constraint.
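
      As a rough userspace analogy (an assumption-laden sketch, not the
      kernel change itself), the same guarantee can be obtained by asking
      the allocator for page alignment explicitly instead of relying on
      what a general-purpose allocator happens to return. The helper names
      below are hypothetical; posix_memalign() and sysconf() are standard
      POSIX calls.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Query the system page size at run time. */
static size_t page_size(void)
{
    long ps = sysconf(_SC_PAGESIZE);

    assert(ps > 0);
    return (size_t)ps;
}

/* Return nonzero if the pointer sits on a page boundary. */
static int is_page_aligned(const void *p)
{
    return ((uintptr_t)p % page_size()) == 0;
}

/*
 * Hypothetical helper: allocate one zeroed, page-sized, page-aligned
 * buffer. The alignment is requested explicitly, so page-granular copy
 * routines can be used on the result regardless of allocator debugging.
 */
static void *alloc_bitmap_page(void)
{
    void *p = NULL;
    size_t ps = page_size();

    if (posix_memalign(&p, ps, ps) != 0)
        return NULL;
    memset(p, 0, ps);
    return p;
}
```

      A buffer from alloc_bitmap_page() is released with free(). In the
      kernel, the alignment argument given when the cache is created plays
      the role that the second argument of posix_memalign() plays here.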
      Reported-by: Erhard F. <erhard_f@mailbox.org>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371
      Fixes: 69d24804 ("btrfs: use copy_page for copying pages instead of memcpy")
      Cc: stable@vger.kernel.org # 4.19+
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ rename to btrfs_free_space_bitmap ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      3acd4850
    • Qu Wenruo's avatar
      btrfs: Detect unbalanced tree with empty leaf before crashing btree operations · 62fdaa52
      Qu Wenruo authored
      [BUG]
      With crafted image, btrfs will panic at btree operations:
      
        kernel BUG at fs/btrfs/ctree.c:3894!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 0 PID: 1138 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
        RIP: 0010:__push_leaf_left+0x6b6/0x6e0
        RSP: 0018:ffffc0bd4128b990 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffffa0a4ab8f0e38 RCX: 0000000000000000
        RDX: ffffa0a280000000 RSI: 0000000000000000 RDI: ffffa0a4b3814000
        RBP: ffffc0bd4128ba38 R08: 0000000000001000 R09: ffffc0bd4128b948
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000240
        R13: ffffa0a4b556fb60 R14: ffffa0a4ab8f0af0 R15: ffffa0a4ab8f0af0
        FS: 0000000000000000(0000) GS:ffffa0a4b7a00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f2461c80020 CR3: 000000022b32a006 CR4: 00000000000206f0
        Call Trace:
        ? _cond_resched+0x1a/0x50
        push_leaf_left+0x179/0x190
        btrfs_del_items+0x316/0x470
        btrfs_del_csums+0x215/0x3a0
        __btrfs_free_extent.isra.72+0x5a7/0xbe0
        __btrfs_run_delayed_refs+0x539/0x1120
        btrfs_run_delayed_refs+0xdb/0x1b0
        btrfs_commit_transaction+0x52/0x950
        ? start_transaction+0x94/0x450
        transaction_kthread+0x163/0x190
        kthread+0x105/0x140
        ? btrfs_cleanup_transaction+0x560/0x560
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x35/0x40
        Modules linked in:
        ---[ end trace c2425e6e89b5558f ]---
      
      [CAUSE]
      The offending csum tree looks like this:
      
        checksum tree key (CSUM_TREE ROOT_ITEM 0)
        node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
      	  ...
      	  key (EXTENT_CSUM EXTENT_CSUM 85975040) block 29630464 gen 17
      	  key (EXTENT_CSUM EXTENT_CSUM 89911296) block 29642752 gen 17 <<<
      	  key (EXTENT_CSUM EXTENT_CSUM 92274688) block 29646848 gen 17
      	  ...
      
        leaf 29630464 items 6 free space 1 generation 17 owner CSUM_TREE
      	  item 0 key (EXTENT_CSUM EXTENT_CSUM 85975040) itemoff 3987 itemsize 8
      		  range start 85975040 end 85983232 length 8192
      	  ...
        leaf 29642752 items 0 free space 3995 generation 17 owner 0
      		      ^ empty leaf            invalid owner ^
      
        leaf 29646848 items 1 free space 602 generation 17 owner CSUM_TREE
      	  item 0 key (EXTENT_CSUM EXTENT_CSUM 92274688) itemoff 627 itemsize 3368
      		  range start 92274688 end 95723520 length 3448832
      
      So we have a corrupted csum tree where one tree leaf is completely
      empty, causing an unbalanced btree and thus leading to an unexpected
      btree balance error.
      
      [FIX]
      For this particular case, we handle it in two directions to catch it:
      - Check if the tree block is empty through btrfs_verify_level_key()
        So that invalid tree blocks won't be read out through
        btrfs_search_slot() and its variants.
      
      - Check 0 tree owner in tree checker
        NO tree is using 0 as its tree owner, detect it and reject at tree
        block read time.
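
      The two checks above can be sketched as a single validation function.
      This is a simplified illustration, not the kernel implementation: the
      struct block_header layout, the function name and the parent_has_key
      parameter are hypothetical, and the EUCLEAN fallback is only for
      platforms whose errno.h lacks it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value ("Structure needs cleaning"); fallback only */
#endif

/* Hypothetical, reduced view of a tree block header. */
struct block_header {
    uint64_t owner;   /* id of the tree that owns this block */
    uint32_t nritems; /* number of items stored in the block */
};

/*
 * Reject a child block at read time if it cannot be valid:
 *  - owner 0 is not used by any tree;
 *  - a block that the parent points to with a key must hold at least
 *    one item, so an empty one means the tree is unbalanced.
 */
static int verify_child_block(const struct block_header *hdr,
                              int parent_has_key)
{
    assert(hdr != NULL);
    if (hdr->owner == 0)
        return -EUCLEAN;
    if (parent_has_key && hdr->nritems == 0)
        return -EUCLEAN;
    return 0;
}
```

      Applied to the dump above, leaf 29642752 (items 0, owner 0) fails
      both conditions, so it would be rejected before any btree balancing
      code runs on it.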
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202821
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      62fdaa52
    • Nikolay Borisov's avatar
      btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag · ebc87351
      Nikolay Borisov authored
      Support for asynchronous snapshot creation was originally added in
      72fd032e ("Btrfs: add SNAP_CREATE_ASYNC ioctl") to cater for
      Ceph's backend needs. However, since Ceph has deprecated support for
      btrfs, there is no longer a need for that support in btrfs. Additionally,
      this was never supported by btrfs-progs, the official userspace tools.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ebc87351