1. 17 Dec, 2018 40 commits
    • Filipe Manana's avatar
      Btrfs: use generic_remap_file_range_prep() for cloning and deduplication · 34a28e3d
      Filipe Manana authored
      Since cloning and deduplication are no longer Btrfs specific operations, we
      now have generic code to handle parameter validation, compare file ranges
      used for deduplication, clear capabilities when cloning, etc. This change
      makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a
      few bugs, such as:
      
      1) When cloning, the destination file's capabilities were not dropped
         (the fstest generic/513 tests this);
      
      2) We were not checking if the destination file is immutable;
      
      3) Not checking if either the source or destination files are swap
         files (swap file support is coming soon for Btrfs);
      
      4) System limits were not checked (resource limits and O_LARGEFILE).
      
      Note that the generic helper generic_remap_file_range_prep() does start
      and waits for writeback by calling filemap_write_and_wait_range(), however
      that is not enough for Btrfs for two reasons:
      
      1) With compression, we need to start writeback twice in order to get the
         pages marked for writeback and ordered extents created;
      
      2) filemap_write_and_wait_range() (and all its other variants) only waits
         for the IO to complete, but we need to wait for the ordered extents to
         finish, so that when we do the actual reflinking operations the file
         extent items are in the fs tree. This is also important due to the fact
         that the generic helper, for the deduplication case, compares the
         contents of the pages in the requested range, which might require
         reading extents from disk in the very unlikely case that pages get
         invalidated after writeback finishes (so the file extent items must be
         up to date in the fs tree).
      
      Since these reasons are specific to Btrfs we have to do it in the Btrfs
      code before calling generic_remap_file_range_prep(). This also results
      in a simpler way of dealing with existing delalloc in the source/target
      ranges, specially for the deduplication case where we used to lock all
      the pages first and then if we found any dealloc for the range, or
      ordered extent, we would unlock the pages trigger writeback and wait for
      ordered extents to complete, then lock all the pages again and check if
      deduplication can be done. So now we get a simpler approach: lock the
      inodes, then trigger writeback and then wait for ordered extents to
      complete.
      
      So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it)
      to eliminate duplicated code, fix a few bugs and benefit from future bug
      fixes done there - for example the recent clone and dedupe bugs involving
      reflinking a partial EOF block got a counterpart fix in the generic
      helper, since it affected all filesystems supporting these operations,
      so we no longer need special checks in Btrfs for them.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      34a28e3d
    • Nikolay Borisov's avatar
      btrfs: Refactor main loop in extent_readpages · 61ed3a14
      Nikolay Borisov authored
      extent_readpages processes all pages in the readlist in batches of 16,
      this is implemented by a single for loop but thanks to an if condition
      the loop does 2 things based on whether we've filled the batch or not.
      Additionally due to the structure of the code there is an additional
      check which deals with partial batches.
      
      Streamline all of this by explicitly using two loops. The outter one is
      used to process all pages while the inner one just fills in the batch
      of 16 (currently). Due to this new structure the code guarantees that
      all pages are processed in the loop hence the code to deal with any
      leftovers is eliminated.
      
      This also enable the compiler to inline __extent_readpages:
      
      	./scripts/bloat-o-meter fs/btrfs/extent_io.o extent_io.for
      
      	add/remove: 0/1 grow/shrink: 1/0 up/down: 660/-820 (-160)
      	Function                                     old     new   delta
      	extent_readpages                             476    1136    +660
      	__extent_readpages                           820       -    -820
      	Total: Before=44315, After=44155, chg -0.36%
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      61ed3a14
    • Nikolay Borisov's avatar
      btrfs: Remove 1st shrink/grow phase from balance · 15c82763
      Nikolay Borisov authored
      The first step of the rebalance process ensures there is 1MiB free on
      each device. This number seems rather small. And in fact when talking to
      the original authors their opinions were:
      
      "man that's a little bonkers"
      "i don't think we even need that code anymore"
      "I think it was there to make sure we had room for the blank 1M at the
      beginning. I bet it goes all the way back to v0"
      "we just don't need any of that tho, i say we just delete it"
      
      Clearly, this piece of code has lost its original intent throughout the
      years. It doesn't really bring any real practical benefits to the
      relocation process.
      
      Additionally, this patch makes the balance process more lightweight by
      removing a pair of shrink/grow operations which are rather expensive for
      heavily populated filesystems. This is mainly due to shrink requiring
      relocating block groups, involving heavy use of the btree.
      
      The intermediate shrink/grow can fail and leave the filesystem in a
      middle state that would need to be changed back by the user.
      Suggested-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      15c82763
    • Filipe Manana's avatar
      Btrfs: send, fix race with transaction commits that create snapshots · be6821f8
      Filipe Manana authored
      If we create a snapshot of a snapshot currently being used by a send
      operation, we can end up with send failing unexpectedly (returning
      -ENOENT error to user space for example). The following diagram shows
      how this happens.
      
                  CPU 1                                   CPU2                                CPU3
      
       btrfs_ioctl_send()
        (...)
                                           create_snapshot()
                                            -> creates snapshot of a
                                               root used by the send
                                               task
                                            btrfs_commit_transaction()
                                             create_pending_snapshot()
        __get_inode_info()
         btrfs_search_slot()
          btrfs_search_slot_get_root()
           down_read commit_root_sem
      
           get reference on eb of the
           commit root
            -> eb with bytenr == X
      
           up_read commit_root_sem
      
                                              btrfs_cow_block(root node)
                                               btrfs_free_tree_block()
                                                -> creates delayed ref to
                                                   free the extent
      
                                             btrfs_run_delayed_refs()
                                              -> runs the delayed ref,
                                                 adds extent to
                                                 fs_info->pinned_extents
      
                                             btrfs_finish_extent_commit()
                                              unpin_extent_range()
                                               -> marks extent as free
                                                  in the free space cache
      
                                            transaction commit finishes
      
                                                                             btrfs_start_transaction()
                                                                              (...)
                                                                              btrfs_cow_block()
                                                                               btrfs_alloc_tree_block()
                                                                                btrfs_reserve_extent()
                                                                                 -> allocates extent at
                                                                                    bytenr == X
                                                                                btrfs_init_new_buffer(bytenr X)
                                                                                 btrfs_find_create_tree_block()
                                                                                  alloc_extent_buffer(bytenr X)
                                                                                   find_extent_buffer(bytenr X)
                                                                                    -> returns existing eb,
                                                                                       which the send task got
      
                                                                              (...)
                                                                               -> modifies content of the
                                                                                  eb with bytenr == X
      
          -> uses an eb that now
             belongs to some other
             tree and no more matches
             the commit root of the
             snapshot, resuts will be
             unpredictable
      
      The consequences of this race can be various, and can lead to searches in
      the commit root performed by the send task failing unexpectedly (unable to
      find inode items, returning -ENOENT to user space, for example) or not
      failing because an inode item with the same number was added to the tree
      that reused the metadata extent, in which case send can behave incorrectly
      in the worst case or just fail later for some reason.
      
      Fix this by performing a copy of the commit root's extent buffer when doing
      a search in the context of a send operation.
      
      CC: stable@vger.kernel.org # 4.4.x: 1fc28d8e: Btrfs: move get root out of btrfs_search_slot to a helper
      CC: stable@vger.kernel.org # 4.4.x: f9ddfd05: Btrfs: remove unused check of skip_locking
      CC: stable@vger.kernel.org # 4.4.x
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      be6821f8
    • Filipe Manana's avatar
      Btrfs: use nofs context when initializing security xattrs to avoid deadlock · 827aa18e
      Filipe Manana authored
      When initializing the security xattrs, we are holding a transaction handle
      therefore we need to use a GFP_NOFS context in order to avoid a deadlock
      with reclaim in case it's triggered.
      
      Fixes: 39a27ec1 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      827aa18e
    • Josef Bacik's avatar
      btrfs: run delayed items before dropping the snapshot · 0568e82d
      Josef Bacik authored
      With my delayed refs patches in place we started seeing a large amount
      of aborts in __btrfs_free_extent:
      
       BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964  owner 1 offset 0
       Call Trace:
        ? btrfs_merge_delayed_refs+0xaf/0x340
        __btrfs_run_delayed_refs+0x6ea/0xfc0
        ? btrfs_set_path_blocking+0x31/0x60
        btrfs_run_delayed_refs+0xeb/0x180
        btrfs_commit_transaction+0x179/0x7f0
        ? btrfs_check_space_for_delayed_refs+0x30/0x50
        ? should_end_transaction.isra.19+0xe/0x40
        btrfs_drop_snapshot+0x41c/0x7c0
        btrfs_clean_one_deleted_snapshot+0xb5/0xd0
        cleaner_kthread+0xf6/0x120
        kthread+0xf8/0x130
        ? btree_invalidatepage+0x90/0x90
        ? kthread_bind+0x10/0x10
        ret_from_fork+0x35/0x40
      
      This was because btrfs_drop_snapshot depends on the root not being
      modified while it's dropping the snapshot.  It will unlock the root node
      (and really every node) as it walks down the tree, only to re-lock it
      when it needs to do something.  This is a problem because if we modify
      the tree we could cow a block in our path, which frees our reference to
      that block.  Then once we get back to that shared block we'll free our
      reference to it again, and get ENOENT when trying to lookup our extent
      reference to that block in __btrfs_free_extent.
      
      This is ultimately happening because we have delayed items left to be
      processed for our deleted snapshot _after_ all of the inodes are closed
      for the snapshot.  We only run the delayed inode item if we're deleting
      the inode, and even then we do not run the delayed insertions or delayed
      removals.  These can be run at any point after our final inode does its
      last iput, which is what triggers the snapshot deletion.  We can end up
      with the snapshot deletion happening and then have the delayed items run
      on that file system, resulting in the above problem.
      
      This problem has existed forever, however my patches made it much easier
      to hit as I wake up the cleaner much more often to deal with delayed
      iputs, which made us more likely to start the snapshot dropping work
      before the transaction commits, which is when the delayed items would
      generally be run.  Before, generally speaking, we would run the delayed
      items, commit the transaction, and wakeup the cleaner thread to start
      deleting snapshots, which means we were less likely to hit this problem.
      You could still hit it if you had multiple snapshots to be deleted and
      ended up with lots of delayed items, but it was definitely harder.
      
      Fix for now by simply running all the delayed items before starting to
      drop the snapshot.  We could make this smarter in the future by making
      the delayed items per-root, and then simply drop any delayed items for
      roots that we are going to delete.  But for now just a quick and easy
      solution is the safest.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0568e82d
    • Josef Bacik's avatar
      btrfs: catch cow on deleting snapshots · 83354f07
      Josef Bacik authored
      When debugging some weird extent reference bug I suspected that we were
      changing a snapshot while we were deleting it, which could explain my
      bug.  This was indeed what was happening, and this patch helped me
      verify my theory.  It is never correct to modify the snapshot once it's
      being deleted, so mark the root when we are deleting it and make sure we
      complain about it when it happens.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      83354f07
    • Qu Wenruo's avatar
      btrfs: extent-tree: cleanup one-shot usage of @blocksize in do_walk_down · 01e0da48
      Qu Wenruo authored
      @blocksize variable in do_walk_down() is only used once, really no need
      to declare it.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      01e0da48
    • Filipe Manana's avatar
      Btrfs: scrub, move setup of nofs contexts higher in the stack · 7c3c7cb9
      Filipe Manana authored
      Since scrub workers only do memory allocation with GFP_KERNEL when they
      need to perform repair, we can move the recent setup of the nofs context
      up to scrub_handle_errored_block() instead of setting it up down the call
      chain at insert_full_stripe_lock() and scrub_add_page_to_wr_bio(),
      removing some duplicate code and comment. So the only paths for which a
      scrub worker can do memory allocations using GFP_KERNEL are the following:
      
       scrub_bio_end_io_worker()
         scrub_block_complete()
           scrub_handle_errored_block()
             lock_full_stripe()
               insert_full_stripe_lock()
                 -> kmalloc with GFP_KERNEL
      
        scrub_bio_end_io_worker()
          scrub_block_complete()
            scrub_handle_errored_block()
              scrub_write_page_to_dev_replace()
                scrub_add_page_to_wr_bio()
                  -> kzalloc with GFP_KERNEL
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7c3c7cb9
    • David Sterba's avatar
      btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex · 0e94c4f4
      David Sterba authored
      The scrub context is allocated with GFP_KERNEL and called from
      btrfs_scrub_dev under the fs_info::device_list_mutex. This is not safe
      regarding reclaim that could try to flush filesystem data in order to
      get the memory. And the device_list_mutex is held during superblock
      commit, so this would cause a lockup.
      
      Move the alocation and initialization before any changes that require
      the mutex.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0e94c4f4
    • David Sterba's avatar
      btrfs: scrub: pass fs_info to scrub_setup_ctx · 92f7ba43
      David Sterba authored
      We can pass fs_info directly as this is the only member of btrfs_device
      that's bing used inside scrub_setup_ctx.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      92f7ba43
    • Josef Bacik's avatar
      btrfs: fix truncate throttling · 28bad212
      Josef Bacik authored
      We have a bunch of magic to make sure we're throttling delayed refs when
      truncating a file.  Now that we have a delayed refs rsv and a mechanism
      for refilling that reserve simply use that instead of all of this magic.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      28bad212
    • Josef Bacik's avatar
      btrfs: don't run delayed refs in the end transaction logic · db2462a6
      Josef Bacik authored
      Over the years we have built up a lot of infrastructure to keep delayed
      refs in check, mostly by running them at btrfs_end_transaction() time.
      We have a lot of different maths we do to figure out how much, if we
      should do it inline or async, etc.  This existed because we had no
      feedback mechanism to force the flushing of delayed refs when they
      became a problem.  However with the enospc flushing infrastructure in
      place for flushing delayed refs when they put too much pressure on the
      enospc system we have this problem solved.  Rip out all of this code as
      it is no longer needed.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      db2462a6
    • Josef Bacik's avatar
      btrfs: rework btrfs_check_space_for_delayed_refs · 64403612
      Josef Bacik authored
      Now with the delayed_refs_rsv we can now know exactly how much pending
      delayed refs space we need.  This means we can drastically simplify
      btrfs_check_space_for_delayed_refs by simply checking how much space we
      have reserved for the global rsv (which acts as a spill over buffer) and
      the delayed refs rsv.  If our total size is beyond that amount then we
      know it's time to commit the transaction and stop any more delayed refs
      from being generated.
      
      With the introduction of dealyed_refs_rsv infrastructure, namely
      btrfs_update_delayed_refs_rsv we now know exactly how much pending
      delayed refs space is required.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      64403612
    • Josef Bacik's avatar
      btrfs: add new flushing states for the delayed refs rsv · 413df725
      Josef Bacik authored
      A nice thing we gain with the delayed refs rsv is the ability to flush
      the delayed refs on demand to deal with enospc pressure.  Add states to
      flush delayed refs on demand, and this will allow us to remove a lot of
      ad-hoc work around checking to see if we should commit the transaction
      to run our delayed refs.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      413df725
    • Josef Bacik's avatar
      btrfs: update may_commit_transaction to use the delayed refs rsv · 4c8edbc7
      Josef Bacik authored
      Any space used in the delayed_refs_rsv will be freed up by a transaction
      commit, so instead of just counting the pinned space we also need to
      account for any space in the delayed_refs_rsv when deciding if it will
      make a different to commit the transaction to satisfy our space
      reservation.  If we have enough bytes to satisfy our reservation ticket
      then we are good to go, otherwise subtract out what space we would gain
      back by committing the transaction and compare that against the pinned
      space to make our decision.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c8edbc7
    • Josef Bacik's avatar
      btrfs: introduce delayed_refs_rsv · ba2c4d4e
      Josef Bacik authored
      Traditionally we've had voodoo in btrfs to account for the space that
      delayed refs may take up by having a global_block_rsv.  This works most
      of the time, except when it doesn't.  We've had issues reported and seen
      in production where sometimes the global reserve is exhausted during
      transaction commit before we can run all of our delayed refs, resulting
      in an aborted transaction.  Because of this voodoo we have equally
      dubious flushing semantics around throttling delayed refs which we often
      get wrong.
      
      So instead give them their own block_rsv.  This way we can always know
      exactly how much outstanding space we need for delayed refs.  This
      allows us to make sure we are constantly filling that reservation up
      with space, and allows us to put more precise pressure on the enospc
      system.  Instead of doing math to see if its a good time to throttle,
      the normal enospc code will be invoked if we have a lot of delayed refs
      pending, and they will be run via the normal flushing mechanism.
      
      For now the delayed_refs_rsv will hold the reservations for the delayed
      refs, the block group updates, and deleting csums.  We could have a
      separate rsv for the block group updates, but the csum deletion stuff is
      still handled via the delayed_refs so that will stay there.
      
      Historical background:
      
      The global reserve has grown to cover everything we don't reserve space
      explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
      if we're running short on space and when it's time to force a commit.  A
      failure rate of 20-40 file systems when we run hundreds of thousands of
      them isn't super high, but cleaning up this code will make things less
      ugly and more predictible.
      
      Thus the delayed refs rsv.  We always know how many delayed refs we have
      outstanding, and although running them generates more we can use the
      global reserve for that spill over, which fits better into it's desired
      use than a full blown reservation.  This first approach is to simply
      take how many times we're reserving space for and multiply that by 2 in
      order to save enough space for the delayed refs that could be generated.
      This is a niave approach and will probably evolve, but for now it works.
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
      [ added background notes from the cover letter ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ba2c4d4e
    • Josef Bacik's avatar
      btrfs: only track ref_heads in delayed_ref_updates · 158ffa36
      Josef Bacik authored
      We use this number to figure out how many delayed refs to run, but
      __btrfs_run_delayed_refs really only checks every time we need a new
      delayed ref head, so we always run at least one ref head completely no
      matter what the number of items on it.  Fix the accounting to only be
      adjusted when we add/remove a ref head.
      
      In addition to using this number to limit the number of delayed refs
      run, a future patch is also going to use it to calculate the amount of
      space required for delayed refs space reservation.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      158ffa36
    • Josef Bacik's avatar
      btrfs: cleanup extent_op handling · bedc6617
      Josef Bacik authored
      The cleanup_extent_op function actually would run the extent_op if it
      needed running, which made the name sort of a misnomer.  Change it to
      run_and_cleanup_extent_op, and move the actual cleanup work to
      cleanup_extent_op so it can be used by check_ref_cleanup() in order to
      unify the extent op handling.
      Reviewed-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bedc6617
    • Josef Bacik's avatar
      btrfs: add cleanup_ref_head_accounting helper · 07c47775
      Josef Bacik authored
      We were missing some quota cleanups in check_ref_cleanup, so break the
      ref head accounting cleanup into a helper and call that from both
      check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
      we don't screw up accounting in the future for other things that we add.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      07c47775
    • Josef Bacik's avatar
      btrfs: add btrfs_delete_ref_head helper · d7baffda
      Josef Bacik authored
      We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
      into a helper and cleanup the calling functions.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d7baffda
    • Johannes Thumshirn's avatar
      btrfs: use PAGE_ALIGNED instead of open-coding it · fdb1e121
      Johannes Thumshirn authored
      When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
      alignment and thus should use the PAGE_ALIGNED() macro instead of
      open-coding it.
      
      Convert all open-coded occurrences of PAGE_ALIGNED().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fdb1e121
    • Johannes Thumshirn's avatar
      btrfs: use offset_in_page instead of open-coding it · 7073017a
      Johannes Thumshirn authored
      Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
      offset into a page.
      
      So replace them by the offset_in_page() macro instead of open-coding it if
      they're not used as an alignment check.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7073017a
    • David Sterba's avatar
      btrfs: dev-replace: open code trivial locking helpers · cb5583dd
      David Sterba authored
      The dev-replace locking functions are now trivial wrappers around rw
      semaphore that can be used directly everywhere. No functional change.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb5583dd
    • David Sterba's avatar
      btrfs: dev-replace: remove custom read/write blocking scheme · 53176dde
      David Sterba authored
      After the rw semaphore has been added, the custom blocking using
      ::blocking_readers and ::read_lock_wq is redundant.
      
      The blocking logic in __btrfs_map_block is replaced by extending the
      time the semaphore is held, that has the same blocking effect on writes
      as the previous custom scheme that waited until ::blocking_readers was
      zero.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      53176dde
    • David Sterba's avatar
      btrfs: dev-replace: swich locking to rw semaphore · 129827e3
      David Sterba authored
      This is the first part of removing the custom locking and waiting scheme
      used for device replace. It was probably copied from extent buffer
      locking, but there's nothing that would require more than is provided by
      the common locking primitives.
      
      The rw spinlock protects waiting tasks counter in case of incompatible
      locks and the waitqueue. Same as rw semaphore.
      
      This patch only switches the locking primitive, for better
      bisectability.  There should be no functional change other than the
      overhead of the locking and potential sleeping instead of spinning when
      the lock is contended.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      129827e3
    • David Sterba's avatar
      btrfs: reada: reorder dev-replace locks before radix tree preload · ceb21a8d
      David Sterba authored
      The device-replace read lock is going to use rw semaphore in followup
      commits. The semaphore might sleep which is not possible in the radix
      tree preload section. The lock nesting is now:
      
      * device replace
        * radix tree preload
          * readahead spinlock
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ceb21a8d
    • Nikolay Borisov's avatar
      btrfs: Fix error handling in btrfs_cleanup_ordered_extents · d1051d6e
      Nikolay Borisov authored
      Running btrfs/124 in a loop hung up on me sporadically with the
      following call trace:
      
      	btrfs           D    0  5760   5324 0x00000000
      	Call Trace:
      	 ? __schedule+0x243/0x800
      	 schedule+0x33/0x90
      	 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
      	 ? wait_woken+0xa0/0xa0
      	 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
      	 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
      	 btrfs_relocate_chunk+0x49/0x100 [btrfs]
      	 btrfs_balance+0xbeb/0x1740 [btrfs]
      	 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
      	 btrfs_ioctl+0x1691/0x3110 [btrfs]
      	 ? lockdep_hardirqs_on+0xed/0x180
      	 ? __handle_mm_fault+0x8e7/0xfb0
      	 ? _raw_spin_unlock+0x24/0x30
      	 ? __handle_mm_fault+0x8e7/0xfb0
      	 ? do_vfs_ioctl+0xa5/0x6e0
      	 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      	 do_vfs_ioctl+0xa5/0x6e0
      	 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
      	 ksys_ioctl+0x3a/0x70
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x60/0x1b0
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This happens because during page writeback it's valid for
      writepage_delalloc to instantiate a delalloc range which doesn't belong
      to the page currently being written back.
      
      The reason this case is valid is due to find_lock_delalloc_range
      returning any available range after the passed delalloc_start and
      ignoring whether the page under writeback is within that range.
      
      In turn ordered extents (OE) are always created for the returned range
      from find_lock_delalloc_range. If, however, a failure occurs while OE
      are being created then the clean up code in btrfs_cleanup_ordered_extents
      will be called.
      
      Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
      the case of such 'foreign' range being processed and instead it always
      assumes that the range OE are created for belongs to the page. This
      leads to the first page of such foregin range to not be cleaned up since
      it's deliberately missed and skipped by the current cleaning up code.
      
      Fix this by correctly checking whether the current page belongs to the
      range being instantiated and if so adjsut the range parameters passed
      for cleaning up. If it doesn't, then just clean the whole OE range
      directly.
      
      Fixes: 52427260 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d1051d6e
    • Lu Fengqi's avatar
      btrfs: remove always true if branch in find_delalloc_range · 3522e903
      Lu Fengqi authored
      The @found is always false when it comes to the if branch. Besides, the
      bool type is more suitable for @found. Change the return value of the
      function and its caller to bool as well.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3522e903
    • Lu Fengqi's avatar
      btrfs: skip file_extent generation check for free_space_inode in run_delalloc_nocow · 27a7ff55
      Lu Fengqi authored
      The test case btrfs/001 with inode_cache mount option will encounter the
      following warning:
      
        WARNING: CPU: 1 PID: 23700 at fs/btrfs/inode.c:956 cow_file_range.isra.19+0x32b/0x430 [btrfs]
        CPU: 1 PID: 23700 Comm: btrfs Kdump: loaded Tainted: G        W  O      4.20.0-rc4-custom+ #30
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:cow_file_range.isra.19+0x32b/0x430 [btrfs]
        Call Trace:
         ? free_extent_buffer+0x46/0x90 [btrfs]
         run_delalloc_nocow+0x455/0x900 [btrfs]
         btrfs_run_delalloc_range+0x1a7/0x360 [btrfs]
         writepage_delalloc+0xf9/0x150 [btrfs]
         __extent_writepage+0x125/0x3e0 [btrfs]
         extent_write_cache_pages+0x1b6/0x3e0 [btrfs]
         ? __wake_up_common_lock+0x63/0xc0
         extent_writepages+0x50/0x80 [btrfs]
         do_writepages+0x41/0xd0
         ? __filemap_fdatawrite_range+0x9e/0xf0
         __filemap_fdatawrite_range+0xbe/0xf0
         btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
         __btrfs_write_out_cache+0x42c/0x480 [btrfs]
         btrfs_write_out_ino_cache+0x84/0xd0 [btrfs]
         btrfs_save_ino_cache+0x551/0x660 [btrfs]
         commit_fs_roots+0xc5/0x190 [btrfs]
         btrfs_commit_transaction+0x2bf/0x8d0 [btrfs]
         btrfs_mksubvol+0x48d/0x4d0 [btrfs]
         btrfs_ioctl_snap_create_transid+0x170/0x180 [btrfs]
         btrfs_ioctl_snap_create_v2+0x124/0x180 [btrfs]
         btrfs_ioctl+0x123f/0x3030 [btrfs]
      
      The file extent generation of the free space inode is equal to the last
      snapshot of the file root, so the inode will be passed to cow_file_rage.
      But the inode was created and its extents were preallocated in
      btrfs_save_ino_cache, there are no cow copies on disk.
      
      The preallocated extent is not yet in the extent tree, and
      btrfs_cross_ref_exist will ignore the -ENOENT returned by
      check_committed_ref, so we can directly write the inode to the disk.
      
      Fixes: 78d4295b ("btrfs: lift some btrfs_cross_ref_exist checks in nocow path")
      CC: stable@vger.kernel.org # 4.18+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      27a7ff55
    • Filipe Manana's avatar
      Btrfs: fix fsync of files with multiple hard links in new directories · 41bd6067
      Filipe Manana authored
      The log tree has a long standing problem that when a file is fsync'ed we
      only check for new ancestors, created in the current transaction, by
      following only the hard link for which the fsync was issued. We follow the
      ancestors using the VFS' dget_parent() API. This means that if we create a
      new link for a file in a directory that is new (or in an any other new
      ancestor directory) and then fsync the file using an old hard link, we end
      up not logging the new ancestor, and on log replay that new hard link and
      ancestor do not exist. In some cases, involving renames, the file will not
      exist at all.
      
      Example:
      
        mkfs.btrfs -f /dev/sdb
        mount /dev/sdb /mnt
      
        mkdir /mnt/A
        touch /mnt/foo
        ln /mnt/foo /mnt/A/bar
        xfs_io -c fsync /mnt/foo
      
        <power failure>
      
      In this example after log replay only the hard link named 'foo' exists
      and directory A does not exist, which is unexpected. In other major linux
      filesystems, such as ext4, xfs and f2fs for example, both hard links exist
      and so does directory A after mounting again the filesystem.
      
      Checking if any new ancestors are new and need to be logged was added in
      2009 by commit 12fcfd22 ("Btrfs: tree logging unlink/rename fixes"),
      however only for the ancestors of the hard link (dentry) for which the
      fsync was issued, instead of checking for all ancestors for all of the
      inode's hard links.
      
      So fix this by tracking the id of the last transaction where a hard link
      was created for an inode and then on fsync fallback to a full transaction
      commit when an inode has more than one hard link and at least one new hard
      link was created in the current transaction. This is the simplest solution
      since this is not a common use case (adding frequently hard links for
      which there's an ancestor created in the current transaction and then
      fsync the file). In case it ever becomes a common use case, a solution
      that consists of iterating the fs/subvol btree for each hard link and
      check if any ancestor is new, could be implemented.
      
      This solves many unexpected scenarios reported by Jayashree Mohan and
      Vijay Chidambaram, and for which there is a new test case for fstests
      under review.
      
      Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes")
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: default avatarVijay Chidambaram <vvijay03@gmail.com>
      Reported-by: default avatarJayashree Mohan <jayashree2912@gmail.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      41bd6067
    • David Sterba's avatar
      btrfs: drop extra enum initialization where using defaults · bbe339cc
      David Sterba authored
      The first auto-assigned value to enum is 0, we can use that and not
      initialize all members where the auto-increment does the same. This is
      used for values that are not part of on-disk format.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbe339cc
    • David Sterba's avatar
      btrfs: switch BTRFS_ORDERED_* to enums · 5b840301
      David Sterba authored
      We can use simple enum for values that are not part of on-disk format:
      ordered extent flags.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5b840301
    • David Sterba's avatar
      btrfs: switch EXTENT_FLAG_* to enums · 50b5b602
      David Sterba authored
      We can use simple enum for values that are not part of on-disk format:
      extent map flags.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      50b5b602
    • David Sterba's avatar
      btrfs: switch EXTENT_BUFFER_* to enums · 80cb3836
      David Sterba authored
      We can use simple enum for values that are not part of on-disk format:
      extent buffer flags;
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      80cb3836
    • David Sterba's avatar
      btrfs: switch BTRFS_ROOT_* to enums · 61fa90c1
      David Sterba authored
      We can use simple enum for values that are not part of on-disk format:
      root tree flags.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      61fa90c1
    • David Sterba's avatar
      btrfs: switch BTRFS_FS_* to enums · eb1a524c
      David Sterba authored
      We can use simple enum for values that are not part of on-disk format:
      internal filesystem states.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eb1a524c
    • David Sterba's avatar
      btrfs: switch BTRFS_BLOCK_RSV_* to enums · 688a75b9
      David Sterba authored
      We can use simple enum for values that are not part of on-disk format:
      block reserve types.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      688a75b9
    • David Sterba's avatar
      btrfs: switch BTRFS_FS_STATE_* to enums · b00146b5
      David Sterba authored
      We can use simple enum for values that are not part of on-disk format:
      global filesystem states.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b00146b5
    • Nikolay Borisov's avatar
      btrfs: Refactor btrfs_merge_bio_hook · da12fe54
      Nikolay Borisov authored
      This function really checks whether adding more data to the bio will
      straddle a stripe/chunk. So first let's give it a more appropraite name
      - btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
      used to just remove it. Thirdly, pages are submitted to either btree or
      data inodes so it's guaranteed that tree->ops is set so replace the
      check with an ASSERT. Finally, document the parameters of the function.
      No functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da12fe54