1. 07 Apr, 2014 4 commits
    • Btrfs: remove transaction from send · 9e351cc8
      Josef Bacik authored
      Let's try this again.  We can deadlock the box if we send on a box and try to
      write onto the same fs with the app that is trying to listen to the send pipe.
      This is because the writer could get stuck waiting for a transaction commit
      which is being blocked by the send.  So fix this by making sure looking at the
      commit roots is always going to be consistent.  We do this by keeping track of
      which roots need to have their commit roots swapped during commit, and then
      taking the commit_root_sem and swapping them all at once.  Then make sure we
      take a read lock on the commit_root_sem in cases where we search the commit root
      to make sure we're always looking at a consistent view of the commit roots.
      Previously we had problems with this because we would swap a fs tree commit root
      and then swap the extent tree commit root independently which would cause the
      backref walking code to screw up sometimes.  With this patch we no longer
      deadlock and pass all the weird send/receive corner cases.  Thanks,
      Reported-by: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: don't clear uptodate if the eb is under IO · a26e8c9f
      Josef Bacik authored
      So I have an awful exercise script that will run snapshot, balance and
      send/receive in parallel.  This sometimes would crash spectacularly and when it
      came back up the fs would be completely hosed.  Turns out this is because of a
      bad interaction of balance and send/receive.  Send will hold onto its entire
      path for the whole send, but its blocks could get relocated out from underneath
      it, and because it doesn't hold tree locks there's nothing to keep this from
      happening.  So it will go to read in a slot with an old transid, and we could
      have re-allocated this block for something else and it could have a completely
      different transid.  But because we think it is invalid we clear uptodate and
      re-read in the block.  If we do this before we actually write out the new block
      we could write back stale data to the fs, and boom we're screwed.
      
      Now we definitely need to fix this disconnect between send and balance, but we
      really really need to not allow ourselves to accidentally read in stale data over
      new data.  So make sure we check if the extent buffer is not under io before
      clearing uptodate, this will kick back EIO to the caller instead of reading in
      stale data and keep us from corrupting the fs.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
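The check described in the commit above can be modelled roughly as follows. This is a simplified sketch, not the kernel function: `demo_eb`, `demo_check_transid` and the return convention (0 for a match, 1 for "stale, safe to re-read", `-EIO` for "stale but under IO") are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a cached extent buffer. */
struct demo_eb {
	uint64_t generation; /* transid stamped in the cached block */
	bool uptodate;       /* cached contents are considered valid */
	bool under_io;       /* dirty or being written back */
};

/*
 * Compare the cached block's transid with the one the parent expects.
 * On a mismatch, only drop the uptodate flag when no IO is pending;
 * otherwise a re-read could pull the old on-disk copy over new data
 * that has not been written out yet.
 */
int demo_check_transid(struct demo_eb *eb, uint64_t parent_transid)
{
	assert(eb != NULL);

	if (eb->generation == parent_transid)
		return 0;

	if (eb->under_io)
		return -EIO; /* keep the in-memory copy, fail the lookup */

	eb->uptodate = false; /* safe: nothing newer is waiting in memory */
	return 1;
}
```

The caller sees an error in the under-IO case, which matches the message's trade-off: an EIO is preferable to corrupting the filesystem with stale data.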
    • Btrfs: check for an extent_op on the locked ref · 573a0755
      Josef Bacik authored
      We could have possibly added an extent_op to the locked_ref while we dropped
      locked_ref->lock, so check for this case as well and loop around.  Otherwise we
      could lose flag updates which would lead to extent tree corruption.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: do not reset last_snapshot after relocation · ba8b0289
      Josef Bacik authored
      This was done to allow NO_COW to continue to be NO_COW after relocation but it
      is not right.  When relocating we will convert blocks to FULL_BACKREF that we
      relocate.  We can leave some of these full backref blocks behind if they are not
      cow'ed out during the relocation, like if we fail the relocation with ENOSPC and
      then just drop the reloc tree.  Then when we go to cow the block again we won't
      lookup the extent flags because we won't think there has been a snapshot
      recently which means we will do our normal ref drop thing instead of adding back
      a tree ref and dropping the shared ref.  This will cause btrfs_free_extent to
      blow up because it can't find the ref we are trying to free.  This was found
      with my ref verifying tool.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 22 Mar, 2014 1 commit
    • Btrfs: fix a crash of clone with inline extents's split · 00fdf13a
      Liu Bo authored
      xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
      of inline extents in __btrfs_drop_extents().
      
      For inline extents, we cannot duplicate another EXTENT_DATA item, because
      it breaks the rule of inline extents, that is, 'start offset' needs to be 0.
      
      We have set limitations for the source inode's compressed inline extents,
      because it needs to decompress and recompress.  Now the destination inode's
      inline extents also need similar limitations.
      
      With this, xfstests btrfs/035 doesn't run into panic.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  3. 21 Mar, 2014 12 commits
    • btrfs: fix uninit variable warning · 73b802f4
      Chris Mason authored
      fs/btrfs/send.c:2926: warning: ‘entry’ may be used uninitialized in this
      function
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: take into account total references when doing backref lookup · 44853868
      Josef Bacik authored
      I added an optimization for large files where we would stop searching for
      backrefs once we had looked at the number of references we currently had for
      this extent.  This works great most of the time, but for snapshots that point to
      this extent and have changes in the original root this assumption falls on its
      face.  So keep track of any delayed ref mods made and add in the actual ref
      count as reported by the extent item and use that to limit how far down an inode
      we'll search for extents.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reported-by: Hugo Mills <hugo@carfax.org.uk>
      Tested-by: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: part 2, fix incremental send's decision to delay a dir move/rename · bfa7e1f8
      Filipe Manana authored
      For an incremental send, fix the process of determining whether the directory
      inode we're currently processing needs to have its move/rename operation delayed.
      
      We were ignoring the fact that if the inode's new immediate ancestor has a higher
      inode number than ours but wasn't renamed/moved, we might still need to delay our
      move/rename, because some other ancestor directory higher in the hierarchy might
      have an inode number higher than ours *and* was renamed/moved too - in this case
      we have to wait for rename/move of that ancestor to happen before our current
      directory's rename/move operation.
      
      Simple steps to reproduce this issue:
      
            $ mkfs.btrfs -f /dev/sdd
            $ mount /dev/sdd /mnt
      
            $ mkdir -p /mnt/a/x1/x2
            $ mkdir /mnt/a/Z
            $ mkdir -p /mnt/a/x1/x2/x3/x4/x5
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap1
            $ btrfs send /mnt/snap1 -f /tmp/base.send
      
            $ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33
            $ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap2
            $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      The incremental send caused the kernel code to enter an infinite loop when
      building the path string for directory Z after its references are processed.
      
      A more complex scenario:
      
            $ mkfs.btrfs -f /dev/sdd
            $ mount /dev/sdd /mnt
      
            $ mkdir -p /mnt/a/b/c/d
            $ mkdir /mnt/a/b/c/d/e
            $ mkdir /mnt/a/b/c/d/f
            $ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2
            $ mkdir /mnt/a/b/c/g
            $ mv /mnt/a/b/c/d /mnt/a/b/D2
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap1
            $ btrfs send /mnt/snap1 -f /tmp/base.send
      
            $ mkdir /mnt/a/o
            $ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2
            $ mv /mnt/a/b/D2 /mnt/a/b/dd
            $ mv /mnt/a/b/c /mnt/a/C2
            $ mv /mnt/a/b/dd/f /mnt/a/o/FF
            $ mv /mnt/a/b /mnt/a/o/FF/E2/BB
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap2
            $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      A test case for xfstests follows.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
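The ancestor walk that the "part 2" commit above describes can be approximated like this. It is a deliberately simplified model, not the send code: `demo_dir`, `demo_must_delay_move` and the small inode numbers (assigned in creation order for the first reproducer) are assumptions, and the real decision in the kernel depends on more state than a single `moved` flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-directory record for the snapshot being sent. */
struct demo_dir {
	unsigned int ino;        /* inode number */
	unsigned int new_parent; /* parent inode in the new snapshot */
	bool moved;              /* renamed/moved since the parent snapshot */
};

/* First reproducer, with illustrative inode numbers in creation order:
 * root=1, a=2, x1=3, x2=4, Z=5, x3=6, x4=7, x5=8. */
const struct demo_dir demo_repro[] = {
	{ 1, 1, false }, /* root is its own parent */
	{ 2, 1, false }, /* a */
	{ 3, 2, false }, /* x1 */
	{ 4, 8, true },  /* x2, now X22 under x5 */
	{ 5, 2, false }, /* Z */
	{ 6, 5, true },  /* x3, now X33 under Z */
	{ 7, 6, false }, /* x4 */
	{ 8, 7, false }, /* x5 */
};
const size_t demo_repro_len = sizeof(demo_repro) / sizeof(demo_repro[0]);

static const struct demo_dir *demo_find(const struct demo_dir *dirs,
					size_t n, unsigned int ino)
{
	for (size_t i = 0; i < n; i++)
		if (dirs[i].ino == ino)
			return &dirs[i];
	return NULL;
}

/*
 * Walk every new ancestor, not just the immediate parent. The move of
 * `ino` must wait if any ancestor has a higher inode number and was
 * itself renamed/moved, because that ancestor is processed later.
 */
bool demo_must_delay_move(const struct demo_dir *dirs, size_t n,
			  unsigned int ino)
{
	const struct demo_dir *cur;

	assert(dirs != NULL);
	cur = demo_find(dirs, n, ino);
	/* At most n steps, so malformed input cannot loop forever. */
	for (size_t step = 0; cur && step < n; step++) {
		const struct demo_dir *anc;

		if (cur->new_parent == cur->ino)
			break; /* reached the root */
		anc = demo_find(dirs, n, cur->new_parent);
		if (anc && anc->ino > ino && anc->moved)
			return true;
		cur = anc;
	}
	return false;
}
```

In this model, x2 must be delayed: its immediate new parent x5 has a higher inode number but was not moved, while x3 further up has a higher inode number and was renamed, which is the case the message says was previously ignored.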
    • Btrfs: fix incremental send's decision to delay a dir move/rename · 7b119a8b
      Filipe Manana authored
      It's possible to change the parent/child relationship between directories
      in such a way that if a child directory has a higher inode number than
      its parent, it doesn't necessarily mean the child rename/move operation
      can be performed immediately. The parent might have its own rename/move
      operation delayed, therefore in this case the child needs to have its
      rename/move operation delayed too, and be performed after its new parent's
      rename/move.
      
      Steps to reproduce the issue:
      
            $ umount /mnt
            $ mkfs.btrfs -f /dev/sdd
            $ mount /dev/sdd /mnt
      
            $ mkdir /mnt/A
            $ mkdir /mnt/B
            $ mkdir /mnt/C
            $ mv /mnt/C /mnt/A
            $ mv /mnt/B /mnt/A/C
            $ mkdir /mnt/A/C/D
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap1
            $ btrfs send /mnt/snap1 -f /tmp/base.send
      
            $ mv /mnt/A/C/D /mnt/A/D2
            $ mv /mnt/A/C/B /mnt/A/D2/B2
            $ mv /mnt/A/C /mnt/A/D2/B2/C2
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap2
            $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      The incremental send caused the kernel code to enter an infinite loop when
      building the path string for directory C after its references are processed.
      
      The necessary conditions here are that C has an inode number higher than both
      A and B, and B has a higher inode number than A, and D has the highest
      inode number, that is:
          inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D)
      
      The same issue could happen if after the first snapshot there's any number
      of intermediary parent directories between A2 and B2, and between B2 and C2.
      
      A test case for xfstests follows, covering this simple case and more advanced
      ones, with files and hard links created inside the directories.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: remove unnecessary inode generation lookup in send · 425b5daf
      Filipe Manana authored
      No need to search in the send tree for the generation number of the inode,
      we already have it in the recorded_ref structure passed to us.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix race when updating existing ref head · 21543bad
      Filipe Manana authored
      While we update an existing ref head's extent_op, we're not holding
      its spinlock, so while we're updating its extent_op contents (key,
      flags) we can have a task running __btrfs_run_delayed_refs() that
      holds the ref head's lock and sets its extent_op to NULL right after
      the task updating the ref head just checked its extent_op was not NULL.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Add trace for btrfs_workqueue alloc/destroy · c3a46891
      Qu Wenruo authored
      Since most of the btrfs_workqueue is printed as pointer address,
      for easier analysis, add trace for btrfs_workqueue alloc/destroy.
      So it is possible to determine the workqueue that a given work belongs
      to (by comparing the wq pointer address with alloc trace event).
      Signed-off-by: Qu Wenruo <quenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: less fs tree lock contention when using autodefrag · f094c9bd
      Filipe Manana authored
      When finding new extents during an autodefrag, don't do so many fs tree
      lookups to find an extent with a size smaller than the target threshold.
      Instead, after each fs tree forward search immediately unlock upper
      levels and process the entire leaf while holding a read lock on the leaf,
      since our leaf processing is very fast.
      This reduces lock contention, allowing for higher concurrency when other
      tasks want to write/update items related to other inodes in the fs tree,
      as we're not holding read locks on upper tree levels while processing the
      leaf and we do fewer tree searches.
      
      Test:
      
          sysbench --test=fileio --file-num=512 --file-total-size=16G \
             --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
             --file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \
             --max-requests=10000000000 [prepare|run]
      
      (filesystem mounted with -o autodefrag, averages of 5 runs)
      
      Before this change: 58.852Mb/sec throughput, read 77.589Gb, written 25.863Gb
      After this change:  63.034Mb/sec throughput, read 83.102Gb, written 27.701Gb
      
      Test machine: quad core intel i5-3570K, 32Gb of RAM, SSD.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: return EPERM when deleting a default subvolume · 72de6b53
      Guangyu Sun authored
      The error message is confusing:
      
       # btrfs sub delete /mnt/mysub/
       Delete subvolume '/mnt/mysub'
       ERROR: cannot delete '/mnt/mysub' - Directory not empty
      
      The error message does not make sense to me: It's not about deleting a
      directory but it's a subvolume, and it doesn't matter if the subvolume is
      empty or not.
      
      Maybe EPERM is more appropriate in this case, combined with an explanatory
      kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because
      it is configured as default subvolume.")
      Reported-by: Koen De Wit <koen.de.wit@oracle.com>
      Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
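A minimal sketch of the behaviour requested in the commit above: refuse to delete the default subvolume with `-EPERM` and produce an explanatory message. The function name, its parameters and the caller-supplied message buffer are hypothetical; the message text follows the wording suggested in the commit message, and the real kernel code would log it rather than return it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Decide whether a subvolume may be deleted.
 * Returns 0 when deletion may proceed, or -EPERM (with an explanation
 * written to msg) when the subvolume is the configured default.
 */
int demo_may_delete_subvol(uint64_t subvol_id, uint64_t default_subvol_id,
			   char *msg, size_t msg_len)
{
	assert(msg != NULL && msg_len > 0);
	msg[0] = '\0';

	if (subvol_id == default_subvol_id) {
		snprintf(msg, msg_len,
			 "subvolume with ID %llu cannot be deleted because "
			 "it is configured as default subvolume",
			 (unsigned long long)subvol_id);
		return -EPERM;
	}
	return 0;
}
```

Returning `-EPERM` instead of the "Directory not empty" error shown above tells the user that the refusal is a policy decision about the subvolume, not a statement about its contents.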
    • Btrfs: cache extent states in defrag code path · 308d9800
      Filipe Manana authored
      When locking file ranges in the inode's io_tree, cache the first
      extent state that belongs to the target range, so that when unlocking
      the range we don't need to search in the io_tree again, reducing CPU
      time and therefore holding the io_tree's lock for a shorter period.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix deadlock with nested trans handles · 3bbb24b2
      Josef Bacik authored
      Zach found this deadlock that would happen like this
      
      btrfs_end_transaction <- reduce trans->use_count to 0
        btrfs_run_delayed_refs
          btrfs_cow_block
            find_free_extent
              btrfs_start_transaction <- increase trans->use_count to 1
                allocate chunk
              btrfs_end_transaction <- decrease trans->use_count to 0
                btrfs_run_delayed_refs
                  lock tree block we are cowing above ^^
      
      We need to only decrease trans->use_count if it is above 1, otherwise leave it
      alone.  This will make nested trans be the only ones who decrease their added
      ref, and will let us get rid of the trans->use_count++ hack if we have to commit
      the transaction.  Thanks,
      
      cc: stable@vger.kernel.org
      Reported-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Tested-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Chris Mason <clm@fb.com>
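The use_count rule from the commit above ("only decrease trans->use_count if it is above 1") can be shown with a small userspace model. `demo_trans` and both functions are hypothetical names; the nested start/end inside `demo_run_delayed_work` stands in for the chunk allocation path in the call chain above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a transaction handle. */
struct demo_trans {
	int use_count;     /* 1 for the outermost user, +1 per nested start */
	int teardown_runs; /* how many times the final end path was entered */
};

static void demo_run_delayed_work(struct demo_trans *t);

/*
 * Returns 0 for a nested end (just drops the extra reference) and 1 for
 * the outermost end. The count is never dropped to 0, so a nested
 * start/end that happens during the final work sees 2 -> 1 and returns
 * early instead of re-entering the final path.
 */
int demo_end_transaction(struct demo_trans *t)
{
	assert(t != NULL && t->use_count >= 1);

	if (t->use_count > 1) {
		t->use_count--;
		return 0;
	}

	t->teardown_runs++;
	demo_run_delayed_work(t);
	return 1;
}

/* Work done by the outermost end; it may start a nested handle. */
static void demo_run_delayed_work(struct demo_trans *t)
{
	t->use_count++;                 /* nested start on the same handle */
	/* ... the chunk allocation would happen here ... */
	(void)demo_end_transaction(t);  /* nested end: must not tear down */
}
```

With the previous behaviour described in the message, the outermost end would drop the count to 0 first, so the nested end would also look like the last user and run the delayed refs again while the block being cowed is still locked.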
  4. 10 Mar, 2014 23 commits