1. 10 Mar, 2014 4 commits
    • Btrfs: fix use-after-free in the finishing procedure of the device replace · c404e0dc
      Miao Xie authored
      During device replace tests, we hit a null pointer dereference (it was very
      easy to reproduce by running xfstests' btrfs/011 on devices with the virtio
      scsi driver). Two bugs caused this problem:
      - We might allocate new chunks on the replaced device after we updated
        the mapping tree, and we forgot to replace the source device in the
        mappings of those new chunks.
      - We might get mapping information that includes the source device
        before the mapping information is updated, and then submit a bio
        based on that mapping information after we freed the source device.
      
      For the first bug, we fix it by updating the mapping tree and removing the
      source device in the same chunk mutex context. The chunk mutex protects the
      allocable device list, so holding it prevents new chunk allocation, and
      after we remove the source device, all new chunks are allocated on the
      new device.
      
      For the second bug, we need to make sure all in-flight bios have finished and
      no new bios are produced while we are removing the source device. To fix
      this, we introduce a global @bio_counter. We not only inc/dec
      @bio_counter outside of map_blocks, but also inc it before submitting a bio
      and dec it when the bio ends.
      
      Since Raid56 is a little different and device replace doesn't support raid56
      yet, it is not addressed in this patch, and I added comments to make sure we
      fix it in the future.
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      c404e0dc
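The in-flight bio accounting described in commit c404e0dc can be illustrated with a toy, single-threaded model in Python. This is only a sketch: the class and method names are hypothetical, and where the kernel would presumably make a submitter wait, this model simply refuses the new bio.

```python
class BioCounter:
    """Toy model of an in-flight I/O counter (all names are hypothetical)."""

    def __init__(self):
        self.in_flight = 0      # bios submitted but not yet ended
        self.blocked = False    # set while the source device is being removed

    def try_inc(self):
        """Called before mapping/submitting a bio; refuses new work while blocked."""
        if self.blocked:
            return False
        self.in_flight += 1
        return True

    def dec(self):
        """Called when a bio ends."""
        assert self.in_flight > 0
        self.in_flight -= 1

    def block(self):
        """Stop accepting new bios."""
        self.blocked = True

    def drained(self):
        """True once new bios are blocked and every in-flight bio has ended."""
        return self.blocked and self.in_flight == 0


counter = BioCounter()
counter.try_inc()                   # one bio is already in flight
counter.block()                     # replace is finishing: no new bios allowed
refused = not counter.try_inc()     # a new submission is turned away
safe_before = counter.drained()     # a bio may still reference the old device
counter.dec()                       # the in-flight bio completes
safe_after = counter.drained()      # only now may the source device be freed
```

The point of the sketch is the ordering: block new submissions first, then wait for the count to reach zero, and only then free the device.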
    • Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace · 391cd9df
      Miao Xie authored
      The alloc list of the filesystem is protected by ->chunk_mutex, so we need
      to take that mutex when we insert the new device into the list.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      391cd9df
    • btrfs: Return EXDEV for cross file system snapshot · 23ad5b17
      Kusanagi Kouichi authored
      EXDEV seems an appropriate error if an operation fails because it
      crosses file system boundaries.
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      23ad5b17
    • Btrfs: don't mix the ordered extents of all files together during logging the inodes · 827463c4
      Miao Xie authored
      There was a problem in the old code:
      If we failed to log the csum, we would free all the ordered extents in the log
      list, including those that were logged successfully, so the log committer
      would not wait for the completion of those ordered extents.
      
      This patch doesn't insert the ordered extents that are about to be logged into
      a global list. Instead, we insert them into a local list. If we log the ordered
      extents successfully, we splice them onto the global list. Otherwise we throw
      them away and do a full sync. This also reduces lock contention and list
      traversal time.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      827463c4
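The local-list-then-splice pattern from commit 827463c4 can be sketched in Python as follows. The function name and the `log_csum` callback are hypothetical stand-ins, not the kernel's interfaces; the sketch only shows why a failure no longer disturbs extents that were already on the global list.

```python
def log_ordered_extents(extents, global_list, log_csum):
    """Return True if all extents were logged and spliced onto global_list."""
    local = []                      # per-call list; needs no shared lock
    for extent in extents:
        if not log_csum(extent):    # hypothetical callback: False means failure
            return False            # local list is dropped; caller does a full sync
        local.append(extent)
    global_list.extend(local)       # a single splice once everything succeeded
    return True


global_list = ["a1"]                # extents logged earlier for another file
ok = log_ordered_extents(["b1", "b2"], global_list, lambda e: True)
failed = log_ordered_extents(["c1", "c2"], global_list, lambda e: e != "c2")
```

After the failing call, `global_list` still holds only the extents from the successful calls, so nothing that was logged correctly is thrown away.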
  2. 15 Feb, 2014 2 commits
    • Btrfs: use right clone root offset for compressed extents · 93de4ba8
      Filipe David Borba Manana authored
      For non-compressed extents, iterate_extent_inodes() gives us offsets
      that take into account the data offset from the file extent items, while
      for compressed extents it doesn't. Therefore we have to adjust them before
      placing them in a send clone instruction. Not doing this adjustment leads to
      the receiving end requesting a wrong file range from the clone ioctl,
      which results in file content different from that in the original send
      root.
      
      Issue reproducible with the following excerpt from the test I made for
      xfstests:
      
        _scratch_mkfs
        _scratch_mount "-o compress-force=lzo"
      
        $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
      
        $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
      
        $XFS_IO_PROG -c "pwrite -S 0x3e -b 80000 200000 80000" $SCRATCH_MNT/foo
        $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT
        $XFS_IO_PROG -c "pwrite -S 0xdc -b 10000 250000 10000" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0xff -b 10000 300000 10000" $SCRATCH_MNT/foo
      
        # will be used for incremental send to be able to issue clone operations
        $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/clones_snap
      
        $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2
      
        $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
        $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
            -x $SCRATCH_MNT/mysnap2/clones_snap $SCRATCH_MNT/mysnap2
        $FSSUM_PROG -A -f -w $tmp/clones.fssum $SCRATCH_MNT/clones_snap \
            -x $SCRATCH_MNT/clones_snap/mysnap1 -x $SCRATCH_MNT/clones_snap/mysnap2
      
        $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap
        $BTRFS_UTIL_PROG send $SCRATCH_MNT/clones_snap -f $tmp/clones.snap
        $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 \
            -c $SCRATCH_MNT/clones_snap $SCRATCH_MNT/mysnap2 -f $tmp/2.snap
      
        _scratch_unmount
        _scratch_mkfs
        _scratch_mount
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
        $FSSUM_PROG -r $tmp/1.fssum $SCRATCH_MNT/mysnap1 2>> $seqres.full
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/clones.snap
        $FSSUM_PROG -r $tmp/clones.fssum $SCRATCH_MNT/clones_snap 2>> $seqres.full
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
        $FSSUM_PROG -r $tmp/2.fssum $SCRATCH_MNT/mysnap2 2>> $seqres.full
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      93de4ba8
    • btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105 · f085381e
      Anand Jain authored
      bdev is null when a disk has disappeared and the filesystem is mounted
      with the degrade option
      
      stack trace
      ---------
      btrfs_sysfs_add_one+0x105/0x1c0 [btrfs]
      open_ctree+0x15f3/0x1fe0 [btrfs]
      btrfs_mount+0x5db/0x790 [btrfs]
      ? alloc_pages_current+0xa4/0x160
      mount_fs+0x34/0x1b0
      vfs_kern_mount+0x62/0xf0
      do_mount+0x22e/0xa80
      ? __get_free_pages+0x9/0x40
      ? copy_mount_options+0x31/0x170
      SyS_mount+0x7e/0xc0
      system_call_fastpath+0x16/0x1b
      ---------
      
      reproducer:
      -------
      mkfs.btrfs -draid1 -mraid1 /dev/sdc /dev/sdd
      (detach a disk)
      devmgt detach /dev/sdc [1]
      mount -o degrade /dev/sdd /btrfs
      -------
      
      [1] github.com/anajain/devmgt.git
      Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
      Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      f085381e
  3. 14 Feb, 2014 4 commits
    • Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol · 3a0dfa6a
      Josef Bacik authored
      A user was running into errors from an NFS export of a subvolume that had a
      default subvol set.  When we mount a default subvol we will use d_obtain_alias()
      to find an existing dentry for the subvolume in the case that the root subvol
      has already been mounted, or a dummy one is allocated in the case that the root
      subvol has not already been mounted.  This allows us to connect the dentry later
      on if we wander into the path.  However if we don't ever wander into the path we
      will keep DCACHE_DISCONNECTED set for a long time, which angers NFS.  It doesn't
      appear to cause any problems but it is annoying nonetheless, so simply unset
      DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to
      use d_materialise_unique() instead which will make everything play nicely
      together and reconnect stuff if we wander into the default subvol path from a
      different way.  With this patch I'm no longer getting the NFS errors when
      exporting a volume that has been mounted with a default subvol set.  Thanks,
      
      cc: bfields@fieldses.org
      cc: ebiederm@xmission.com
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3a0dfa6a
    • Btrfs: fix max_inline mount option · feb5f965
      Mitch Harder authored
      Currently, the only mount option for max_inline that has any effect is
      max_inline=0.  Any other value that is supplied to max_inline will be
      adjusted to a minimum of 4k.  Since max_inline has an effective maximum
      of ~3900 bytes due to page size limitations, the current behaviour
      only has meaning for max_inline=0.
      
      This patch will allow the max_inline mount option to accept non-zero
      values as indicated in the documentation.
      Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: Chris Mason <clm@fb.com>
      feb5f965
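The max_inline behaviour described in commit feb5f965 can be modelled in a few lines of Python. The "old" function follows the message (non-zero values are raised to at least 4k). The "new" function is an assumption about the shape of the fix, namely capping the value at the sector size instead of raising it; the function names and the 4096-byte sector size are illustrative only.

```python
SECTORSIZE = 4096  # assumed sector size, matching the "4k" in the message


def old_max_inline(requested, sectorsize=SECTORSIZE):
    """Behaviour described as broken: non-zero values are raised to at least 4k."""
    return max(requested, sectorsize) if requested else 0


def new_max_inline(requested, sectorsize=SECTORSIZE):
    """Assumed fix: cap the value at the sector size instead of raising it."""
    return min(requested, sectorsize) if requested else 0


before = old_max_inline(2048)   # the request is silently bumped up to 4096
after = new_max_inline(2048)    # the request is honoured as given
```

With the old clamp, every non-zero request ends up above the roughly 3900-byte effective limit, which is why only max_inline=0 changed anything.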
    • Btrfs: fix a lockdep warning when cleaning up aborted transaction · a9d2d4ad
      Liu Bo authored
      Given that we now have 2 spinlocks for the management of delayed refs,
      CONFIG_DEBUG_SPINLOCK=y helped me find this:
      
      [ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258
      [ 4723.414882]  lock: 0xffff880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2
      [ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: G        W  O 3.12.0+ #4
      [ 4723.421321] Call Trace:
      [ 4723.421872]  [<ffffffff81680fe7>] dump_stack+0x54/0x74
      [ 4723.422753]  [<ffffffff81681093>] spin_dump+0x8c/0x91
      [ 4723.424979]  [<ffffffff816810b9>] spin_bug+0x21/0x26
      [ 4723.425846]  [<ffffffff81323956>] do_raw_spin_unlock+0x66/0x90
      [ 4723.434424]  [<ffffffff81689bf7>] _raw_spin_unlock+0x27/0x40
      [ 4723.438747]  [<ffffffffa015da9e>] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs]
      [ 4723.443321]  [<ffffffffa015df54>] btrfs_cleanup_transaction+0x104/0x570 [btrfs]
      [ 4723.444692]  [<ffffffff810c1b5d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
      [ 4723.450336]  [<ffffffff810c1c2d>] ? trace_hardirqs_on+0xd/0x10
      [ 4723.451332]  [<ffffffffa015e5ee>] transaction_kthread+0x22e/0x270 [btrfs]
      [ 4723.452543]  [<ffffffffa015e3c0>] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs]
      [ 4723.457833]  [<ffffffff81079efa>] kthread+0xea/0xf0
      [ 4723.458990]  [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
      [ 4723.460133]  [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
      [ 4723.460865]  [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
      [ 4723.496521] ------------[ cut here ]------------
      
      ----------------------------------------------------------------------
      
      The reason is that we call cond_resched_lock(&head_ref->lock) while
      still holding @delayed_refs->lock.
      
      This is different from __btrfs_run_delayed_refs(), where we do a drop-acquire
      dance before and after actually processing delayed refs.
      
      Here we don't drop the lock, so others are not able to add new delayed refs to
      head_ref, and cond_resched_lock(&head_ref->lock) is not necessary.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a9d2d4ad
    • Revert "btrfs: add ioctl to export size of global metadata reservation" · 11bcac89
      Chris Mason authored
      This reverts commit 01e219e8.
      
      David Sterba found a different way to provide these features without adding a new
      ioctl.  We haven't released any progs with this ioctl yet, so I'm taking this out
      for now until we finalize things.
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.cz>
      CC: Jeff Mahoney <jeffm@suse.com>
      11bcac89
  4. 09 Feb, 2014 5 commits
    • Btrfs: fix data corruption when reading/updating compressed extents · a2aa75e1
      Filipe David Borba Manana authored
      When using a mix of compressed file extents and prealloc extents, it
      is possible to fill a page of a file with random, garbage data from
      some unrelated previous use of the page, instead of a sequence of zeroes.
      
      A simple sequence of steps to get into such case, taken from the test
      case I made for xfstests, is:
      
         _scratch_mkfs
         _scratch_mount "-o compress-force=lzo"
         $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar
         $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar
         $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar
         $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
      
      This results in the following file items in the fs tree:
      
         item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
             inode generation 6 transid 6 size 542872 block group 0 mode 100600
         item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16
             inode ref index 2 namelen 6 name: foobar
         item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
             extent data disk byte 0 nr 0 gen 6
             extent data offset 0 nr 24576 ram 266240
             extent compression 0
         item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53
             prealloc data disk byte 12849152 nr 241664 gen 6
             prealloc data offset 0 nr 241664
         item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53
             extent data disk byte 12845056 nr 4096 gen 6
             extent data offset 0 nr 20480 ram 20480
             extent compression 2
         item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53
             prealloc data disk byte 13090816 nr 405504 gen 6
             prealloc data offset 0 nr 258048
      
      The on-disk extent at offset 266240 (which corresponds to a single disk block)
      contains 5 compressed chunks of file data. Each of the first 4 compresses 4096
      bytes of file data, while the last one compresses only 3024 bytes of file data.
      Therefore a read of the file region [285648, 286720) (length = 4096 - 3024 =
      1072 bytes) should always return zeroes (our next extent is a prealloc one).
      
      The solution here is for the compression code path to zero the remaining
      (untouched) bytes of the last page it uncompressed data into, as the information
      about how much space the file data consumes in the last page is not known in the
      upper layer, fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were
      correctly zeroing the remainder of the page, but only if it corresponds to the
      last page of the inode and the inode's size is not a multiple of the page size.
      
      The bug caused not only random data to be returned on reads, but also random
      data to be stored permanently when updating parts of the region that should be
      zeroed. For the example above, updating a single byte in the region
      [285648, 286720) would store that byte correctly but also store random data on
      disk.
      
      A test case for xfstests follows soon.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a2aa75e1
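The arithmetic and the tail-zeroing idea in commit a2aa75e1 can be checked with a short Python sketch. The helper name is hypothetical and a `bytearray` stands in for a page; the 0xaa fill represents whatever was in the page before, with the first 3024 bytes playing the role of decompressed file data.

```python
PAGE_SIZE = 4096  # page size assumed by the example in the message


def zero_page_tail(page, used):
    """Zero every byte of `page` after the first `used` bytes, in place."""
    page[used:] = bytes(len(page) - used)
    return page


# Arithmetic from the example: four full 4096-byte chunks plus one of 3024 bytes.
extent_start = 266240
data_end = extent_start + 4 * PAGE_SIZE + 3024   # first byte with no file data
page_end = extent_start + 5 * PAGE_SIZE          # end of the last page
stale = bytearray(b"\xaa" * PAGE_SIZE)           # page with leftover contents
clean = zero_page_tail(stale, 3024)              # only the tail is cleared
```

Here `data_end` and `page_end` reproduce the bounds of the region quoted in the message, and their difference is the 1072 bytes that must read back as zeroes.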
    • Btrfs: don't loop forever if we can't run because of the tree mod log · 27a377db
      Josef Bacik authored
      A user reported a 100% cpu hang with my new delayed ref code.  Turns out I
      forgot to increase the count check when we can't run a delayed ref because of
      the tree mod log.  If we can't run any delayed refs during this there is no
      point in continuing to look, and we need to break out.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      27a377db
    • btrfs: reserve no transaction units in btrfs_ioctl_set_features · 8051aa1a
      David Sterba authored
      The superblock modifications added in patch "btrfs: add ioctls to
      query/change feature bits online" don't need to reserve metadata blocks
      when starting a transaction.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      8051aa1a
    • btrfs: commit transaction after setting label and features · d0270aca
      Jeff Mahoney authored
      The set_fslabel ioctl uses btrfs_end_transaction, which means it's
      possible that the change will be lost if the system crashes, same for
      the newly set features. Let's use btrfs_commit_transaction instead.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      d0270aca
    • Btrfs: fix assert screwup for the pending move stuff · 6cc98d90
      Josef Bacik authored
      Wang noticed that he was failing btrfs/030 even though Filipe and I couldn't
      reproduce it.  Turns out this is because Wang didn't have CONFIG_BTRFS_ASSERT set,
      which meant that a key part of Filipe's original patch was not being built in.
      This appears to be a mess up with merging Filipe's patch as it does not exist in
      his original patch.  Fix this by changing how we make sure del_waiting_dir_move
      asserts that it did not error and take the function out of the ifdef check.
      This makes btrfs/030 pass with the assert on or off.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Filipe Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      6cc98d90
  5. 03 Feb, 2014 3 commits
  6. 29 Jan, 2014 14 commits
    • Btrfs: fix spin_unlock in check_ref_cleanup · cf93da7b
      Chris Mason authored
      Our goto out should have gone a little farther.
      Signed-off-by: Chris Mason <clm@fb.com>
      cf93da7b
    • Btrfs: setup inode location during btrfs_init_inode_locked · 90d3e592
      Chris Mason authored
      We have a race during inode init because the BTRFS_I(inode)->location is set up
      after the inode hash table lock is dropped.  btrfs_find_actor uses the location
      field, so our search might not find an existing inode in the hash table if we
      race with the inode init code.
      
      This commit changes things to set up the location field sooner.  Also the find
      actor now uses only the location objectid to match inodes.  For inode hashing,
      we just need a unique and stable test; it doesn't have to reflect the inode
      numbers we show to userland.
      Signed-off-by: Chris Mason <clm@fb.com>
      CC: stable@vger.kernel.org
      90d3e592
    • Btrfs: don't use ram_bytes for uncompressed inline items · 514ac8ad
      Chris Mason authored
      If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
      the new size.  The fix uses the size directly from the item header when
      reading uncompressed inlines, and also fixes truncate to update the
      size as it goes.
      Reported-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      CC: stable@vger.kernel.org
      514ac8ad
    • Btrfs: fix btrfs_search_slot_for_read backwards iteration · 23c6bf6a
      Filipe David Borba Manana authored
      If the current path's leaf slot is 0, we search for the previous
      leaf (via btrfs_prev_leaf) and set the new path's leaf slot to the
      number of items of the former leaf minus 1.
      Fix this by using the slot set by btrfs_prev_leaf, decrementing it
      by 1 if it's equal to the leaf's number of items.
      
      Backward iteration with btrfs_search_slot_for_read() is used in
      particular by the send feature, which could miss items when the input
      leaf has fewer items than its previous leaf.
      
      This could be reproduced by running btrfs/007 from xfstests in a loop.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      23c6bf6a
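The slot selection described in commit 23c6bf6a can be compared with the old behaviour in a small Python model. Both function names are hypothetical, and leaves are reduced to plain item counts; this is a sketch of the rule stated in the message, not the kernel code.

```python
def slot_after_prev_leaf(slot_from_prev_leaf, prev_nritems):
    """Pick the slot to read in the previous leaf when iterating backwards.

    `slot_from_prev_leaf` is the slot left in the path by the step that moved
    to the previous leaf; it may point one past the last item.
    """
    if slot_from_prev_leaf == prev_nritems:
        return slot_from_prev_leaf - 1
    return slot_from_prev_leaf


def buggy_slot(former_nritems):
    """Old behaviour as described: derive the slot from the leaf we came from."""
    return former_nritems - 1


# The previous leaf holds 10 items; the leaf we came from holds only 3.
fixed = slot_after_prev_leaf(10, 10)   # last item of the previous leaf
broken = buggy_slot(3)                 # lands too early in the previous leaf
```

In this example the old rule would resume at a slot well before the end of the previous leaf, so the items after it would never be visited during backward iteration.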
    • Btrfs: do not export ulist functions · 49fc647a
      Wang Shilong authored
      There are no users of ulist other than Btrfs, so don't
      export its functions.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      49fc647a
    • Btrfs: rework ulist with list+rb_tree · 4c7a6f74
      Wang Shilong authored
      We are really suffering from the current ulist implementation. Some developers
      have made attempts at it, and here are some of my ideas:
      
       1. use list+rb_tree instead of array+rb_tree
      
       2. add cur_list to the iterator rather than to the ulist structure.
      
       3. add a seqnum to every node when it is added; this is
       used for a self-check when iterating nodes.
      
      I noticed Zach Brown's earlier comments: the long-term plan is to get rid of
      the ulist implementation. For now, however, we at least need to remove the
      array from ulist.
      
      Cc: Liu Bo <bo.li.liu@oracle.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Zach Brown <zab@redhat.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      4c7a6f74
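The three ideas listed in commit 4c7a6f74 can be pictured with a small Python model. It is a sketch under stated assumptions: a Python list plays the role of the linked list, a dict stands in for the rb_tree, and all class and method names are hypothetical.

```python
class UList:
    """Unique list: insertion-ordered nodes plus an index for duplicate checks."""

    def __init__(self):
        self.nodes = []     # plays the role of the linked list
        self.index = {}     # value -> sequence number (stand-in for the rb_tree)

    def add(self, val):
        """Return True if val was newly added, False if already present."""
        if val in self.index:
            return False
        self.index[val] = len(self.nodes)   # seqnum recorded at insertion
        self.nodes.append(val)
        return True


class UListIterator:
    """The cursor lives in the iterator, not in the ulist itself."""

    def __init__(self, ulist):
        self.ulist = ulist
        self.pos = 0

    def next(self):
        if self.pos >= len(self.ulist.nodes):
            return None
        val = self.ulist.nodes[self.pos]
        # Self-check: the node's recorded seqnum must match its position.
        assert self.ulist.index[val] == self.pos
        self.pos += 1
        return val


ul = UList()
ul.add(8)
it = UListIterator(ul)
seen = []
val = it.next()
while val is not None:
    seen.append(val)
    if val < 10:
        ul.add(val + 1)     # nodes added during iteration are still visited
        ul.add(8)           # duplicates are ignored
    val = it.next()
```

Keeping the cursor in the iterator means several walks over the same ulist do not interfere with each other, and appending while iterating stays well defined.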
    • Btrfs: fix memory leaks on walking backrefs failure · f05c4746
      Wang Shilong authored
      When walking backrefs, we may iterate over every inode's extents
      and add/merge them into a ulist, and the caller will free the memory
      from the ulist.
      
      However, if we fail to allocate memory for an inode's extents element,
      or ulist_add() fails to allocate memory, the already allocated memory
      is not added to the ulist, so the caller won't free it and it leaks.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      f05c4746
    • Btrfs: fix send file hole detection leading to data corruption · bf54f412
      Filipe David Borba Manana authored
      There was a case where file hole detection was incorrect and it would
      cause an incremental send to overwrite a section of a file with zeroes.
      
      This happened in the case where, between the last leaf we processed which
      contained a file extent item for our current inode and the leaf we're
      currently at (which also has a file extent item for our current inode), there
      are only leafs containing exclusively file extent items for our current
      inode, and none of them was updated since the previous send operation.
      The file hole detection code would incorrectly consider the file range
      covered by these leafs as a hole.
      
      A test case for xfstests follows soon.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      bf54f412
    • Btrfs: add a reschedule point in btrfs_find_all_roots() · bca1a290
      Wang Shilong authored
      I can easily trigger the following warnings when enabling quota
      in my virtual machine (running openSUSE). The steps are to first create
      a subvolume full of fragmented extents, and then create many snapshots
      (500 in my test case).
      
      [ 2362.808459] BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-qgroup-re:1970]
      
      [ 2362.809023] task: e4af8450 ti: e371c000 task.ti: e371c000
      [ 2362.809026] EIP: 0060:[<fa38f4ae>] EFLAGS: 00000246 CPU: 0
      [ 2362.809049] EIP is at __merge_refs+0x5e/0x100 [btrfs]
      [ 2362.809051] EAX: 00000000 EBX: cfadbcf0 ECX: 00000000 EDX: cfadbcb0
      [ 2362.809052] ESI: dd8d3370 EDI: e371dde0 EBP: e371dd6c ESP: e371dd5c
      [ 2362.809054]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
      [ 2362.809055] CR0: 80050033 CR2: ac454d50 CR3: 009a9000 CR4: 001407d0
      [ 2362.809099] Stack:
      [ 2362.809100]  00000001 e371dde0 dfcc6890 f29f8000 e371de28 fa39016d 00000011 00000001
      [ 2362.809105]  99bfc000 00000000 93928000 00000000 00000001 00000050 e371dda8 00000001
      [ 2362.809109]  f3a31000 f3413000 00000001 e371ddb8 000040a8 00000202 00000000 00000023
      [ 2362.809113] Call Trace:
      [ 2362.809136]  [<fa39016d>] find_parent_nodes+0x34d/0x1280 [btrfs]
      [ 2362.809156]  [<fa391172>] btrfs_find_all_roots+0xb2/0x110 [btrfs]
      [ 2362.809174]  [<fa3934a8>] btrfs_qgroup_rescan_worker+0x358/0x7a0 [btrfs]
      [ 2362.809180]  [<c024d0ce>] ? lock_timer_base.isra.39+0x1e/0x40
      [ 2362.809199]  [<fa3648df>] worker_loop+0xff/0x470 [btrfs]
      [ 2362.809204]  [<c027a88a>] ? __wake_up_locked+0x1a/0x20
      [ 2362.809221]  [<fa3647e0>] ? btrfs_queue_worker+0x2b0/0x2b0 [btrfs]
      [ 2362.809225]  [<c025ebbc>] kthread+0x9c/0xb0
      [ 2362.809229]  [<c06b487b>] ret_from_kernel_thread+0x1b/0x30
      [ 2362.809233]  [<c025eb20>] ? kthread_create_on_node+0x110/0x110
      
      By adding a reschedule point at the end of btrfs_find_all_roots(), I no longer
      hit these warnings.
      
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      bca1a290
    • Btrfs: make send's file extent item search more efficient · 7fdd29d0
      Filipe David Borba Manana authored
      Instead of looking for a file extent item, processing it, releasing the path
      and doing a btree search for the next file extent item, just process all
      file extent items in a leaf without intermediate btree searches. This way
      we save cpu and we're not blocking other tasks or affecting concurrency on
      the btree, because send's paths use the commit root and skip btree node/leaf
      locking.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      7fdd29d0
    • Btrfs: fix to catch all errors when resolving indirect ref · 95def2ed
      Wang Shilong authored
      We can only tolerate ENOENT here; for other errors, we should
      return directly.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      95def2ed
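The error policy stated in commit 95def2ed (skip ENOENT, return on anything else) can be sketched in Python as below. The function and the `resolve_one` callback are hypothetical; the callback is assumed to return 0 or a negative errno, following the kernel convention.

```python
import errno


def resolve_all(refs, resolve_one):
    """Resolve refs; skip ENOENT, stop at the first other error."""
    resolved = []
    for ref in refs:
        ret = resolve_one(ref)
        if ret == -errno.ENOENT:
            continue                # the one error that is tolerated
        if ret < 0:
            return ret, resolved    # any other error is returned directly
        resolved.append(ref)
    return 0, resolved


# Canned results for illustration only.
results = {"a": 0, "b": -errno.ENOENT, "c": 0, "d": -errno.ENOMEM, "e": 0}
ok_ret, ok_refs = resolve_all(["a", "b", "c"], results.get)
err_ret, err_refs = resolve_all(["a", "d", "e"], results.get)
```

In the second call the loop stops at the ENOMEM result, so the later ref is never attempted and the caller sees the real error instead of a silently shortened list.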
    • Btrfs: fix protection between walking backrefs and root deletion · 538f72cd
      Wang Shilong authored
      There is a race condition between resolving an indirect ref and root deletion,
      and we should guarantee that the root cannot be destroyed, to avoid accessing
      a broken tree here.
      
      Here we fix it by holding @subvol_srcu, and we release it as soon
      as we hold the root node lock.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      538f72cd
    • btrfs: fix warning while merging two adjacent extents · 3c9665df
      Gui Hecheng authored
      When we have two adjacent extents in relink_extent_backref,
      we try to merge them. When we use btrfs_search_slot to locate the
      slot for the current extent, we shouldn't set "ins_len = 1",
      because we will merge it into the previous extent rather than
      insert a new item. Otherwise, we may happen to create a new leaf
      in btrfs_search_slot and path->slots[0] will be 0. Then we try to
      fetch the previous item using "path->slots[0]--", and it will cause
      a warning as follows:
      
      	[  145.713385] WARNING: CPU: 3 PID: 1796 at fs/btrfs/extent_io.c:5043 map_private_extent_buffer+0xd4/0xe0
      	[  145.713387] btrfs bad mapping eb start 53370886 len 4096, wanted 167772306 8
      	...
      	[  145.713462]  [<ffffffffa034b1f4>] map_private_extent_buffer+0xd4/0xe0
      	[  145.713476]  [<ffffffffa030097a>] ? btrfs_free_path+0x2a/0x40
      	[  145.713485]  [<ffffffffa0340864>] btrfs_get_token_64+0x64/0xf0
      	[  145.713498]  [<ffffffffa033472c>] relink_extent_backref+0x41c/0x820
      	[  145.713508]  [<ffffffffa0334d69>] btrfs_finish_ordered_io+0x239/0xa80
      
      I encountered this warning when running defrag on a filesystem created by
      mkfs.btrfs with option -M, while reads/writes and snapshots were
      running in the background.
      Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3c9665df
    • Btrfs: fix infinite path build loops in incremental send · 9f03740a
      Filipe David Borba Manana authored
      The send operation processes inodes by their ascending number, and assumes
      that any rename/move operation can be successfully performed (sent to the
      caller) once all previous inodes (those with a smaller inode number than the
      one we're currently processing) were processed.
      
      This is not true when an incremental send has to process a hierarchical change
      between 2 snapshots where the parent-children relationship between directory
      inodes was reversed - that is, parents became children and children became
      parents. This situation made the path building code go into an infinite loop,
      which kept allocating more and more memory and eventually led to a krealloc
      warning being displayed in dmesg:
      
        WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
        Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
        CPU: 1 PID: 5705 Comm: btrfs Tainted: G           O 3.13.0-rc7-fdm-btrfs-next-18+ #3
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
        [ 5381.660441]  00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
        [ 5381.660447]  0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
        [ 5381.660452]  0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
        [ 5381.660457] Call Trace:
        [ 5381.660464]  [<ffffffff81777434>] dump_stack+0x4e/0x68
        [ 5381.660471]  [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
        [ 5381.660476]  [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
        [ 5381.660480]  [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
        [ 5381.660487]  [<ffffffff8108313f>] ? local_clock+0x4f/0x60
        [ 5381.660491]  [<ffffffff811430e8>] ? free_one_page+0x98/0x440
        [ 5381.660495]  [<ffffffff8108313f>] ? local_clock+0x4f/0x60
        [ 5381.660502]  [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
        [ 5381.660508]  [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
        [ 5381.660515]  [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
        [ 5381.660520]  [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
        [ 5381.660524]  [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
        [ 5381.660530]  [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
        [ 5381.660536]  [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
        [ 5381.660560]  [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
        [ 5381.660564]  [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
        [ 5381.660569]  [<ffffffff811580ef>] krealloc+0x6f/0xb0
        [ 5381.660586]  [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
        [ 5381.660601]  [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
        [ 5381.660615]  [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
        [ 5381.660628]  [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
      
      Even without this loop, the incremental send couldn't succeed, because it would attempt
      to send a rename/move operation for the inode with the lower number before the inode
      with the higher number was renamed/moved. This issue is easy to trigger with the
      following steps:
      
        $ mkfs.btrfs -f /dev/sdb3
        $ mount /dev/sdb3 /mnt/btrfs
        $ mkdir -p /mnt/btrfs/a/b/c/d
        $ mkdir /mnt/btrfs/a/b/c2
        $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
        $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
        $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
        $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
        $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
      
      The structure of the filesystem when the first snapshot is taken is:
      
      	 .                       (ino 256)
      	 |-- a                   (ino 257)
      	     |-- b               (ino 258)
      	         |-- c           (ino 259)
      	         |   |-- d       (ino 260)
      	         |
      	         |-- c2          (ino 261)
      
      And its structure when the second snapshot is taken is:
      
      	 .                       (ino 256)
      	 |-- a                   (ino 257)
      	     |-- b               (ino 258)
      	         |-- c2          (ino 261)
      	             |-- d2      (ino 260)
      	                 |-- cc  (ino 259)
      
      Before the move/rename operation is performed for the inode 259, the
      move/rename for inode 260 must be performed, since 259 is now a child
      of 260.
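
      The ordering constraint above can be illustrated with a small toy model.
      This is only a sketch and not the kernel implementation: the dictionaries,
      the helper names (is_blocked, order_renames) and the retry-in-passes
      strategy are assumptions made for illustration. It delays the rename of an
      inode while any of its new ancestors still has a rename pending.

```python
# Toy model of the two snapshots: inode -> (parent inode, name).
# Inode 256 is the subvolume root. All names here are hypothetical.
ROOT = 256
SNAP1 = {257: (256, "a"), 258: (257, "b"), 259: (258, "c"),
         260: (259, "d"), 261: (258, "c2")}
SNAP2 = {257: (256, "a"), 258: (257, "b"), 259: (260, "cc"),
         260: (261, "d2"), 261: (258, "c2")}


def is_blocked(ino, new, changed, done, root=ROOT):
    """True if an ancestor of ino in the new tree still awaits its rename."""
    parent = new[ino][0]
    while parent != root:
        if parent in changed and parent not in done:
            return True
        parent = new[parent][0]
    return False


def order_renames(old, new, root=ROOT):
    """Return renamed inodes in an order where each rename is safe to send."""
    changed = {ino for ino in new if old.get(ino) != new[ino]}
    done, order = set(), []
    pending = sorted(changed)          # start from ascending inode order
    while pending:
        waiting = []
        for ino in pending:
            if is_blocked(ino, new, changed, done, root):
                waiting.append(ino)    # delay until the ancestor has moved
            else:
                order.append(ino)
                done.add(ino)
        if len(waiting) == len(pending):
            raise ValueError("no progress: inconsistent parent links")
        pending = waiting
    return order


if __name__ == "__main__":
    print(order_renames(SNAP1, SNAP2))
```

      In this model a plain ascending walk would try inode 259 first, although
      its new parent (inode 260) has not been moved yet; the sketch defers 259
      until 260 has been handled.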
      
      A test case for xfstests, with a more complex scenario, will follow soon.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9f03740a
  7. 28 Jan, 2014 8 commits