1. 28 Jan, 2014 40 commits
    • Wang Shilong's avatar
      Btrfs: fix to search previous metadata extent item since skinny metadata · ade2e0b3
      Wang Shilong authored
      There is a bug that using btrfs_previous_item() to search metadata extent item.
      This is because in btrfs_previous_item(), we need type match, however, since
      skinny metada was introduced by josef, we may mix this two types. So just
      use btrfs_previous_item() is not working right.
      
      To keep btrfs_previous_item() like normal tree search, i introduce another
      function btrfs_previous_extent_item().
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ade2e0b3
    • Wang Shilong's avatar
      Btrfs: fix missing skinny metadata check in scrub_stripe() · 7c76edb7
      Wang Shilong authored
      Check if we support skinny metadata firstly and fix to use
      right type to search.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      7c76edb7
    • Filipe David Borba Manana's avatar
      Btrfs: fix send to not send non-aligned clone operations · 28e5dd8f
      Filipe David Borba Manana authored
      It is possible for the send feature to send clone operations that
      request a cloning range (offset + length) that is not aligned with
      the block size. This makes the btrfs receive command send issue a
      clone ioctl call that will fail, as the ioctl will return an -EINVAL
      error because of the unaligned range.
      
      Fix this by not sending clone operations for non block aligned ranges,
      and instead send regular write operation for these (less common) cases.
      
      The following xfstest reproduces this issue, which fails on the second
      btrfs receive command without this change:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
      
        tmp=`mktemp -d`
      
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -fr $tmp
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _need_to_be_root
      
        rm -f $seqres.full
      
        _scratch_mkfs >/dev/null 2>&1
        _scratch_mount
      
        $XFS_IO_PROG -f -c "truncate 819200" $SCRATCH_MNT/foo | _filter_xfs_io
        $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
      
        $XFS_IO_PROG -c "falloc -k 819200 667648" $SCRATCH_MNT/foo | _filter_xfs_io
        $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
      
        $XFS_IO_PROG -f -c "pwrite 1482752 2978" $SCRATCH_MNT/foo | _filter_xfs_io
        $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
      
        $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 | \
            _filter_scratch
      
        $XFS_IO_PROG -f -c "truncate 883305" $SCRATCH_MNT/foo | _filter_xfs_io
        $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
      
        $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 | \
            _filter_scratch
      
        $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap 2>&1 | _filter_scratch
        $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
            -f $tmp/2.snap 2>&1 | _filter_scratch
      
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
        md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
      
        _scratch_unmount
        _check_btrfs_filesystem $SCRATCH_DEV
        _scratch_mkfs >/dev/null 2>&1
        _scratch_mount
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
        md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
        md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
      
        _scratch_unmount
        _check_btrfs_filesystem $SCRATCH_DEV
      
        status=0
        exit
      
      The tests expected output is:
      
        QA output created by 025
        FSSync 'SCRATCH_MNT'
        FSSync 'SCRATCH_MNT'
        wrote 2978/2978 bytes at offset 1482752
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        FSSync 'SCRATCH_MNT'
        Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap1'
        FSSync 'SCRATCH_MNT'
        Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap2'
        At subvol SCRATCH_MNT/mysnap1
        At subvol SCRATCH_MNT/mysnap2
        129b8eaee8d3c2bcad49bec596591cb3  SCRATCH_MNT/foo
        42b6369eae2a8725c1aacc0440e597aa  SCRATCH_MNT/mysnap1/foo
        129b8eaee8d3c2bcad49bec596591cb3  SCRATCH_MNT/mysnap2/foo
        At subvol mysnap1
        42b6369eae2a8725c1aacc0440e597aa  SCRATCH_MNT/mysnap1/foo
        At snapshot mysnap2
        129b8eaee8d3c2bcad49bec596591cb3  SCRATCH_MNT/mysnap2/foo
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      28e5dd8f
    • Filipe David Borba Manana's avatar
      Btrfs: fix btrfs boot when compiled as built-in · 14a958e6
      Filipe David Borba Manana authored
      After the change titled "Btrfs: add support for inode properties", if
      btrfs was built-in the kernel (i.e. not as a module), it would cause a
      kernel panic, as reported recently by Fengguang:
      
      [    2.024722] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [    2.027814] IP: [<ffffffff81501594>] crc32c+0xc/0x6b
      [    2.028684] PGD 0
      [    2.028684] Oops: 0000 [#1] SMP
      [    2.028684] Modules linked in:
      [    2.028684] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc7-04795-ga7b57c2 #1
      [    2.028684] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [    2.028684] task: ffff88000edba100 ti: ffff88000edd6000 task.ti: ffff88000edd6000
      [    2.028684] RIP: 0010:[<ffffffff81501594>]  [<ffffffff81501594>] crc32c+0xc/0x6b
      [    2.028684] RSP: 0000:ffff88000edd7e58  EFLAGS: 00010246
      [    2.028684] RAX: 0000000000000000 RBX: ffffffff82295550 RCX: 0000000000000000
      [    2.028684] RDX: 0000000000000011 RSI: ffffffff81efe393 RDI: 00000000fffffffe
      [    2.028684] RBP: ffff88000edd7e60 R08: 0000000000000003 R09: 0000000000015d20
      [    2.028684] R10: ffffffff81ef225e R11: ffffffff811b0222 R12: ffffffffffffffff
      [    2.028684] R13: 0000000000000239 R14: 0000000000000000 R15: 0000000000000000
      [    2.028684] FS:  0000000000000000(0000) GS:ffff88000fa00000(0000) knlGS:0000000000000000
      [    2.028684] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [    2.028684] CR2: 0000000000000000 CR3: 000000000220c000 CR4: 00000000000006f0
      [    2.028684] Stack:
      [    2.028684]  ffffffff82295550 ffff88000edd7e80 ffffffff8238af62 ffffffff8238ac05
      [    2.028684]  0000000000000000 ffff88000edd7e98 ffffffff8238ac0f ffffffff8238ac05
      [    2.028684]  ffff88000edd7f08 ffffffff810002ba ffff88000edd7f00 ffffffff810e2404
      [    2.028684] Call Trace:
      [    2.028684]  [<ffffffff8238af62>] btrfs_props_init+0x4f/0x96
      [    2.028684]  [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
      [    2.028684]  [<ffffffff8238ac0f>] init_btrfs_fs+0xa/0xf0
      [    2.028684]  [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
      [    2.028684]  [<ffffffff810002ba>] do_one_initcall+0xa4/0x13a
      [    2.028684]  [<ffffffff810e2404>] ? parse_args+0x25f/0x33d
      [    2.028684]  [<ffffffff8234cf75>] kernel_init_freeable+0x1aa/0x230
      [    2.028684]  [<ffffffff8234c785>] ? do_early_param+0x88/0x88
      [    2.028684]  [<ffffffff819f61b5>] ? rest_init+0x89/0x89
      [    2.028684]  [<ffffffff819f61c3>] kernel_init+0xe/0x109
      
      The issue here is that the initialization function of btrfs (super.c:init_btrfs_fs)
      started using crc32c (from lib/libcrc32c.c). But when it needs to call crc32c (as
      part of the properties initialization routine), the libcrc32c is not yet initialized,
      so crc32c derreferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel
      panic on boot.
      
      The approach to fix this is to use crypto component directly to use its crc32c (which
      is basically what lib/libcrc32c.c is, a wrapper around crypto). This is what ext4 is
      doing as well, it uses crypto directly to get crc32c functionality.
      
      Verified this works both when btrfs is built-in and when it's loadable kernel module.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      14a958e6
    • Filipe David Borba Manana's avatar
      Btrfs: unlock inodes in correct order in clone ioctl · c57c2b3e
      Filipe David Borba Manana authored
      In the clone ioctl, when the source and target inodes are different,
      we can acquire their mutexes in 2 possible different orders. After
      we're done cloning, we were releasing the mutexes always in the same
      order - the most correct way of doing it is to release them by the
      reverse order they were acquired.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c57c2b3e
    • Wang Shilong's avatar
      Btrfs: optimize to remove unnecessary removal with ulist reallocation · f499e40f
      Wang Shilong authored
      Here we are not going to free memory, no need to remove every node
      one by one, just init root node here is ok.
      
      Cc:  Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      f499e40f
    • Liu Bo's avatar
      Btrfs: release subvolume's block_rsv before transaction commit · de6e8200
      Liu Bo authored
      We don't have to keep subvolume's block_rsv during transaction commit,
      and within transaction commit, we may also need the free space reclaimed
      from this block_rsv to process delayed refs.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      de6e8200
    • Miao Xie's avatar
      Btrfs: fix the race between write back and nocow buffered write · f1de9683
      Miao Xie authored
      When we ran the 274th case of xfstests with nodatacow mount option,
      We met the following warning message:
      WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0
      
      It is caused by the race between the write back and nocow buffered
      write:
        Task1				Task2
        __btrfs_buffered_write()
          skip data reservation
          reserve the metadata space
          copy the data
          dirty the pages
          unlock the pages
      				write back the pages
      				release the data space
         				  becasue there is no
      				  noreserve flag
         set the noreserve flag
      
      This patch fixes this problem by unlocking the pages after
      the noreserve flag is set.
      Reported-by: default avatarTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      f1de9683
    • Josef Bacik's avatar
      Btrfs: only process as many file extents as there are refs · 7ef81ac8
      Josef Bacik authored
      The backref walking code will search down to the key it is looking for and then
      proceed to walk _all_ of the extents on the file until it hits the end.  This is
      suboptimal with large files, we only need to look for as many extents as we have
      references for that inode.  I have a testcase that creates a randomly written 4
      gig file and before this patch it took 6min 30sec to do the initial send, with
      this patch it takes 2min 30sec to do the intial send.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      7ef81ac8
    • Josef Bacik's avatar
      Btrfs: fix qgroup rescan to work with skinny metadata · 3a6d75e8
      Josef Bacik authored
      Could have sworn I fixed this before but apparently not.  This makes us pass
      btrfs/022 with skinny metadata enabled.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3a6d75e8
    • Josef Bacik's avatar
      Btrfs: fix extent_from_logical to deal with skinny metadata · 580f0a67
      Josef Bacik authored
      I don't think this is an issue and I've not seen it in practice but
      extent_from_logical will fail to find a skinny extent because it uses
      btrfs_previous_item and gives it the normal extent item type.  This is just not
      a place to use btrfs_previous_item since we care about either normal extents or
      skinny extents, so open code btrfs_previous_item to properly check.  This would
      only affect metadata and the only place this is used for metadata is scrub and
      I'm pretty sure it's just for printing stuff out, not actually doing any work so
      hopefully it was never a problem other than a cosmetic one.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      580f0a67
    • Josef Bacik's avatar
      Btrfs: throttle delayed refs better · 0a2b2a84
      Josef Bacik authored
      On one of our gluster clusters we noticed some pretty big lag spikes.  This
      turned out to be because our transaction commit was taking like 3 minutes to
      complete.  This is because we have like 30 gigs of metadata, so our global
      reserve would end up being the max which is like 512 mb.  So our throttling code
      would allow a ridiculous amount of delayed refs to build up and then they'd all
      get run at transaction commit time, and for a cold mounted file system that
      could take up to 3 minutes to run.  So fix the throttling to be based on both
      the size of the global reserve and how long it takes us to run delayed refs.
      This patch tracks the time it takes to run delayed refs and then only allows 1
      seconds worth of outstanding delayed refs at a time.  This way it will auto-tune
      itself from cold cache up to when everything is in memory and it no longer has
      to go to disk.  This makes our transaction commits take much less time to run.
      Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      0a2b2a84
    • Josef Bacik's avatar
      Btrfs: attach delayed ref updates to delayed ref heads · d7df2c79
      Josef Bacik authored
      Currently we have two rb-trees, one for delayed ref heads and one for all of the
      delayed refs, including the delayed ref heads.  When we process the delayed refs
      we have to hold onto the delayed ref lock for all of the selecting and merging
      and such, which results in quite a bit of lock contention.  This was solved by
      having a waitqueue and only one flusher at a time, however this hurts if we get
      a lot of delayed refs queued up.
      
      So instead just have an rb tree for the delayed ref heads, and then attach the
      delayed ref updates to an rb tree that is per delayed ref head.  Then we only
      need to take the delayed ref lock when adding new delayed refs and when
      selecting a delayed ref head to process, all the rest of the time we deal with a
      per delayed ref head lock which will be much less contentious.
      
      The locking rules for this get a little more complicated since we have to lock
      up to 3 things to properly process delayed refs, but I will address that problem
      later.  For now this passes all of xfstests and my overnight stress tests.
      Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d7df2c79
    • Josef Bacik's avatar
      Btrfs: make fsync latency less sucky · 5039eddc
      Josef Bacik authored
      Looking into some performance related issues with large amounts of metadata
      revealed that we can have some pretty huge swings in fsync() performance.  If we
      have a lot of delayed refs backed up (as you will tend to do with lots of
      metadata) fsync() will wander off and try to run some of those delayed refs
      which can result in reading from disk and such.  Since the actual act of fsync()
      doesn't create any delayed refs there is no need to make it throttle on delayed
      ref stuff, that will be handled by other people.  With this patch we get much
      smoother fsync performance with large amounts of metadata.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      5039eddc
    • Filipe David Borba Manana's avatar
      Btrfs: add support for inode properties · 63541927
      Filipe David Borba Manana authored
      This change adds infrastructure to allow for generic properties for
      inodes. Properties are name/value pairs that can be associated with
      inodes for different purposes. They are stored as xattrs with the
      prefix "btrfs."
      
      Properties can be inherited - this means when a directory inode has
      inheritable properties set, these are added to new inodes created
      under that directory. Further, subvolumes can also have properties
      associated with them, and they can be inherited from their parent
      subvolume. Naturally, directory properties have priority over subvolume
      properties (in practice a subvolume property is just a regular
      property associated with the root inode, objectid 256, of the
      subvolume's fs tree).
      
      This change also adds one specific property implementation, named
      "compression", whose values can be "lzo" or "zlib" and it's an
      inheritable property.
      
      The corresponding changes to btrfs-progs were also implemented.
      A patch with xfstests for this feature will follow once there's
      agreement on this change/feature.
      
      Further, the script at the bottom of this commit message was used to
      do some benchmarks to measure any performance penalties of this feature.
      
      Basically the tests correspond to:
      
      Test 1 - create a filesystem and mount it with compress-force=lzo,
      then sequentially create N files of 64Kb each, measure how long it took
      to create the files, unmount the filesystem, mount the filesystem and
      perform an 'ls -lha' against the test directory holding the N files, and
      report the time the command took.
      
      Test 2 - create a filesystem and don't use any compression option when
      mounting it - instead set the compression property of the subvolume's
      root to 'lzo'. Then create N files of 64Kb, and report the time it took.
      The unmount the filesystem, mount it again and perform an 'ls -lha' like
      in the former test. This means every single file ends up with a property
      (xattr) associated to it.
      
      Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
      compression property, have no real effect other than adding more work
      when inheriting properties and taking more btree leaf space.
      
      Test 4 - same as test 3 but with 10 properties per file.
      
      Results (in seconds, and averages of 5 runs each), for different N
      numbers of files follow.
      
      * Without properties (test 1)
      
                          file creation time        ls -lha time
      10 000 files              3.49                   0.76
      100 000 files            47.19                   8.37
      1 000 000 files         518.51                 107.06
      
      * With 1 property (compression property set to lzo - test 2)
      
                          file creation time        ls -lha time
      10 000 files              3.63                    0.93
      100 000 files            48.56                    9.74
      1 000 000 files         537.72                  125.11
      
      * With 4 properties (test 3)
      
                          file creation time        ls -lha time
      10 000 files              3.94                    1.20
      100 000 files            52.14                   11.48
      1 000 000 files         572.70                  142.13
      
      * With 10 properties (test 4)
      
                          file creation time        ls -lha time
      10 000 files              4.61                    1.35
      100 000 files            58.86                   13.83
      1 000 000 files         656.01                  177.61
      
      The increased latencies with properties are essencialy because of:
      
      *) When creating an inode, we now synchronously write 1 more item
         (an xattr item) for each property inherited from the parent dir
         (or subvolume). This could be done in an asynchronous way such
         as we do for dir intex items (delayed-inode.c), which could help
         reduce the file creation latency;
      
      *) With properties, we now have larger fs trees. For this particular
         test each xattr item uses 75 bytes of leaf space in the fs tree.
         This could be less by using a new item for xattr items, instead of
         the current btrfs_dir_item, since we could cut the 'location' and
         'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
         total of 26 bytes per xattr item) from the btrfs_dir_item type.
      
      Also tried batching the xattr insertions (ignoring proper hash
      collision handling, since it didn't exist) when creating files that
      inherit properties from their parent inode/subvolume, but the end
      results were (surprisingly) essentially the same.
      
      Test script:
      
      $ cat test.pl
        #!/usr/bin/perl -w
      
        use strict;
        use Time::HiRes qw(time);
        use constant NUM_FILES => 10_000;
        use constant FILE_SIZES => (64 * 1024);
        use constant DEV => '/dev/sdb4';
        use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
        use constant TEST_DIR => (MNT_POINT . '/testdir');
      
        system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
      
        # following line for testing without properties
        #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        # following 2 lines for testing with properties
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
        system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
      
        system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
        my ($t1, $t2);
      
        $t1 = time();
        for (my $i = 1; $i <= NUM_FILES; $i++) {
            my $p = TEST_DIR . '/file_' . $i;
            open(my $f, '>', $p) or die "Error opening file!";
            $f->autoflush(1);
            for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
                print $f ('A' x 4096) or die "Error writing to file!";
            }
            close($f);
        }
        $t2 = time();
        print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        $t1 = time();
        system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
        $t2 = time();
        print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      63541927
    • Filipe David Borba Manana's avatar
      Btrfs: faster file extent item replace operations · 1acae57b
      Filipe David Borba Manana authored
      When writing to a file we drop existing file extent items that cover the
      write range and then add a new file extent item that represents that write
      range.
      
      Before this change we were doing a tree lookup to remove the file extent
      items, and then after we did another tree lookup to insert the new file
      extent item.
      Most of the time all the file extent items we need to drop are located
      within a single leaf - this is the leaf where our new file extent item ends
      up at. Therefore, in this common case just combine these 2 operations into
      a single one.
      
      By avoiding the second btree navigation for insertion of the new file extent
      item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
      COW operations, CPU time on btree node/leaf key binary searches, etc.
      
      Besides for file writes, this is an operation that happens for file fsync's
      as well. However log btrees are much less likely to big as big as regular
      fs btrees, therefore the impact of this change is smaller.
      
      The following benchmark was performed against an SSD drive and a
      HDD drive, both for random and sequential writes:
      
        sysbench --test=fileio --file-num=4096 --file-total-size=8G \
           --file-test-mode=[rndwr|seqwr] --num-threads=512 \
           --file-block-size=8192 \ --max-requests=1000000 \
           --file-fsync-freq=0 --file-io-mode=sync [prepare|run]
      
      All results below are averages of 10 runs of the respective test.
      
      ** SSD sequential writes
      
      Before this change: 225.88 Mb/sec
      After this change:  277.26 Mb/sec
      
      ** SSD random writes
      
      Before this change: 49.91 Mb/sec
      After this change:  56.39 Mb/sec
      
      ** HDD sequential writes
      
      Before this change: 68.53 Mb/sec
      After this change:  69.87 Mb/sec
      
      ** HDD random writes
      
      Before this change: 13.04 Mb/sec
      After this change:  14.39 Mb/sec
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1acae57b
    • Wang Shilong's avatar
      Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot() · 90515e7f
      Wang Shilong authored
      We may return early in btrfs_drop_snapshot(), we shouldn't
      call btrfs_std_err() for this case, fix it.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      90515e7f
    • Wang Shilong's avatar
      Btrfs: remove unnecessary transaction commit before send · 8e56338d
      Wang Shilong authored
      We will finish orphan cleanups during snapshot, so we don't
      have to commit transaction here.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8e56338d
    • Wang Shilong's avatar
      Btrfs: fix protection between send and root deletion · 18f687d5
      Wang Shilong authored
      We should gurantee that parent and clone roots can not be destroyed
      during send, for this we have two ideas.
      
      1.by holding @subvol_sem, this might be a nightmare, because it will
      block all subvolumes deletion for a long time.
      
      2.Miao pointed out we can reuse @send_in_progress, that mean we will
      skip snapshot deletion if root sending is in progress.
      
      Here we adopt the second approach since it won't block other subvolumes
      deletion for a long time.
      
      Besides in btrfs_clean_one_deleted_snapshot(), we only check first root
      , if this root is involved in send, we return directly rather than
      continue to check.There are several reasons about it:
      
      1.this case happen seldomly.
      2.after sending,cleaner thread can continue to drop that root.
      3.make code simple
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      18f687d5
    • Wang Shilong's avatar
      Btrfs: fix wrong send_in_progress accounting · 896c14f9
      Wang Shilong authored
      Steps to reproduce:
       # mkfs.btrfs -f /dev/sda8
       # mount /dev/sda8 /mnt
       # btrfs sub snapshot -r /mnt /mnt/snap1
       # btrfs sub snapshot -r /mnt /mnt/snap2
       # btrfs send /mnt/snap1 -p /mnt/snap2 -f /mnt/1
       # dmesg
      
      The problem is that we will sort clone roots(include @send_root), it
      might push @send_root before thus @send_root's @send_in_progress will
      be decreased twice.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      896c14f9
    • Qu Wenruo's avatar
      btrfs: Add treelog mount option. · a88998f2
      Qu Wenruo authored
      Add treelog mount option to enable tree log with
      remount option.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a88998f2
    • Qu Wenruo's avatar
      btrfs: Add datasum mount option. · d399167d
      Qu Wenruo authored
      Add datasum mount option to enable checksum with
      remount option.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d399167d
    • Qu Wenruo's avatar
      btrfs: Add datacow mount option. · a258af7a
      Qu Wenruo authored
      Add datacow mount option to enable copy-on-write with
      remount option.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a258af7a
    • Qu Wenruo's avatar
      btrfs: Add acl mount option. · bd0330ad
      Qu Wenruo authored
      Add acl mount option to enable acl with remount option.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      bd0330ad
    • Qu Wenruo's avatar
      btrfs: Add noflushoncommit mount option. · 2c9ee856
      Qu Wenruo authored
      Add noflushoncommit mount option to disable flush on commit with
      remount option.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      2c9ee856
    • Qu Wenruo's avatar
      btrfs: Add noenospc_debug mount option. · 53036293
      Qu Wenruo authored
      Add noenospc_debug mount option to disable ENOSPC debug with
      remount option.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      53036293
    • Qu Wenruo's avatar
      btrfs: Add nodiscard mount option. · e07a2ade
      Qu Wenruo authored
      Add nodiscard mount option to disable discard with remount option.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e07a2ade
    • Qu Wenruo's avatar
      btrfs: Add noautodefrag mount option. · fc0ca9af
      Qu Wenruo authored
      Btrfs has autodefrag mount option but no pairing noautodefrag option,
      which makes it impossible to disable autodefrag without umount.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      fc0ca9af
    • Qu Wenruo's avatar
      btrfs: Add "barrier" option to support "-o remount,barrier" · 842bef58
      Qu Wenruo authored
      Btrfs can be remounted without barrier, but there is no "barrier" option
      so nobody can remount btrfs back with barrier on. Only umount and
      mount again can re-enable barrier.(Quite awkward)
      
      Also the mount options in the document is also changed slightly for the
      further pairing options changes.
      Reported-by: default avatarDaniel Blueman <daniel@quora.org>
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarMike Fleetwood <mike.fleetwood@googlemail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      842bef58
    • Wang Shilong's avatar
      Btrfs: only fua the first superblock when writting supers · e8117c26
      Wang Shilong authored
      We only intent to fua the first superblock in every device from
      comments, fix it.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e8117c26
    • Liu Bo's avatar
      Btrfs: return free space to global_rsv as much as possible · 17504584
      Liu Bo authored
      @full is not protected within global_rsv.lock, so we may think global_rsv
      is already full but in fact it's not, so we miss the opportunity to return
      free space to global_rsv directly when we release other block_rsvs.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      17504584
    • Wang Shilong's avatar
      Btrfs: fix an oops when we fail to relocate tree blocks · 1708cc57
      Wang Shilong authored
      During balance test, we hit an oops:
      [ 2013.841551] kernel BUG at fs/btrfs/relocation.c:1174!
      
      The problem is that if we fail to relocate tree blocks, we should
      update backref cache, otherwise, some pending nodes are not updated
      while snapshot check @cache->last_trans is within one transaction
      and won't update it and then oops happen.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1708cc57
    • Miao Xie's avatar
      Btrfs: fix the wrong nocow range check · e77751aa
      Miao Xie authored
      The following warning message was outputed when running the 274th case
      of xfstests with nodatacow option:
       BUG: Bad page state in process kswapd0  pfn:1c66f
       page:ffffea0000636848 count:0 mapcount:0 mapping:(null) index:0x78000
       page flags: 0x1000000000100a(error|uptodate|private_2)
      
      It is because the check of nocow range was wrong, we should compare the
      start and end position of the extent with the write position to verify
      if the write position was in the extent, but the current code just used
      the start postion to do the check, so we got the wrong extent and told
      the caller that it was a nocow write. And then when we write back the
      dirty pages, we found we should cow the extent, but at that time, there
      was no space in the fs, we had to the error flag for the page. When
      someone reclaimed that page, the above warning outputed. Fix it.
      Reported-by: default avatarTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e77751aa
    • Wang Shilong's avatar
      Btrfs: fix an oops when we fail to merge reloc roots · 25e293c2
      Wang Shilong authored
      Previously, we will free reloc root memory and then force filesystem
      to be readonly. The problem is that there may be another thread commiting
      transaction which will try to access freed reloc root during merging reloc
      roots process.
      
      To keep consistency snapshots shared space, we should allow snapshot
      finished if possible, so here we don't free reloc root memory.
      signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      25e293c2
    • Wang Shilong's avatar
      Btrfs: remove unused argument from select_reloc_root() · dc4103f9
      Wang Shilong authored
      @nr is no longer used, remove it from select_reloc_root()
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      dc4103f9
    • Filipe David Borba Manana's avatar
      Btrfs: reduce btree node locking duration on item update · eb653de1
      Filipe David Borba Manana authored
      If we do a btree search with the goal of updating an existing item
      without changing its size (ins_len == 0 and cow == 1), then we never
      need to hold locks on upper level nodes (even when slot == 0) after we
      COW their child nodes/leaves, as we won't have node splits or merges
      in this scenario (that is, no key additions, removals or shifts on any
      nodes or leaves).
      
      Therefore release the locks immediately after COWing the child nodes/leaves
      while navigating the btree, even if their parent slot is 0, instead of
      returning a path to the caller with those nodes locked, which would get
      released only when the caller releases or frees the path (or if it calls
      btrfs_unlock_up_safe).
      
      This is a common scenario, for example when updating inode items in fs
      trees and block group items in the extent tree.
      
      The following benchmarks were performed on a quad core machine with 32Gb
      of ram, using a leaf/node size of 4Kb (to generate deeper fs trees more
      quickly).
      
        sysbench --test=fileio --file-num=131072 --file-total-size=8G \
          --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
          --max-requests=100000 --file-io-mode=sync [prepare|run]
      
      Before this change:  49.85Mb/s (average of 5 runs)
      After this change:   50.38Mb/s (average of 5 runs)
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      eb653de1
    • Wenliang Fan's avatar
      fs/btrfs: Integer overflow in btrfs_ioctl_resize() · eb8052e0
      Wenliang Fan authored
      The local variable 'new_size' comes from userspace. If a large number
      was passed, there would be an integer overflow in the following line:
      	new_size = old_size + new_size;
      Signed-off-by: default avatarWenliang Fan <fanwlexca@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      eb8052e0
    • Josef Bacik's avatar
      Btrfs: stop caching thread if extent_commit_sem is contended · c9ea7b24
      Josef Bacik authored
      We can starve out the transaction commit with a bunch of caching threads all
      running at the same time.  This is because we will only drop the
      extent_commit_sem if we need_resched(), which isn't likely to happen since we
      will be reading a lot from the disk so have already schedule()'ed plenty.  Alex
      observed that he could starve out a transaction commit for up to a minute with
      32 caching threads all running at once.  This will allow us to drop the
      extent_commit_sem to allow the transaction commit to swap the commit_root out
      and then all the cachers will start back up. Here is an explanation provided by
      Igno
      
      So, just to fill in what happens in this loop:
      
                                      mutex_unlock(&caching_ctl->mutex);
                                      cond_resched();
                                      goto again;
      
      where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem
      again:
      
              again:
                      mutex_lock(&caching_ctl->mutex);
                      /* need to make sure the commit_root doesn't disappear */
                      down_read(&fs_info->extent_commit_sem);
      
      So, if I'm reading the code correct, there can be a fair amount of
      concurrency here: there may be multiple 'caching kthreads' per filesystem
      active, while there's one fs_info->extent_commit_sem per filesystem
      AFAICS.
      
      So, what happens if there are a lot of CPUs all busy holding the
      ->extent_commit_sem rwsem read-locked and a writer arrives? They'd all
      rush to try to release the fs_info->extent_commit_sem, and they'd block in
      the down_read() because there's a writer waiting.
      
      So there's a guarantee of forward progress. This should answer akpm's
      concern I think.
      
      Thanks,
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c9ea7b24
    • Josef Bacik's avatar
      rwsem: add rwsem_is_contended · 4a444b1f
      Josef Bacik authored
      Btrfs needs a simple way to know if it needs to let go of it's read lock on a
      rwsem.  Introduce rwsem_is_contended to check to see if there are any waiters on
      this rwsem currently.  This is just a hueristic, it is meant to be light and not
      100% accurate and called by somebody already holding on to the rwsem in either
      read or write.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      4a444b1f
    • Miao Xie's avatar
      Btrfs: introduce the delayed inode ref deletion for the single link inode · 67de1176
      Miao Xie authored
      The inode reference item is close to inode item, so we insert it simultaneously
      with the inode item insertion when we create a file/directory.. In fact, we also
      can handle the inode reference deletion by the same way. So we made this patch to
      introduce the delayed inode reference deletion for the single link inode(At most
      case, the file doesn't has hard link, so we don't take the hard link into account).
      
      This function is based on the delayed inode mechanism. After applying this patch,
      we can reduce the time of the file/directory deletion by ~10%.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      67de1176