1. 04 Oct, 2012 11 commits
  2. 01 Oct, 2012 29 commits
    • Revert "Btrfs: do not do filemap_write_and_wait_range in fsync" · 90abccf2
      Miao Xie authored
      This reverts commit 0885ef5b
      
      After applying the above patch, the performance slowed down because the dirty
      page flush can only be done by one task, so revert it.
      
      The following is the test result of sysbench:
      	Before		After
      	24MB/s		39MB/s
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: remove bytes argument from do_chunk_alloc · 698d0082
      Josef Bacik authored
      Everybody is just making stuff up, and it's just used to see if we really do
      need to alloc a chunk, and since we do this when we already know we really
      do it's just a waste of space.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: delay block group item insertion · ea658bad
      Josef Bacik authored
      So we have lots of places where we try to preallocate chunks in order to
      make sure we have enough space as we make our allocations.  This has
      historically meant that we're constantly tweaking when we should allocate a
      new chunk, and historically we have gotten this horribly wrong so we way
      over allocate either metadata or data.  To try and keep this from happening
      we are going to make it so that the block group item insertion is done out
      of band at the end of a transaction.  This will allow us to create chunks
      even if we are trying to make an allocation for the extent tree.  With this
      patch my enospc tests run faster (didn't expect this) and more efficiently
      use the disk space (this is what I wanted).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • btrfs: Kill some bi_idx references · be3940c0
      Kent Overstreet authored
      For immutable bio vecs, I've been auditing and removing bi_idx
      references. These were harmless, but removing them will make auditing
      easier.
      
      scrub_bio_end_io_worker() was open coding a bio_reset() - but this
      doesn't appear to have been needed for anything as right after it does a
      bio_put(), and perusing the code it doesn't appear anything else was
      holding a reference to the bio.
      
      The other use end_bio_extent_readpage() was just for a pr_debug() -
      changed it to something that might be a bit more useful.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Chris Mason <chris.mason@oracle.com>
      CC: Stefan Behrens <sbehrens@giantdisaster.de>
    • Btrfs: fix unnecessary warning when the fragments make the space alloc fail · 962197ba
      Miao Xie authored
      When we wrote data in compress mode to a btrfs filesystem whose free space
      was badly fragmented, the kernel reported:
      	BTRFS warning (device xxx): Aborting unused transaction.

      The reason is:
      We could not find a contiguous free extent long enough to store the compressed
      data because the free space was fragmented, and compressed data cannot be
      split, so the kernel output the above message.

      In fact, btrfs can deal with this problem very well: it falls back to
      uncompressed IO, splits the uncompressed data into smaller pieces, and then
      stores them in the fragmented free space. So we shouldn't output the
      above warning message.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: create a pinned em when writing to a prealloc range in DIO · 69ffb543
      Josef Bacik authored
      Wade Cline reported a problem where he was getting garbage and warnings when
      writing to a preallocated range via O_DIRECT.  This is because we weren't
      creating our normal pinned extent_map for the range we were writing to,
      which was causing all sorts of issues.  This patch fixes the problem and
      makes his testcase much happier.  Thanks,
      Reported-by: Wade Cline <clinew@linux.vnet.ibm.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: move the sb_end_intwrite until after the throttle logic · 6df7881a
      Josef Bacik authored
      Sage reported the following lockdep backtrace
      
      =====================================
      [ BUG: bad unlock balance detected! ]
      3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted
      -------------------------------------
      btrfs-cleaner/7607 is trying to release lock (sb_internal) at:
      [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
      but there are no more locks to release!
      
      other info that might help us debug this:
      1 lock held by btrfs-cleaner/7607:
       #0:  (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs]
      
      stack backtrace:
      Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1
      Call Trace:
       [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
       [<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110
       [<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310
       [<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160
       [<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs]
       [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
       [<ffffffff810b2a95>] lock_release+0xd5/0x220
       [<ffffffff81173071>] ? kmem_cache_free+0x151/0x160
       [<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90
       [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
       [<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60
       [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
       [<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs]
       [<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs]
       [<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs]
       [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
       [<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs]
       [<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs]
       [<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs]
       [<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
       [<ffffffff810791ee>] kthread+0xae/0xc0
       [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
       [<ffffffff81635430>] ? retint_restore_args+0x13/0x13
       [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
       [<ffffffff8163e740>] ? gs_change+0x13/0x13
      
      This is because the throttle logic can commit the transaction, and the commit
      expects to be the one that ends the intwrite section, but we have already done
      that in __btrfs_end_transaction.  Moving the sb_end_intwrite after this logic
      makes the lockdep warning go away.  Thanks,
      Tested-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: use larger limit for translation of logical to inode · 425d17a2
      Liu Bo authored
      This is the change of the kernel side.
      
      Translation of logical to inode used to have an upper limit of 4k on
      the inode container's size, but that limit is not large enough for data
      with a great many refs, so when resolving a logical address,
      we can end up with
      "ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493"

      This change raises the upper limit to 64k and uses vmalloc instead of
      kmalloc so the memory is easier to get.
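
The numbers in the quoted ioctl output can be checked with a little arithmetic. The sketch below is a user-space illustration, not the kernel code: it assumes the container has a 16-byte header (four 32-bit counters) followed by 64-bit values, which is consistent with cnt=510 for a 4k buffer. The function and macro names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: 16-byte header (four 32-bit counters), then u64 values. */
#define CONTAINER_HEADER_BYTES 16u
#define OLD_LIMIT_BYTES (4u * 1024u)
#define NEW_LIMIT_BYTES (64u * 1024u)

/* Number of 64-bit values that fit in a container of the given total size. */
static uint32_t container_capacity(uint32_t total_bytes)
{
    if (total_bytes < CONTAINER_HEADER_BYTES)
        return 0;
    return (total_bytes - CONTAINER_HEADER_BYTES) / (uint32_t)sizeof(uint64_t);
}

/* Total buffer size needed to hold both the stored and the missed values. */
static uint32_t container_bytes_needed(uint32_t cnt, uint32_t missed)
{
    return CONTAINER_HEADER_BYTES + (cnt + missed) * (uint32_t)sizeof(uint64_t);
}
```

Under these assumptions a 4k container holds 510 values, and the reported case (510 stored, 2493 missed) needs 24040 bytes in total, which fits within the new 64k limit.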
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: use helper for logical resolve · df031f07
      Liu Bo authored
      We already have a helper, iterate_inodes_from_logical(), for logical resolve,
      so just use it.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix a bug in parsing return value in logical resolve · 69917e43
      Liu Bo authored
      In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag.
      
      It is possible to lose our errors because
      (-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true.
      
      I'm not sure if it is on purpose, it just looks too hacky if it is.
      I'd rather use a real flag and a 'ret' to catch errors.
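
A minimal user-space sketch of why the flag test can swallow errors, and of the "real flag plus ret" alternative described above. The flag value `(1ULL << 1)` is an assumption made for illustration, and both helper functions are hypothetical, not the kernel's.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration; the kernel header is authoritative. */
#define EXTENT_FLAG_TREE_BLOCK (1ULL << 1)

/* Buggy pattern: treats a possibly negative return value as a flag word. */
static bool looks_like_tree_block(int ret)
{
    return ((uint64_t)(int64_t)ret & EXTENT_FLAG_TREE_BLOCK) != 0;
}

/* Safer pattern: errors travel in the return value, flags in a separate word. */
static int check_tree_block(int ret, uint64_t flags, bool *is_tree_block)
{
    if (ret < 0)
        return ret;
    *is_tree_block = (flags & EXTENT_FLAG_TREE_BLOCK) != 0;
    return 0;
}
```

With this assumed flag value, `-ENOENT` (that is, -2) has bit 1 set in two's complement, so the buggy test reports a tree block instead of an error. Not every errno value has that bit set, which makes the failure intermittent.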
      Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Liu Bo <liub.liubo@gmail.com>
    • Btrfs: update delayed ref's tracepoints to show sequence · dea7d76e
      Liu Bo authored
      We've added a new field 'sequence' to delayed ref node, so update related
      tracepoints.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: cleanup for unused ref cache stuff · 0647d6bd
      liubo authored
      As the ref cache has been removed from btrfs, nothing uses
      its lock or its check anymore.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix corrupted metadata in the snapshot · 8407aa46
      Miao Xie authored
      When we delete an inode, we remove all of its delayed items, including the
      delayed inode update, and then truncate all the related metadata. If there is
      a lot of metadata, we end the current transaction and start a new one to
      truncate the remaining metadata. In this way, we leave an inode item whose
      link count is > 0, and may also leave some directory index items in the fs/file
      tree after the current transaction ends. In other words, the metadata in this
      fs/file tree is inconsistent. If we create a snapshot of this tree now, we will
      find an inode with corrupted metadata in the new snapshot, and we won't continue
      to drop the remaining metadata, because its link count is not 0.

      We fix this problem by updating the inode item before the current transaction ends.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • btrfs: polish names of kmem caches · 837e1972
      David Sterba authored
      Usecase:
      
        watch 'grep btrfs < /proc/slabinfo'
      
      easy to watch all caches in one go.
      Signed-off-by: David Sterba <dsterba@suse.cz>
    • Btrfs: fix our overcommit math · a80c8dcf
      Josef Bacik authored
      I noticed I was seeing large lags when running my torrent test in a vm on my
      laptop.  While trying to make it lag less I noticed that our overcommit math
      was taking into account the number of bytes we wanted to reclaim, not the
      number of bytes we actually wanted to allocate, which means we wouldn't
      overcommit as often.  This patch fixes the overcommit math and makes
      shrink_delalloc() use that logic so that it will stop looping faster.  We
      still have pretty high spikes of latency, but the test now takes 3 minutes
      less time (about 5% faster).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: wait on async pages when shrinking delalloc · dea31f52
      Josef Bacik authored
      Mitch reported a problem where you could get an ENOSPC error when untarring
      a kernel git tree onto a 16gb file system with compress-force=zlib.  This is
      because compression is a huge pain, it will return from ->writepages()
      without having actually created any ordered extents.  To get around this we
      check to see if the async submit counter is up, and if it is wait until it
      drops to 0 before doing our normal ordered wait dance.  With this patch I
      can now untar a kernel git tree onto a 16gb file system without getting
      ENOSPC errors.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag · 9e8a4a8b
      Liu Bo authored
      We're going to use the EXTENT_DEFRAG flag to indicate which ranges
      belong to defragmentation so that we can implement snapshot-aware defrag:

      We set the EXTENT_DEFRAG flag when dirtying the extents that need to be
      defragmented, so that later on the writeback thread can differentiate between
      normal writeback and writeback started by defragmentation.
      Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: check return value of ulist_alloc() properly · 3d6b5c3b
      Tsutomu Itoh authored
      ulist_alloc() has the possibility of returning NULL.
      So, it is necessary to check the return value.
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
    • Btrfs: fix error handling in delete_block_group_cache() · f54fb859
      Tsutomu Itoh authored
      btrfs_iget() never returns NULL.
      So, the NULL check is unnecessary.
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
    • Btrfs: fix wrong size for the reservation when doing, file pre-allocation. · 903889f4
      Miao Xie authored
      When we ran fsstress (a program in xfstests), the filesystem hung when it
      was full. This was because the space reserved in btrfs_fallocate() was wrong:
      btrfs_fallocate() used only the size of the pre-allocation to reserve the
      space and did not take block size alignment into account, so the reserved
      space was smaller than the allocated space. This caused the over-reserve
      problem and made the filesystem hang when invoking cow_file_range().
      Fix it.
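
The gap between the requested size and the block-aligned size can be shown with a short user-space sketch. This is an illustration of the alignment arithmetic only, not the code in btrfs_fallocate(); the helper names are hypothetical and the block size is assumed to be a power of two.

```c
#include <stdint.h>

/* Round down to a multiple of block_size (block_size must be a power of two). */
static uint64_t align_down(uint64_t x, uint64_t block_size)
{
    return x & ~(block_size - 1);
}

/* Round up to a multiple of block_size (block_size must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t block_size)
{
    return (x + block_size - 1) & ~(block_size - 1);
}

/* Bytes covered once [offset, offset + len) is widened to block boundaries. */
static uint64_t aligned_reservation(uint64_t offset, uint64_t len,
                                    uint64_t block_size)
{
    return align_up(offset + len, block_size) - align_down(offset, block_size);
}
```

For example, a 5000-byte pre-allocation starting at offset 1000 with 4096-byte blocks touches two blocks, so 8192 bytes would be allocated while a reservation based on the raw length would cover only 5000.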
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: output more information when aborting a unused transaction handle · 69ce977a
      Miao Xie authored
      Though we dump the stack information when aborting an unused transaction
      handle, we don't know exactly where we decided to abort the transaction
      handle if one function has several places where the transaction abort
      function is invoked and they all jump to the same place after the call.
      Besides that, we also don't know the reason why we jumped to abort
      the current handle. So I modified the transaction abort function to make
      it output the function name, line and error information.
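
One common way to capture the call site is a macro that passes `__func__` and `__LINE__` to the reporting function. The sketch below is a user-space illustration of that idea, not the kernel's abort function; `format_abort`, `ABORT_TRANSACTION` and `demo_failing_step` are hypothetical names.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats where and why an abort was requested. */
static void format_abort(char *buf, size_t len, const char *func,
                         unsigned int line, int errno_val)
{
    snprintf(buf, len, "%s:%u: Aborting unused transaction (%s).",
             func, line, strerror(errno_val));
}

/*
 * The macro records the call site, so two aborts that jump to the same
 * cleanup label can still be told apart in the log.
 */
#define ABORT_TRANSACTION(buf, len, errno_val) \
    format_abort((buf), (len), __func__, __LINE__, (errno_val))

/* Example caller: the message starts with this function's name. */
static void demo_failing_step(char *buf, size_t len)
{
    ABORT_TRANSACTION(buf, len, ENOSPC);
}
```

Because the macro expands at each call site, every invocation reports its own function name and line number without the caller passing them by hand.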
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix unprotected ->log_batch · 2ecb7923
      Miao Xie authored
      We forgot to protect ->log_batch when syncing a file; this patch fixes
      the problem by using atomic operations. Also, ->log_batch is used to check
      whether there are parallel sync operations, so it is unnecessary to
      reset it to 0 after the sync operation of the current log tree completes.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix wrong size for the reservation of the, snapshot creation · 48c03c4b
      Miao Xie authored
      We should insert/update 6 items (root ref, root backref, dir item, dir index,
      root item and parent inode) when creating a snapshot, not 5 items. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix the snapshot that should not exist · 42874b3d
      Miao Xie authored
      The snapshot should be the image of the fs tree before it was created,
      so the metadata of the snapshot should not exist in its own tree. But now we
      find that the directory item and directory name index are in both the snapshot
      tree and the fs tree. This introduces some problems and confuses users:
      
       # mkfs.btrfs /dev/sda1
       # mount /dev/sda1 /mnt
       # mkdir /mnt/1
       # cd /mnt/1
       # btrfs subvolume snapshot /mnt snap0
       # ls -a /mnt/1/snap0/1
       .	..	[no other file/dir]
      
       # ll /mnt/1/snap0/
       total 0
       drwxr-xr-x 1 root root 10 Ju1 24 12:11 1
      			^^^
      			There is no file/dir in it, but its size is 10

       # cd /mnt/1/snap0/1/snap0
       [Entered a nonexistent directory successfully...]

      There is nothing in directory 1 in snap0, but btrfs reports the length of
      this directory as 10. Besides that, we can enter a nonexistent directory,
      which is very strange to users.
      
       # btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1
       # ll /mnt/1/snap0/1/
       total 0
       [None]
       # ll /mnt/snap1/1/
       total 0
       drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0
      
      And the source of snap1 did not have any directory in directory 1, but snap1
      has a snap0, so the source and the snapshot differ.

      So I think we should insert the directory item and directory name index and
      update the parent inode as the last step of snapshot creation, and not leave
      the useless metadata in the file tree.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: add a new "type" field into the block reservation structure · 66d8f3dd
      Miao Xie authored
      Sometimes we need to choose the reservation method according to the type
      of the block reservation, such as the reservation for the delayed inode update.
      Currently we identify the type just by comparing the address of the reservation
      against the well-known reservation variables, which is very ugly for a temporary
      one because we need to compare it with all the common reservation variables. So
      we add a new "type" field to record the type of the reservation.
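
A minimal sketch of the idea, assuming a simplified reservation structure. The enum values, struct layout and helper names below are hypothetical and are not the kernel's definitions; they only show how a stored type tag replaces a chain of address comparisons.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type tags, not the kernel's actual names. */
enum rsv_type {
    RSV_GLOBAL,
    RSV_DELAYED_OPS,
    RSV_TEMP,
};

struct block_rsv {
    uint64_t size;
    uint64_t reserved;
    enum rsv_type type; /* recorded once when the reservation is set up */
};

static void block_rsv_init(struct block_rsv *rsv, enum rsv_type type)
{
    rsv->size = 0;
    rsv->reserved = 0;
    rsv->type = type;
}

/*
 * One comparison on the tag replaces comparing the pointer against
 * every well-known reservation to conclude "none of them, so temporary".
 */
static bool rsv_is_temporary(const struct block_rsv *rsv)
{
    return rsv->type == RSV_TEMP;
}
```

The tag also works for reservations that live on the stack or the heap, where there is no fixed address to compare against.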
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: use a slab for ordered extents allocation · 6352b91d
      Miao Xie authored
      The ordered extent allocation is in the fast path of the IO, so use a slab
      to improve the speed of the allocation.
      
       "Size of the struct is 280, so this will fall into the size-512 bucket,
        giving 8 objects per page, while own slab will pack 14 objects into a page.
      
        Another benefit I see is to check for leaked objects when the module is
        removed (and the cache destroy takes place)."
      						-- David Sterba
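
The packing figures in the quote follow from integer division. The sketch below assumes a 4096-byte page and ignores any per-slab metadata or alignment padding, so it is an approximation of what the allocator actually does; `objects_per_page` is a hypothetical helper.

```c
#include <stddef.h>

/* Objects of obj_size bytes that fit in one page, ignoring allocator metadata. */
static size_t objects_per_page(size_t page_size, size_t obj_size)
{
    return obj_size ? page_size / obj_size : 0;
}
```

With these assumptions, `objects_per_page(4096, 512)` gives 8 and `objects_per_page(4096, 280)` gives 14, matching the numbers quoted above.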
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix file extent discount problem in the, snapshot · b9a8cc5b
      Miao Xie authored
      If a snapshot is created while we are writing data into a file,
      the i_size of the corresponding file in the snapshot will be wrong: it will
      be beyond the end of the last file extent. And btrfsck will report:
        root 256 inode 257 errors 100
      
      Steps to reproduce:
       # mkfs.btrfs <partition>
       # mount <partition> <mnt>
       # cd <mnt>
       # dd if=/dev/zero of=tmpfile bs=4M count=1024 &
       # for ((i=0; i<4; i++))
       > do
       > btrfs sub snap . $i
       > done
      
      This is because the algorithm for updating disk_i_size is wrong. Though there
      are some ordered extents behind the current one which we use to update
      disk_i_size, it doesn't mean those extents will be dealt with in the same
      transaction. So we shouldn't use the offset of those extents to update
      disk_i_size, or we will get the wrong i_size in the snapshot.

      We fix this problem by recording the max real i_size. If we find there is an
      ordered extent which is in front of the current one and has not completed, we
      record the end of the current one in that ordered extent. Of course, if
      the current extent already holds the end of another extent (which must be greater
      than the end of the current one because that extent is behind the current one),
      we record the number that the current extent holds instead. In this way, we can
      exclude the ordered extents that may not be dealt with in the same transaction,
      and easily know the real disk_i_size.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix full backref problem when inserting shared block reference · 361048f5
      Miao Xie authored
      If we create several snapshots at the same time, the following BUG_ON() will be
      triggered.
      
      	kernel BUG at fs/btrfs/extent-tree.c:6047!
      
      Steps to reproduce:
       # mkfs.btrfs <partition>
       # mount <partition> <mnt>
       # cd <mnt>
       # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
       # for ((i=0; i<4; i++))
       > do
       > mkdir $i
       > for ((j=0; j<200; j++))
       > do
       > btrfs sub snap . $i/$j
       > done &
       > done
      
      The reason is:
      Before transaction commit, some operations changed the fs tree and new tree
      blocks were allocated because of COW. We used the implicit non-shared back
      reference for those newly allocated tree blocks because they were not shared by
      two or more trees.
      
      Then we created the first snapshot of the fs tree. According to the back
      reference rules, we also used implicit back refs for the child tree blocks of
      the root node of the fs tree; those child nodes/leaves were now shared by two
      trees.

      Then we didn't deal with the delayed references, and continued to change the fs
      tree (created the second snapshot and inserted the dir item of the new snapshot
      into the fs tree). According to the back reference rules, we added full
      back refs for those tree blocks whose parents had been shared by two trees.
      Now some newly allocated tree blocks had two types of references.

      As we know, the delayed reference system handles these delayed references from
      back to front, and the full delayed references are inserted after the implicit
      ones. So when we dealt with the back references of those newly allocated tree
      blocks, the full references were dealt with first. And if the first reference
      is a shared back reference and the tree block that the reference points to is
      newly allocated, it is considered to be a tree block which was shared by two
      or more trees when it was allocated, which should have a full back reference,
      not an implicit one, and whose flags should also be set to FULL_BACKREF.
      But in fact, it was a non-shared tree block with an implicit reference at the
      beginning, so there had been no need to set the flags to FULL_BACKREF. So BUG_ON
      was triggered.
      
      We have several methods to fix this bug:
      1. Deal with delayed references after the snapshot is created and before we
         change the source tree of the snapshot. This is the easiest and safest way.
      2. Modify the sort method of the delayed reference tree so that the full delayed
         references are inserted before the implicit ones. It is also very easy, but
         I don't know if it will introduce some problems or not.
      3. Modify select_delayed_ref() so that it selects the implicit delayed reference
         first. This way is not so good because it may waste CPU time if we have
         lots of delayed references.
      4. Set the flags to FULL_BACKREF. This method is a little complex compared with
         the 1st way.
      
      I chose the 1st way to fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix error path in create_pending_snapshot() · 6fa9700e
      Miao Xie authored
      This patch fixes the following problems:
      - If we failed to deal with the delayed dir items, we should abort the transaction,
        just as its comment said. Fix it.
      - If root reference or root back reference insertion failed, we should
        abort the transaction. Fix it.
      - Fix the double free problem of pending->inherit.
      - Do not restore the trans->rsv if we didn't change it.
      - Make the error path clearer.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>