  1. 29 Apr, 2019 3 commits
    • btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_data_ref() · 76675593
      Qu Wenruo authored
      Just like btrfs_add_delayed_tree_ref(), use btrfs_ref to refactor
      btrfs_add_delayed_data_ref().
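
      A sketch of what the refactored call site looks like; the helpers
      (btrfs_init_generic_ref(), btrfs_init_data_ref()) are the ones
      introduced by commit b28b1f0c below, but the exact signatures are
      illustrative rather than copied from the patch, and the variables are
      the caller's locals:

        struct btrfs_ref generic_ref = { 0 };

        /* Common members first: action, bytenr, length, parent. */
        btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_REF,
        		       bytenr, num_bytes, parent);
        /* Data-specific members: owning root, inode number, file offset. */
        btrfs_init_data_ref(&generic_ref, ref_root, ino, offset);

        ret = btrfs_add_delayed_data_ref(trans, &generic_ref, reserved);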
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_tree_ref() · ed4f255b
      Qu Wenruo authored
      btrfs_add_delayed_tree_ref() has an ever-growing parameter list, and
      some callers like btrfs_inc_extent_ref() use @owner to carry the tree
      level for delayed tree refs.
      
      Instead of making the parameter list even longer, use btrfs_ref to
      refactor it, so that each parameter assignment is self-explanatory,
      without the dirty level/owner trick, and provides the basis for later
      refactoring (see the sketch below).
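
      A sketch of the refactored call site (helper names from commit
      b28b1f0c below; the exact signatures are illustrative, not copied from
      the patch, and the variables are the caller's locals):

        struct btrfs_ref generic_ref = { 0 };

        /* Common members: action, bytenr, length, parent. */
        btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_REF,
        		       bytenr, num_bytes, parent);
        /* Tree-specific members: the level is named, not hidden in @owner. */
        btrfs_init_tree_ref(&generic_ref, level, root_objectid);

        ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, extent_op);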
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: delayed-ref: Introduce better documented delayed ref structures · b28b1f0c
      Qu Wenruo authored
      The current delayed ref interface has several problems:
      
      - Longer and longer parameter lists
        bytenr
        num_bytes
        parent
        ---------- so far so good
        ref_root
        owner
        offset
        ---------- I don't feel good now
      
      - Different interpretation of the same parameter
      
        The @owner above is the inode number (u64) for a data ref, while
        for a tree ref it's the level (int).
        
        They don't even share a value range: a level only needs 0 ~ 8,
        while an ino spans BTRFS_FIRST_FREE_OBJECTID ~
        BTRFS_LAST_FREE_OBJECTID.
        
        And @offset doesn't even make sense for a tree ref.
        
        Such parameter reuse may look clever, like a hidden union, but it
        destroys code readability.
      
      To solve both problems, introduce a new structure, btrfs_ref:
      
      - Structure instead of long parameter list
        This makes later expansion easier, and is better documented.
      
      - Use btrfs_ref::type to distinguish data and tree ref
      
      - Use proper union to store data/tree ref specific structures.
      
      - Use separate functions to fill data/tree ref data, with a common generic
        function to fill common bytenr/num_bytes members.
      
      All parameters find their place in btrfs_ref, and an extra member,
      @real_root, inspired by the ref-verify code, is introduced for later
      qgroup code, to record which tree triggered this extent modification.
      
      This patch doesn't touch any existing code, but provides the basis
      for further refactoring (see the sketch below).
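
      A condensed sketch of the structures as described above; the member
      names follow the commit description, but this is an abridged
      illustration, not the full definition from delayed-ref.h:

        struct btrfs_data_ref {
        	u64 ref_root;
        	u64 ino;
        	u64 offset;
        };

        struct btrfs_tree_ref {
        	int level;
        	u64 root;
        };

        struct btrfs_ref {
        	enum btrfs_ref_type type;	/* data or metadata */
        	int action;

        	/* Common members, filled by a generic init helper. */
        	u64 bytenr;
        	u64 len;

        	/* Which tree triggered the modification, for qgroup code. */
        	u64 real_root;

        	/* Type-specific members, selected by ::type. */
        	union {
        		struct btrfs_data_ref data_ref;
        		struct btrfs_tree_ref tree_ref;
        	};
        };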
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 25 Feb, 2019 1 commit
    • btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record · 1418bae1
      Qu Wenruo authored
      
      [BUG]
      Btrfs/139 will fail with a high probability if the testing machine (VM)
      has only 2G RAM.
      
      The final write succeeds when it should fail with EDQUOT, and the fs
      ends up exceeding the quota limit by 16K.
      
      A simplified reproducer (needs a 2G RAM VM):
      
        $ mkfs.btrfs -f $dev
        $ mount $dev $mnt
      
        $ btrfs subv create $mnt/subv
        $ btrfs quota enable $mnt
        $ btrfs quota rescan -w $mnt
        $ btrfs qgroup limit -e 1G $mnt/subv
      
        $ for i in $(seq -w  1 8); do
        	xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
        	echo "file $i written" > /dev/kmsg
          done
        $ sync
        $ btrfs qgroup show -pcre --raw $mnt
      
      The last pwrite does not trigger EDQUOT, and the final 'qgroup show'
      shows something like:
      
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5             16384        16384         none         none ---     ---
        0/256      1073758208   1073758208         none   1073741824 ---     ---
      
      And 1073758208 is larger than the limit of 1073741824.
      
      [CAUSE]
      It's a bug in btrfs qgroup data reserved space management.
      
      For quota limit, we must ensure that:
        reserved (data + metadata) + rfer/excl <= limit
      
      Since rfer/excl is only updated at transaction commit time, reserved
      space needs special care.
      
      One important part of reserved space is data: for a new data extent
      written to disk, we still need to hold its reserved space until the
      rfer/excl numbers get updated.
      
      Originally, when an ordered extent finishes, we migrate the reserved
      qgroup data space from the extent_io tree to the delayed ref head of
      the data extent, expecting the delayed ref to be cleaned up only at
      transaction commit time.
      
      However, on a machine with little RAM, memory pressure can flush
      dirty pages back to disk without committing a transaction.
      
      The related events will be something like:
      
        file 1 written
        btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
        btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
        btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
        btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
        btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
        cleanup_ref_head: num_bytes=54947840
        cleanup_ref_head: num_bytes=5636096
        cleanup_ref_head: num_bytes=569344
        cleanup_ref_head: num_bytes=57344
        cleanup_ref_head: num_bytes=8192
        ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
        file 2 written
        ...
        file 8 written
        cleanup_ref_head: num_bytes=8192
        ...
        btrfs_commit_transaction  <<< the only transaction committed during
      				the test
      
      When file 2 is written, we have already freed the 128M of qgroup data
      space reserved for ino 258, so later writes won't trigger EDQUOT.
      
      This allows us to keep writing data beyond the qgroup limit.
      
      In my 2G RAM VM, it could reach about 1.2G before hitting EDQUOT.
      
      [FIX]
      By moving the reserved qgroup data space from btrfs_delayed_ref_head
      to btrfs_qgroup_extent_record, we ensure that the reserved data space
      won't be freed halfway through, before the transaction commits, which
      fixes the problem (see the sketch below).
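
      A condensed sketch of the data-structure side of the fix; the new
      member names (data_rsv, data_rsv_refroot) reflect my reading of the
      change, and the listing is abridged, not the full definition:

        struct btrfs_qgroup_extent_record {
        	u64 bytenr;
        	u64 num_bytes;
        	/*
        	 * New home for the reserved data space: extent records are
        	 * only processed at transaction commit time, so the
        	 * reservation can no longer be dropped early by
        	 * cleanup_ref_head().
        	 */
        	u64 data_rsv;
        	u64 data_rsv_refroot;
        };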
      
      Fixes: f64d5ca8 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 17 Dec, 2018 1 commit
  4. 15 Oct, 2018 4 commits
  5. 06 Aug, 2018 2 commits
  6. 28 May, 2018 2 commits
  7. 20 Apr, 2018 1 commit
    • btrfs: Fix race condition between delayed refs and blockgroup removal · 5e388e95
      Nikolay Borisov authored
      When all the delayed refs for a head have run, cleanup_ref_head is
      eventually called, which (in case of a deletion) obtains a reference
      to the relevant btrfs_space_info struct by querying the block group
      for the range. This is problematic because when the last extent of a
      block group is deleted, a race window opens between the removal of
      that block group and the subsequent invocation of cleanup_ref_head.
      This can result in the looked-up block group being NULL and either a
      null pointer dereference or an assertion failure.
      
      	task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
      	RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
      	RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
      	RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
      	RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
      	RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
      	R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
      	R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
      	FS:  00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
      	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      	Call Trace:
      	 __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
      	 btrfs_run_delayed_refs+0x68/0x250 [btrfs]
      	 btrfs_should_end_transaction+0x42/0x60 [btrfs]
      	 btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
      	 btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
      	 evict+0xc6/0x190
      	 do_unlinkat+0x19c/0x300
      	 do_syscall_64+0x74/0x140
      	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      	RIP: 0033:0x7fbf589c57a7
      
      To fix this, introduce a new flag, "is_system", in head_ref structs,
      populated at insertion time (see the sketch below). This decouples
      the space_info lookup from querying the possibly deleted block group.
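
      A rough sketch of the idea; the helper name is hypothetical, and the
      assumption is that system chunks are identified by the chunk tree
      being the owning root at insertion time:

        struct btrfs_delayed_ref_head {
        	/* ... existing members ... */
        	/*
        	 * Recorded while the owning root is still known, so
        	 * cleanup_ref_head() can pick the right btrfs_space_info
        	 * without touching the (possibly removed) block group.
        	 */
        	unsigned int is_system:1;
        };

        /* Hypothetical helper showing where the flag would be set. */
        static void head_ref_record_system(struct btrfs_delayed_ref_head *head,
        				   u64 ref_root)
        {
        	head->is_system = (ref_root == BTRFS_CHUNK_TREE_OBJECTID);
        }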
      
      Fixes: d7eae340 ("Btrfs: rework delayed ref total_bytes_pinned accounting")
      CC: stable@vger.kernel.org # 4.14+
      Suggested-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 12 Apr, 2018 1 commit
  9. 26 Mar, 2018 1 commit
    • btrfs: add more __cold annotations · e67c718b
      David Sterba authored
      The __cold functions are placed into a special section, as they're
      expected to be called rarely. This can help i-cache prefetches, and
      helps the compiler decide which branches are more or less likely to
      be taken without any other annotations.
      
      Though we can't add more __exit annotations, it's still possible to
      add __cold (which __exit also implies). That way the following
      function categories are tagged:
      
      - printf wrappers, error messages
      - exit helpers
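
      A minimal illustration on a printf-style wrapper (the function below
      is hypothetical, not one from the patch):

        /* Rarely called error reporting: keep it out of the hot text. */
        static void __cold report_setup_failure(const char *reason)
        {
        	pr_err("btrfs: setup failed: %s\n", reason);
        }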
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 22 Jan, 2018 1 commit
  11. 01 Nov, 2017 1 commit
    • btrfs: track refs in a rb_tree instead of a list · 0e0adbcf
      Josef Bacik authored
      If we get a significant number of delayed refs for a single block
      (think modifying multiple snapshots) we can end up spending an
      ungodly amount of time looping through all of the entries trying to
      see if they can be merged.  This is because we only add them to a
      list, so we have O(2n) for every ref head.  This doesn't make any
      sense, as we likely have refs for different roots, which cannot be
      merged.  Tracking them in a tree lets us stop as soon as we hit an
      entry that doesn't match, making our worst case O(n).
      
      With this we can also merge entries more easily.  Before, we had to
      hope that matching refs were at the ends of our list, but with the
      tree we can search down to exact matches and merge them at insert
      time (see the sketch below).
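
      A simplified sketch of the ordering idea: key the refs so that
      mergeable candidates sort next to each other.  The struct and
      function below are illustrative stand-ins; the comparator added by
      the patch (comp_refs()) also folds in per-type fields and sequence
      numbers:

        /* Illustrative stand-in for the key fields of a delayed ref. */
        struct ref_key {
        	int type;	/* tree ref vs data ref */
        	u64 root;
        	u64 objectid;
        	u64 offset;
        };

        /*
         * Total order: equal keys (merge candidates) become rb-tree
         * neighbours, so a walk can stop at the first differing key.
         */
        static int ref_key_cmp(const struct ref_key *a,
        		       const struct ref_key *b)
        {
        	if (a->type != b->type)
        		return a->type < b->type ? -1 : 1;
        	if (a->root != b->root)
        		return a->root < b->root ? -1 : 1;
        	if (a->objectid != b->objectid)
        		return a->objectid < b->objectid ? -1 : 1;
        	if (a->offset != b->offset)
        		return a->offset < b->offset ? -1 : 1;
        	return 0;
        }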
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  12. 30 Oct, 2017 1 commit
  13. 29 Jun, 2017 1 commit
  14. 18 Apr, 2017 1 commit
  15. 14 Feb, 2017 2 commits
  16. 30 Nov, 2016 1 commit
    • btrfs: improve delayed refs iterations · 1d57ee94
      Wang Xiaoguang authored
      This issue was found when I tried to delete a heavily reflinked file.
      When deleting such files, other transaction operations get no chance
      to make progress; for example, start_transaction() will block in
      wait_current_trans(root) for a long time, sometimes even triggering
      soft lockups. The time taken to delete such a heavily reflinked file
      is also very large, often hundreds of seconds. perf top reports:
      
      PerfTop:    7416 irqs/sec  kernel:99.8%  exact:  0.0% [4000Hz cpu-clock],  (all, 4 CPUs)
      ---------------------------------------------------------------------------------------
          84.37%  [btrfs]             [k] __btrfs_run_delayed_refs.constprop.80
          11.02%  [kernel]            [k] delay_tsc
           0.79%  [kernel]            [k] _raw_spin_unlock_irq
           0.78%  [kernel]            [k] _raw_spin_unlock_irqrestore
           0.45%  [kernel]            [k] do_raw_spin_lock
           0.18%  [kernel]            [k] __slab_alloc
      __btrfs_run_delayed_refs() takes most of the CPU time. After some
      debugging, I found select_delayed_ref() to be the cause: in our case
      the delayed head is full of BTRFS_DROP_DELAYED_REF nodes, but
      select_delayed_ref() first iterates the whole node list to find
      BTRFS_ADD_DELAYED_REF nodes, which is obviously a disaster here and
      wastes much time.
      
      To fix this issue, we introduce a new ref_add_list in struct
      btrfs_delayed_ref_head; in select_delayed_ref(), if this list is not
      empty, we can directly use its nodes. With this patch, it took only
      about 10~15 seconds to delete the same file. perf top now reports:
      
      PerfTop:    2734 irqs/sec  kernel:99.5%  exact:  0.0% [4000Hz cpu-clock],  (all, 4 CPUs)
      ----------------------------------------------------------------------------------------
      
          20.74%  [kernel]          [k] _raw_spin_unlock_irqrestore
          16.33%  [kernel]          [k] __slab_alloc
           5.41%  [kernel]          [k] lock_acquired
           4.42%  [kernel]          [k] lock_acquire
           4.05%  [kernel]          [k] lock_release
           3.37%  [kernel]          [k] _raw_spin_unlock_irq
      
      This patch also helps for normal files: at least we no longer need
      to iterate the whole list to find BTRFS_ADD_DELAYED_REF nodes (see
      the sketch below).
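
      A sketch of the fast path this adds; the list and member names
      follow the commit description, while the surrounding function is
      abridged:

        static struct btrfs_delayed_ref_node *
        select_delayed_ref(struct btrfs_delayed_ref_head *head)
        {
        	/*
        	 * ADD refs are kept on a dedicated list as they are
        	 * inserted, so we no longer scan every node (mostly DROPs
        	 * here) to find one.
        	 */
        	if (!list_empty(&head->ref_add_list))
        		return list_first_entry(&head->ref_add_list,
        					struct btrfs_delayed_ref_node,
        					add_list);

        	/* Fall back to scanning the full node list (omitted). */
        	return NULL;
        }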
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  17. 19 Nov, 2016 1 commit
  18. 03 Aug, 2016 1 commit
  19. 25 May, 2016 1 commit
  20. 07 Jan, 2016 1 commit
    • btrfs: better packing of btrfs_delayed_extent_op · 35b3ad50
      David Sterba authored
      btrfs_delayed_extent_op can be packed in a better way: it's 40 bytes
      now, with 8 unused bytes. Reducing the level type to u8 makes it
      possible to squeeze it into the padding byte after the key. The
      bitfields were switched to bool, as there's space to store the full
      bytes without increasing the whole structure; besides that, the
      generated assembly is smaller.
      
      struct btrfs_delayed_extent_op {
      	struct btrfs_disk_key      key;                  /*     0    17 */
      	u8                         level;                /*    17     1 */
      	bool                       update_key;           /*    18     1 */
      	bool                       update_flags;         /*    19     1 */
      	bool                       is_data;              /*    20     1 */
      
      	/* XXX 3 bytes hole, try to pack */
      
      	u64                        flags_to_set;         /*    24     8 */
      
      	/* size: 32, cachelines: 1, members: 6 */
      	/* sum members: 29, holes: 1, sum holes: 3 */
      	/* last cacheline: 32 bytes */
      };
      
      The final size is 32 bytes, which gives +26 objects per 4KiB slab
      page (4096 / 32 = 128 versus 4096 / 40 = 102).
      
         text	   data	    bss	    dec	    hex	filename
       938811	  43670	  23144	1005625	  f5839	fs/btrfs/btrfs.ko.before
       938747	  43670	  23144	1005561	  f57f9	fs/btrfs/btrfs.ko.after
      Signed-off-by: David Sterba <dsterba@suse.com>
  21. 27 Oct, 2015 1 commit
  22. 25 Oct, 2015 1 commit
    • Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Filipe Manana authored
      In the kernel 4.2 merge window we had big changes to the
      implementation of delayed references and qgroups, which left the
      no_quota field of delayed references unused. More specifically, the
      no_quota field has not been used as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field in place actually prevents delayed
      references from getting merged, which in turn causes the following
      BUG_ON(), at fs/btrfs/extent-tree.c, to be hit when qgroups are
      enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
     BUG_ON because Ref1->ref_mod != 1 (it equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes a deadlock issue when using the clone ioctl with
      qgroups enabled, as reported by Elias Probst on the mailing list. The
      deadlock happens because, after calling btrfs_insert_empty_item, our
      path holds a write lock on a leaf of the fs/subvol tree, and before
      releasing the path we called check_ref(), which did backref walking
      (when qgroups are enabled) and tried to read-lock the same leaf. The
      trace for this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eliminating check_ref(), which is no longer
      needed, as its purpose was to compute a value for the no_quota field
      of a delayed reference (this patch removes the no_quota field, as
      mentioned earlier).
      Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: Elias Probst <mail@eliasprobst.eu>
      Reported-by: Peter Becker <floyd.net@gmail.com>
      Reported-by: Malte Schröder <malte@tnxip.de>
      Reported-by: Derek Dongray <derek@valedon.co.uk>
      Reported-by: Erkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
  23. 22 Oct, 2015 1 commit
  24. 10 Jun, 2015 3 commits
  25. 10 Apr, 2015 1 commit
    • Btrfs: account for crcs in delayed ref processing · 1262133b
      Josef Bacik authored
      As we delete large extents, we end up doing huge amounts of COW in
      order to delete the corresponding crcs. This adds accounting so that
      we keep track of that space, and flushing of delayed refs so that we
      don't build up too much delayed crc work (see the estimate below).
      
      This helps limit the delayed work that must be done at commit time and
      tries to avoid ENOSPC aborts because the crcs eat all the global
      reserves.
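
      As a rough illustration of the amount of checksum work a deletion
      implies, the csum bytes pinned by a data extent can be estimated as
      follows (hypothetical helper, not from the patch; assumes a 4K
      sector size and 4-byte crc32c checksums):

        /* Roughly how many bytes of csum items a data extent covers. */
        static u64 csum_bytes_for_extent(u64 num_bytes)
        {
        	const u64 sectorsize = 4096;	/* assumed */
        	const u64 csum_size = 4;	/* crc32c */

        	return (num_bytes / sectorsize) * csum_size;
        }

      Deleting a 1G extent thus touches about 1M of csum items, all of
      which has to be COWed out of the csum tree.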
      Signed-off-by: Chris Mason <clm@fb.com>
  26. 10 Jun, 2014 1 commit
    • Btrfs: rework qgroup accounting · fcebe456
      Josef Bacik authored
      Currently qgroups account for space by intercepting delayed ref
      updates to fs trees.  This is done by adding sequence numbers to
      delayed ref updates so that we can figure out how the tree looked
      before the update and adjust the counters properly.  The problem is
      that this does not allow delayed refs to be merged, so if you are,
      say, defragging an extent with 5k snapshots pointing to it, we will
      thrash the delayed ref lock because we have to go back and manually
      merge these things together.  Instead we want to process quota
      changes when we know they are going to happen, like when we first
      allocate an extent, free a reference, or add new references.  This
      patch accomplishes that by only adding qgroup operations for real ref
      changes.  We only modify the sequence number when we need to look up
      roots for bytenrs; this reduces the churn on the sequence number and
      allows us to merge delayed refs as we add them most of the time.
      This patch encompasses a bunch of architectural changes:
      
      1) qgroup ref operations: instead of tracking qgroup operations
      through the delayed refs, we simply add new ref operations whenever
      we modify the refs themselves.
      
      2) tree mod seq: we no longer have the separation of major/minor
      counters.  This makes the sequence number handling much saner and
      lets us remove some locking that was needed to protect the counter.
      
      3) delayed ref seq: we now read the tree mod seq number and use that
      as our sequence.  This means each new delayed ref doesn't have its
      own unique sequence number; rather, whenever we go to look up
      backrefs we increment the sequence number, so we can make sure to
      keep any new operations from screwing up our world view at that given
      point.  This allows us to merge delayed refs at runtime.
      
      With all of these changes the delayed ref stuff is a little saner and the qgroup
      accounting stuff no longer goes negative in some cases like it was before.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  27. 28 Jan, 2014 2 commits
    • Btrfs: attach delayed ref updates to delayed ref heads · d7df2c79
      Josef Bacik authored
      Currently we have two rb-trees: one for delayed ref heads, and one
      for all of the delayed refs, including the delayed ref heads.  When
      we process the delayed refs we have to hold the delayed ref lock for
      all of the selecting and merging and such, which results in quite a
      bit of lock contention.  This was solved by having a waitqueue and
      only one flusher at a time; however, that hurts when we get a lot of
      delayed refs queued up.
      
      So instead just have an rb tree for the delayed ref heads, and then
      attach the delayed ref updates to an rb tree that is per delayed ref
      head (see the sketch below).  Then we only need to take the delayed
      ref lock when adding new delayed refs and when selecting a delayed
      ref head to process; all the rest of the time we deal with a per-head
      lock, which is much less contended.
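
      A condensed sketch of the resulting two-level structure (members
      abridged; names illustrative):

        struct btrfs_delayed_ref_head {
        	struct rb_node href_node;	/* node in the tree of heads */
        	spinlock_t lock;		/* per-head, low contention */
        	struct rb_root ref_root;	/* this head's pending updates */
        };

        struct btrfs_delayed_ref_root {
        	spinlock_t lock;		/* insert + head selection only */
        	struct rb_root href_root;	/* delayed ref heads only */
        };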
      
      The locking rules for this get a little more complicated since we have to lock
      up to 3 things to properly process delayed refs, but I will address that problem
      later.  For now this passes all of xfstests and my overnight stress tests.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: introduce a head ref rbtree · c46effa6
      Liu Bo authored
      The way we process delayed refs is:
      1) get a bunch of head refs,
      2) pick up one head ref,
      3) go one node back for any delayed ref updates.
      
      The head ref is also linked in the same rbtree as the delayed refs,
      so in stage 1) we have to walk the entries one by one, including not
      only head refs but also delayed refs.
      
      When we have a great number of delayed refs pending to process, this
      costs a lot of time.
      
      Here we introduce a head-ref-specific rbtree: it only holds head
      refs, so those troubles go away (see the sketch below).
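
      A minimal sketch of the addition (abridged; the pre-existing shared
      tree stays for the ref updates themselves):

        struct btrfs_delayed_ref_root {
        	/* All delayed refs, head refs included (pre-existing). */
        	struct rb_root root;
        	/* New: head refs only, so stage 1) walks just the heads. */
        	struct rb_root href_root;
        };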
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  28. 18 May, 2013 1 commit
    • Btrfs: handle running extent ops with skinny metadata · b1c79e09
      Josef Bacik authored
      Chris hit a bug where we weren't finding extent records when running
      extent ops.  This is because we use the delayed_ref_head when running
      the extent op, which means we can't use the ->type checks to see if
      we are metadata.  We also lose the level of the metadata we are
      working on.  So to fix this we can just check the ->is_data field of
      the extent_op, and we can store the level of the buffer we were
      modifying in the extent_op (see the sketch below).  Thanks,
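
      A sketch of the fix as described: the head plus the extent op now
      carry enough to build the lookup key on their own.  The helper name
      is hypothetical, and the key logic is my reading of how skinny
      metadata keys work (the level rides in the key offset):

        /* Hypothetical helper: derive the extent tree search key. */
        static void extent_op_to_key(const struct btrfs_delayed_ref_head *head,
        			     const struct btrfs_delayed_extent_op *op,
        			     struct btrfs_key *key)
        {
        	key->objectid = head->node.bytenr;
        	if (!op->is_data) {
        		/* Skinny metadata: the offset carries the level. */
        		key->type = BTRFS_METADATA_ITEM_KEY;
        		key->offset = op->level;
        	} else {
        		key->type = BTRFS_EXTENT_ITEM_KEY;
        		key->offset = head->node.num_bytes;
        	}
        }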
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  29. 20 Feb, 2013 1 commit