1. 01 Jun, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: make xattr replace operations atomic · 72139bf0
      Filipe Manana authored
      commit 5f5bc6b1
      
       upstream.
      
      Replacing a xattr consists of doing a lookup for its existing value, delete
      the current value from the respective leaf, release the search path and then
      finally insert the new value. This leaves a time window where readers (getxattr,
      listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs,
      so this has security implications.
      
      This change also fixes 2 other existing issues which were:
      
      *) Deleting the old xattr value without verifying first if the new xattr will
         fit in the existing leaf item (in case multiple xattrs are packed in the
         same item due to name hash collision);
      
      *) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
         exist but we have have an existing item that packs muliple xattrs with
         the same name hash as the input xattr. In this case we should return ENOSPC.
      
      A test case for xfstests follows soon.
      
      Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
      implementation.
      Reported-by: default avatarAlexandre Oliva <oliva@gnu.org>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      [ luis: backported to 3.13 ]
      Reference: CVE-2014-9710
      BugLink: https://bugs.launchpad.net/bugs/1438501
      
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      72139bf0
  2. 24 Mar, 2014 2 commits
    • Filipe David Borba Manana's avatar
      Btrfs: fix tree mod logging · fc2345d5
      Filipe David Borba Manana authored
      commit 5de865ee upstream.
      
      While running the test btrfs/004 from xfstests in a loop, it failed
      about 1 time out of 20 runs in my desktop. The failure happened in
      the backref walking part of the test, and the test's error message was
      like this:
      
      #  btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad)
      #      --- tests/btrfs/004.out	2013-11-26 18:25:29.263333714 +0000
      #      +++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad	2013-12-10 15:25:10.327518516 +0000
      #      @@ -1,3 +1,8 @@
      #       QA output created by 004
      #       *** test backref walking
      #      -*** done
      #      +unexpected output from
      #      +	/home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
      #      +expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got:
      #      +
             ...
             (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff)
        Ran: btrfs/004
        Failures: btrfs/004
        Failed 1 of 1 tests
      
      But immediately after the test finished, the btrfs inspect-internal command
      returned the expected output:
      
        $ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
        inode 405 offset 454656 root 258
        inode 405 offset 454656 root 5
      
      It turned out this was because the btrfs_search_old_slot() calls performed
      during backref walking (backref.c:__resolve_indirect_ref) were not finding
      anything. The reason for this turned out to be that the tree mod logging
      code was not logging some node multi-step operations atomically, therefore
      btrfs_search_old_slot() callers iterated often over an incomplete tree that
      wasn't fully consistent with any tree state from the past. Besides missing
      items, this often (but not always) resulted in -EIO errors during old slot
      searches, reported in dmesg like this:
      
      [ 4299.933936] ------------[ cut here ]------------
      [ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]()
      [ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h
      [ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
      [ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
      [ 4299.933979]  000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007
      [ 4299.933982]  0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70
      [ 4299.933984]  ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000
      [ 4299.933987] Call Trace:
      [ 4299.933991]  [<ffffffff8176d284>] dump_stack+0x55/0x76
      [ 4299.933994]  [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0
      [ 4299.933997]  [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20
      [ 4299.934003]  [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs]
      [ 4299.934005]  [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50
      [ 4299.934010]  [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs]
      [ 4299.934019]  [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs]
      [ 4299.934027]  [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs]
      [ 4299.934034]  [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs]
      [ 4299.934042]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
      [ 4299.934048]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
      [ 4299.934056]  [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs]
      [ 4299.934058]  [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
      [ 4299.934065]  [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
      [ 4299.934071]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
      [ 4299.934078]  [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs]
      [ 4299.934080]  [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00
      [ 4299.934083]  [<ffffffff81075563>] ? up_read+0x23/0x40
      [ 4299.934085]  [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
      [ 4299.934088]  [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
      [ 4299.934090]  [<ffffffff81776e23>] ? error_sti+0x5/0x6
      [ 4299.934093]  [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
      [ 4299.934096]  [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
      [ 4299.934098]  [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
      [ 4299.934100]  [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [ 4299.934102]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
      [ 4299.934102]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
      [ 4299.934104] ---[ end trace 48f0cfc902491414 ]---
      [ 4299.934378] btrfs bad fsid on block 0
      
      These tree mod log operations that must be performed atomically, tree_mod_log_free_eb,
      tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to
      be performed atomically before the following commit:
      
        c8cc6341
        (Btrfs: stop using GFP_ATOMIC for the tree mod log allocations)
      
      That change removed the atomicity of such operations. This patch restores the
      atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem
      structures, so it has to do the allocations using GFP_NOFS before acquiring
      the mod log lock.
      
      This issue has been experienced by several users recently, such as for example:
      
        http://www.spinics.net/lists/linux-btrfs/msg28574.html
      
      
      
      After running the btrfs/004 test for 679 consecutive iterations with this
      patch applied, I didn't ran into the issue anymore.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fc2345d5
    • Filipe David Borba Manana's avatar
      Btrfs: return immediately if tree log mod is not necessary · 6ad309b5
      Filipe David Borba Manana authored
      commit 78357766
      
       upstream.
      
      In ctree.c:tree_mod_log_set_node_key() we were calling
      __tree_mod_log_insert_key() even when the modification doesn't need
      to be logged. This would allocate a tree_mod_elem structure, fill it
      and pass it to  __tree_mod_log_insert(), which would just acquire
      the tree mod log write lock and then free the tree_mod_elem structure
      and return (that is, a no-op).
      
      Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert()
      which just returns immediately if the modification doesn't need to be
      logged (without allocating the structure, fill it, acquire write lock,
      free structure).
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6ad309b5
  3. 12 Nov, 2013 8 commits
  4. 21 Sep, 2013 1 commit
  5. 01 Sep, 2013 9 commits
    • Filipe David Borba Manana's avatar
      Btrfs: optimize key searches in btrfs_search_slot · d7396f07
      Filipe David Borba Manana authored
      
      When the binary search returns 0 (exact match), the target key
      will necessarily be at slot 0 of all nodes below the current one,
      so in this case the binary search is not needed because it will
      always return 0, and we waste time doing it, holding node locks
      for longer than necessary, etc.
      
      Below follow histograms with the times spent on the current approach of
      doing a binary search when the previous binary search returned 0, and
      times for the new approach, which directly picks the first item/child
      node in the leaf/node.
      
      Current approach:
      
      Count: 6682
      Range: 35.000 - 8370.000; Mean: 85.837; Median: 75.000; Stddev: 106.429
      Percentiles:  90th: 124.000; 95th: 145.000; 99th: 206.000
        35.000 -   61.080:  1235 ################
        61.080 -  106.053:  4207 #####################################################
       106.053 -  183.606:  1122 ##############
       183.606 -  317.341:   111 #
       317.341 -  547.959:     6 |
       547.959 - 8370.000:     1 |
      
      Approach proposed by this patch:
      
      Count: 6682
      Range:  6.000 - 135.000; Mean: 16.690; Median: 16.000; Stddev:  7.160
      Percentiles:  90th: 23.000; 95th: 27.000; 99th: 40.000
         6.000 -    8.418:    58 #
         8.418 -   11.670:  1149 #########################
        11.670 -   16.046:  2418 #####################################################
        16.046 -   21.934:  2098 ##############################################
        21.934 -   29.854:   744 ################
        29.854 -   40.511:   154 ###
        40.511 -   54.848:    41 #
        54.848 -   74.136:     5 |
        74.136 -  100.087:     9 |
       100.087 -  135.000:     6 |
      
      These samples were captured during a run of the btrfs tests 001, 002 and
      004 in the xfstests, with a leaf/node size of 4Kb.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      d7396f07
    • Geert Uytterhoeven's avatar
      Btrfs: Make btrfs_header_chunk_tree_uuid() return unsigned long · b308bc2f
      Geert Uytterhoeven authored
      
      Internally, btrfs_header_chunk_tree_uuid() calculates an unsigned long, but
      casts it to a pointer, while all callers cast it to unsigned long again.
      Signed-off-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      b308bc2f
    • Geert Uytterhoeven's avatar
      Btrfs: Make btrfs_header_fsid() return unsigned long · fba6aa75
      Geert Uytterhoeven authored
      
      Internally, btrfs_header_fsid() calculates an unsigned long, but casts
      it to a pointer, while all callers cast it to unsigned long again.
      Signed-off-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      fba6aa75
    • Geert Uytterhoeven's avatar
      Btrfs: Remove superfluous casts from u64 to unsigned long long · c1c9ff7c
      Geert Uytterhoeven authored
      
      u64 is "unsigned long long" on all architectures now, so there's no need to
      cast it when formatting it using the "ll" length modifier.
      Signed-off-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      c1c9ff7c
    • Stefan Behrens's avatar
      Btrfs: get rid of sparse warnings · 35a3621b
      Stefan Behrens authored
      
      make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__
      
      I tried to filter out the warnings for which patches have already
      been sent to the mailing list, pending for inclusion in btrfs-next.
      
      All these changes should be obviously safe.
      Signed-off-by: default avatarStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      35a3621b
    • Josef Bacik's avatar
      Btrfs: fix send issues related to inode number reuse · ba5e8f2e
      Josef Bacik authored
      
      If you are sending a snapshot and specifying a parent snapshot we will walk the
      trees and figure out where they differ and send the differences only.  The way
      we check for differences are if the leaves aren't the same and if the keys are
      not the same within the leaves.  So if neither leaf is the same (ie the leaf has
      been cow'ed from the parent snapshot) we walk each item in the send root and
      check it against the parent root.  If the items match exactly then we don't do
      anything.  This doesn't quite work for inode refs, since they will just have the
      name and the parent objectid.  If you move the file from a directory and then
      remove that directory and re-create a directory with the same inode number as
      the old directory and then move that file back into that directory we will
      assume that nothing changed and you will get errors when you try to receive.
      
      In order to fix this we need to do extra checking to see if the inode ref really
      is the same or not.  So do this by passing down BTRFS_COMPARE_TREE_SAME if the
      items match.  Then if the key type is an inode ref we can do some extra
      checking, otherwise we just keep processing.  The extra checking is to look up
      the generation of the directory in the parent volume and compare it to the
      generation of the send volume.  If they match then they are the same directory
      and we are good to go.  If they don't we have to add them to the changed refs
      list.
      
      This means we have to track the generation of the ref we're trying to lookup
      when we iterate all the refs for a particular inode.  So in the case of looking
      for new refs we have to get the generation from the parent volume, and in the
      case of looking for deleted refs we have to get the generation from the send
      volume to compare with.
      
      There was also the issue of using a ulist to keep track of the directories we
      needed to check.  Because we can get a deleted ref and a new ref for the same
      inode number the ulist won't work since it indexes based on the value.  So
      instead just dup any directory ref we find and add it to a local list, and then
      process that list as normal and do away with using a ulist for this altogether.
      
      Before we would fail all of the tests in the far-progs that related to moving
      directories (test group 32).  With this patch we now pass these tests, and all
      of the tests in the far-progs send testing suite.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      ba5e8f2e
    • Josef Bacik's avatar
      Btrfs: stop using GFP_ATOMIC when allocating rewind ebs · 9ec72677
      Josef Bacik authored
      
      There is no reason we can't just set the path to blocking and then do normal
      GFP_NOFS allocations for these extent buffers.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      9ec72677
    • Josef Bacik's avatar
      Btrfs: deal with enomem in the rewind path · db7f3436
      Josef Bacik authored
      
      We can get ENOMEM trying to allocate dummy bufs for the rewind operation of the
      tree mod log.  Instead of BUG_ON()'ing in this case pass up ENOMEM.  I looked
      back through the callers and I'm pretty sure I got everybody who did BUG_ON(ret)
      in this path.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      db7f3436
    • Josef Bacik's avatar
      Btrfs: stop using GFP_ATOMIC for the tree mod log allocations · c8cc6341
      Josef Bacik authored
      
      Previously we held the tree mod lock when adding stuff because we use it to
      check and see if we truly do want to track tree modifications.  This is
      admirable, but GFP_ATOMIC in a critical area that is going to get hit pretty
      hard and often is not nice.  So instead do our basic checks to see if we don't
      need to track modifications, and if those pass then do our allocation, and then
      when we go to insert the new modification check if we still care, and if we
      don't just free up our mod and return.  Otherwise we're good to go and we can
      carry on.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      c8cc6341
  6. 09 Aug, 2013 1 commit
  7. 02 Jul, 2013 2 commits
    • Josef Bacik's avatar
      Btrfs: only do the tree_mod_log_free_eb if this is our last ref · 7fb7d76f
      Josef Bacik authored
      
      There is another bug in the tree mod log stuff in that we're calling
      tree_mod_log_free_eb every single time a block is cow'ed.  The problem with this
      is that if this block is shared by multiple snapshots we will call this multiple
      times per block, so if we go to rewind the mod log for this block we'll BUG_ON()
      in __tree_mod_log_rewind because we try to rewind a free twice.  We only want to
      call tree_mod_log_free_eb if we are actually freeing the block.  With this patch
      I no longer hit the panic in __tree_mod_log_rewind.  Thanks,
      
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarJan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      7fb7d76f
    • Josef Bacik's avatar
      Btrfs: hold the tree mod lock in __tree_mod_log_rewind · f1ca7e98
      Josef Bacik authored
      
      We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
      forward in the tree mod entries, otherwise we'll end up with random entries and
      trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the panics
      people were seeing when running
      
      find /whatever -type f -exec btrfs fi defrag {} \;
      
      Thansk,
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      f1ca7e98
  8. 01 Jul, 2013 2 commits
    • Josef Bacik's avatar
      Btrfs: optimize reada_for_balance · 0b08851f
      Josef Bacik authored
      
      This patch does two things.  First we no longer explicitly read in the blocks
      we're trying to readahead.  For things like balance_level we may never actually
      use the blocks so this just adds uneeded latency, and balance_level and
      split_node will both read in the blocks they care about explicitly so if the
      blocks need to be waited on it will be done there.  Secondly we no longer drop
      the path if we do readahead, we just set the path blocking before we call
      reada_for_balance() and then we're good to go.  Hopefully this will cut down on
      the number of re-searches.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      0b08851f
    • Josef Bacik's avatar
      Btrfs: optimize read_block_for_search · bdf7c00e
      Josef Bacik authored
      
      This patch does two things, first it only does one call to
      btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then
      again with gen specified.  The other thing is to call btrfs_read_buffer() on the
      buffer we've found instead of dropping it and then calling read_tree_block().
      This will keep us from doing yet another radix tree lookup for a buffer we've
      already found.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      bdf7c00e
  9. 14 Jun, 2013 3 commits
  10. 28 May, 2013 1 commit
  11. 18 May, 2013 1 commit
    • Josef Bacik's avatar
      Btrfs: handle running extent ops with skinny metadata · b1c79e09
      Josef Bacik authored
      
      Chris hit a bug where we weren't finding extent records when running extent ops.
      This is because we use the delayed_ref_head when running the extent op, which
      means we can't use the ->type checks to see if we are metadata.  We also lose
      the level of the metadata we are working on.  So to fix this we can just check
      the ->is_data section of the extent_op, and we can store the level of the buffer
      we were modifying in the extent_op.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      b1c79e09
  12. 06 May, 2013 9 commits