1. 08 Jan, 2020 4 commits
    • Johannes Thumshirn's avatar
      btrfs: fix memory leak in qgroup accounting · 26ef8493
      Johannes Thumshirn authored
      When running xfstests on the current btrfs I get the following splat from
      kmemleak:
      
      unreferenced object 0xffff88821b2404e0 (size 32):
        comm "kworker/u4:7", pid 26663, jiffies 4295283698 (age 8.776s)
        hex dump (first 32 bytes):
          01 00 00 00 00 00 00 00 10 ff fd 26 82 88 ff ff  ...........&....
          10 ff fd 26 82 88 ff ff 20 ff fd 26 82 88 ff ff  ...&.... ..&....
        backtrace:
          [<00000000f94fd43f>] ulist_alloc+0x25/0x60 [btrfs]
          [<00000000fd023d99>] btrfs_find_all_roots_safe+0x41/0x100 [btrfs]
          [<000000008f17bd32>] btrfs_find_all_roots+0x52/0x70 [btrfs]
          [<00000000b7660afb>] btrfs_qgroup_rescan_worker+0x343/0x680 [btrfs]
          [<0000000058e66778>] btrfs_work_helper+0xac/0x1e0 [btrfs]
          [<00000000f0188930>] process_one_work+0x1cf/0x350
          [<00000000af5f2f8e>] worker_thread+0x28/0x3c0
          [<00000000b55a1add>] kthread+0x109/0x120
          [<00000000f88cbd17>] ret_from_fork+0x35/0x40
      
      This corresponds to:
      
        (gdb) l *(btrfs_find_all_roots_safe+0x41)
        0x8d7e1 is in btrfs_find_all_roots_safe (fs/btrfs/backref.c:1413).
        1408
        1409            tmp = ulist_alloc(GFP_NOFS);
        1410            if (!tmp)
        1411                    return -ENOMEM;
        1412            *roots = ulist_alloc(GFP_NOFS);
        1413            if (!*roots) {
        1414                    ulist_free(tmp);
        1415                    return -ENOMEM;
        1416            }
        1417
      
      Following the lifetime of the allocated 'roots' ulist, it gets freed
      again in btrfs_qgroup_account_extent().
      
      But this does not happen if the function is called with the
      'BTRFS_FS_QUOTA_ENABLED' flag cleared, then btrfs_qgroup_account_extent()
      does a short leave and directly returns.
      
      Instead of directly returning we should jump to the 'out_free' in order to
      free all resources as expected.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      [ add comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      26ef8493
    • Josef Bacik's avatar
      btrfs: do not delete mismatched root refs · 423a716c
      Josef Bacik authored
      btrfs_del_root_ref() will simply WARN_ON() if the ref doesn't match in
      any way, and then continue to delete the reference.  This shouldn't
      happen, we have these values because there's more to the reference than
      the original root and the sub root.  If any of these checks fail, return
      -ENOENT.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      423a716c
    • Josef Bacik's avatar
      btrfs: fix invalid removal of root ref · d49d3287
      Josef Bacik authored
      If we have the following sequence of events
      
        btrfs sub create A
        btrfs sub create A/B
        btrfs sub snap A C
        mkdir C/foo
        mv A/B C/foo
        rm -rf *
      
      We will end up with a transaction abort.
      
      The reason for this is because we create a root ref for B pointing to A.
      When we create a snapshot of C we still have B in our tree, but because
      the root ref points to A and not C we will make it appear to be empty.
      
      The problem happens when we move B into C.  This removes the root ref
      for B pointing to A and adds a ref of B pointing to C.  When we rmdir C
      we'll see that we have a ref to our root and remove the root ref,
      despite not actually matching our reference name.
      
      Now btrfs_del_root_ref() allowing this to work is a bug as well, however
      we know that this inode does not actually point to a root ref in the
      first place, so we shouldn't be calling btrfs_del_root_ref() in the
      first place and instead simply look up our dir index for this item and
      do the rest of the removal.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d49d3287
    • Josef Bacik's avatar
      btrfs: rework arguments of btrfs_unlink_subvol · 045d3967
      Josef Bacik authored
      btrfs_unlink_subvol takes the name of the dentry and the root objectid
      based on what kind of inode this is, either a real subvolume link or a
      empty one that we inherited as a snapshot.  We need to fix how we unlink
      in the case for BTRFS_EMPTY_SUBVOL_DIR_OBJECTID in the future, so rework
      btrfs_unlink_subvol to just take the dentry and handle getting the right
      objectid given the type of inode this is.  There is no functional change
      here, simply pushing the work into btrfs_unlink_subvol() proper.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      045d3967
  2. 30 Dec, 2019 3 commits
    • Filipe Manana's avatar
      Btrfs: fix infinite loop during nocow writeback due to race · de7999af
      Filipe Manana authored
      When starting writeback for a range that covers part of a preallocated
      extent, due to a race with writeback for another range that also covers
      another part of the same preallocated extent, we can end up in an infinite
      loop.
      
      Consider the following example where for inode 280 we have two dirty
      ranges:
      
        range A, from 294912 to 303103, 8192 bytes
        range B, from 348160 to 438271, 90112 bytes
      
      and we have the following file extent item layout for our inode:
      
        leaf 38895616 gen 24544 total ptrs 29 free space 13820 owner 5
            (...)
            item 27 key (280 108 200704) itemoff 14598 itemsize 53
                extent data disk bytenr 0 nr 0 type 1 (regular)
                extent data offset 0 nr 94208 ram 94208
            item 28 key (280 108 294912) itemoff 14545 itemsize 53
                extent data disk bytenr 10433052672 nr 81920 type 2 (prealloc)
                extent data offset 0 nr 81920 ram 81920
      
      Then the following happens:
      
      1) Writeback starts for range B (from 348160 to 438271), execution of
         run_delalloc_nocow() starts;
      
      2) The first iteration of run_delalloc_nocow()'s whil loop leaves us at
         the extent item at slot 28, pointing to the prealloc extent item
         covering the range from 294912 to 376831. This extent covers part of
         our range;
      
      3) An ordered extent is created against that extent, covering the file
         range from 348160 to 376831 (28672 bytes);
      
      4) We adjust 'cur_offset' to 376832 and move on to the next iteration of
         the while loop;
      
      5) The call to btrfs_lookup_file_extent() leaves us at the same leaf,
         pointing to slot 29, 1 slot after the last item (the extent item
         we processed in the previous iteration);
      
      6) Because we are a slot beyond the last item, we call btrfs_next_leaf(),
         which releases the search path before doing a another search for the
         last key of the leaf (280 108 294912);
      
      7) Right after btrfs_next_leaf() released the path, and before it did
         another search for the last key of the leaf, writeback for the range
         A (from 294912 to 303103) completes (it was previously started at
         some point);
      
      8) Upon completion of the ordered extent for range A, the prealloc extent
         we previously found got split into two extent items, one covering the
         range from 294912 to 303103 (8192 bytes), with a type of regular extent
         (and no longer prealloc) and another covering the range from 303104 to
         376831 (73728 bytes), with a type of prealloc and an offset of 8192
         bytes. So our leaf now has the following layout:
      
           leaf 38895616 gen 24544 total ptrs 31 free space 13664 owner 5
               (...)
               item 27 key (280 108 200704) itemoff 14598 itemsize 53
                   extent data disk bytenr 0 nr 0 type 1
                   extent data offset 0 nr 8192 ram 94208
               item 28 key (280 108 208896) itemoff 14545 itemsize 53
                   extent data disk bytenr 10433142784 nr 86016 type 1
                   extent data offset 0 nr 86016 ram 86016
               item 29 key (280 108 294912) itemoff 14492 itemsize 53
                   extent data disk bytenr 10433052672 nr 81920 type 1
                   extent data offset 0 nr 8192 ram 81920
               item 30 key (280 108 303104) itemoff 14439 itemsize 53
                   extent data disk bytenr 10433052672 nr 81920 type 2
                   extent data offset 8192 nr 73728 ram 81920
      
      9) After btrfs_next_leaf() returns, we have our path pointing to that same
         leaf and at slot 30, since it has a key we didn't have before and it's
         the first key greater then the key that was previously the last key of
         the leaf (key (280 108 294912));
      
      10) The extent item at slot 30 covers the range from 303104 to 376831
          which is in our target range, so we process it, despite having already
          created an ordered extent against this extent for the file range from
          348160 to 376831. This is because we skip to the next extent item only
          if its end is less than or equals to the start of our delalloc range,
          and not less than or equals to the current offset ('cur_offset');
      
      11) As a result we compute 'num_bytes' as:
      
          num_bytes = min(end + 1, extent_end) - cur_offset;
                    = min(438271 + 1, 376832) - 376832 = 0
      
      12) We then call create_io_em() for a 0 bytes range starting at offset
          376832;
      
      13) Then create_io_em() enters an infinite loop because its calls to
          btrfs_drop_extent_cache() do nothing due to the 0 length range
          passed to it. So no existing extent maps that cover the offset
          376832 get removed, and therefore calls to add_extent_mapping()
          return -EEXIST, resulting in an infinite loop. This loop from
          create_io_em() is the following:
      
          do {
              btrfs_drop_extent_cache(BTRFS_I(inode), em->start,
                                      em->start + em->len - 1, 0);
              write_lock(&em_tree->lock);
              ret = add_extent_mapping(em_tree, em, 1);
              write_unlock(&em_tree->lock);
              /*
               * The caller has taken lock_extent(), who could race with us
               * to add em?
               */
          } while (ret == -EEXIST);
      
      Also, each call to btrfs_drop_extent_cache() triggers a warning because
      the start offset passed to it (376832) is smaller then the end offset
      (376832 - 1) passed to it by -1, due to the 0 length:
      
        [258532.052621] ------------[ cut here ]------------
        [258532.052643] WARNING: CPU: 0 PID: 9987 at fs/btrfs/file.c:602 btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
        (...)
        [258532.052672] CPU: 0 PID: 9987 Comm: fsx Tainted: G        W         5.4.0-rc7-btrfs-next-64 #1
        [258532.052673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [258532.052691] RIP: 0010:btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
        (...)
        [258532.052695] RSP: 0018:ffffb4be0153f860 EFLAGS: 00010287
        [258532.052700] RAX: ffff975b445ee360 RBX: ffff975b44eb3e08 RCX: 0000000000000000
        [258532.052700] RDX: 0000000000038fff RSI: 0000000000039000 RDI: ffff975b445ee308
        [258532.052700] RBP: 0000000000038fff R08: 0000000000000000 R09: 0000000000000001
        [258532.052701] R10: ffff975b513c5c10 R11: 00000000e3c0cfa9 R12: 0000000000039000
        [258532.052703] R13: ffff975b445ee360 R14: 00000000ffffffef R15: ffff975b445ee308
        [258532.052705] FS:  00007f86a821de80(0000) GS:ffff975b76a00000(0000) knlGS:0000000000000000
        [258532.052707] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [258532.052708] CR2: 00007fdacf0f3ab4 CR3: 00000001f9d26002 CR4: 00000000003606f0
        [258532.052712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [258532.052717] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [258532.052717] Call Trace:
        [258532.052718]  ? preempt_schedule_common+0x32/0x70
        [258532.052722]  ? ___preempt_schedule+0x16/0x20
        [258532.052741]  create_io_em+0xff/0x180 [btrfs]
        [258532.052767]  run_delalloc_nocow+0x942/0xb10 [btrfs]
        [258532.052791]  btrfs_run_delalloc_range+0x30b/0x520 [btrfs]
        [258532.052812]  ? find_lock_delalloc_range+0x221/0x250 [btrfs]
        [258532.052834]  writepage_delalloc+0xe4/0x140 [btrfs]
        [258532.052855]  __extent_writepage+0x110/0x4e0 [btrfs]
        [258532.052876]  extent_write_cache_pages+0x21c/0x480 [btrfs]
        [258532.052906]  extent_writepages+0x52/0xb0 [btrfs]
        [258532.052911]  do_writepages+0x23/0x80
        [258532.052915]  __filemap_fdatawrite_range+0xd2/0x110
        [258532.052938]  btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
        [258532.052954]  start_ordered_ops+0x57/0xa0 [btrfs]
        [258532.052973]  ? btrfs_sync_file+0x225/0x490 [btrfs]
        [258532.052988]  btrfs_sync_file+0x225/0x490 [btrfs]
        [258532.052997]  __x64_sys_msync+0x199/0x200
        [258532.053004]  do_syscall_64+0x5c/0x250
        [258532.053007]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [258532.053010] RIP: 0033:0x7f86a7dfd760
        (...)
        [258532.053014] RSP: 002b:00007ffd99af0368 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
        [258532.053016] RAX: ffffffffffffffda RBX: 0000000000000ec9 RCX: 00007f86a7dfd760
        [258532.053017] RDX: 0000000000000004 RSI: 000000000000836c RDI: 00007f86a8221000
        [258532.053019] RBP: 0000000000021ec9 R08: 0000000000000003 R09: 00007f86a812037c
        [258532.053020] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000000074a3
        [258532.053021] R13: 00007f86a8221000 R14: 000000000000836c R15: 0000000000000001
        [258532.053032] irq event stamp: 1653450494
        [258532.053035] hardirqs last  enabled at (1653450493): [<ffffffff9dec69f9>] _raw_spin_unlock_irq+0x29/0x50
        [258532.053037] hardirqs last disabled at (1653450494): [<ffffffff9d4048ea>] trace_hardirqs_off_thunk+0x1a/0x20
        [258532.053039] softirqs last  enabled at (1653449852): [<ffffffff9e200466>] __do_softirq+0x466/0x6bd
        [258532.053042] softirqs last disabled at (1653449845): [<ffffffff9d4c8a0c>] irq_exit+0xec/0x120
        [258532.053043] ---[ end trace 8476fce13d9ce20a ]---
      
      Which results in flooding dmesg/syslog since btrfs_drop_extent_cache()
      uses WARN_ON() and not WARN_ON_ONCE().
      
      So fix this issue by changing run_delalloc_nocow()'s loop to move to the
      next extent item when the current extent item ends at at offset less than
      or equals to the current offset instead of the start offset.
      
      Fixes: 80ff3856 ("Btrfs: update nodatacow code v2")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      de7999af
    • Dennis Zhou's avatar
      btrfs: fix compressed write bio blkcg attribution · 46bcff2b
      Dennis Zhou authored
      Bio attribution is handled at bio_set_dev() as once we have a device, we
      have a corresponding request_queue and then can derive the current css.
      In special cases, we want to attribute to bio to someone else. This can
      be done by calling bio_associate_blkg_from_css() or
      kthread_associate_blkcg() depending on the scenario. Btrfs does this for
      compressed writeback as they are handled by kworkers, so the latter can
      be done here.
      
      Commit 1a418027 ("btrfs: drop bio_set_dev where not needed") removes
      early bio_set_dev() calls prior to submit_stripe_bio(). This breaks the
      above assumption that we'll have a request_queue when we are doing
      association. To fix this, switch to using kthread_associate_blkcg().
      
      Without this, we crash in btrfs/024:
      
        [ 3052.093088] BUG: kernel NULL pointer dereference, address: 0000000000000510
        [ 3052.107013] #PF: supervisor read access in kernel mode
        [ 3052.107014] #PF: error_code(0x0000) - not-present page
        [ 3052.107015] PGD 0 P4D 0
        [ 3052.107021] Oops: 0000 [#1] SMP
        [ 3052.138904] CPU: 42 PID: 201270 Comm: kworker/u161:0 Kdump: loaded Not tainted 5.5.0-rc1-00062-g4852d8ac90a9 #712
        [ 3052.138905] Hardware name: Quanta Tioga Pass Single Side 01-0032211004/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        [ 3052.138912] Workqueue: btrfs-delalloc btrfs_work_helper
        [ 3052.191375] RIP: 0010:bio_associate_blkg_from_css+0x1e/0x3c0
        [ 3052.191379] RSP: 0018:ffffc900210cfc90 EFLAGS: 00010282
        [ 3052.191380] RAX: 0000000000000000 RBX: ffff88bfe5573c00 RCX: 0000000000000000
        [ 3052.191382] RDX: ffff889db48ec2f0 RSI: ffff88bfe5573c00 RDI: ffff889db48ec2f0
        [ 3052.191386] RBP: 0000000000000800 R08: 0000000000203bb0 R09: ffff889db16b2400
        [ 3052.293364] R10: 0000000000000000 R11: ffff88a07fffde80 R12: ffff889db48ec2f0
        [ 3052.293365] R13: 0000000000001000 R14: ffff889de82bc000 R15: ffff889e2b7bdcc8
        [ 3052.293367] FS:  0000000000000000(0000) GS:ffff889ffba00000(0000) knlGS:0000000000000000
        [ 3052.293368] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 3052.293369] CR2: 0000000000000510 CR3: 0000000002611001 CR4: 00000000007606e0
        [ 3052.293370] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [ 3052.293371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [ 3052.293372] PKRU: 55555554
        [ 3052.293376] Call Trace:
        [ 3052.402552]  btrfs_submit_compressed_write+0x137/0x390
        [ 3052.402558]  submit_compressed_extents+0x40f/0x4c0
        [ 3052.422401]  btrfs_work_helper+0x246/0x5a0
        [ 3052.422408]  process_one_work+0x200/0x570
        [ 3052.438601]  ? process_one_work+0x180/0x570
        [ 3052.438605]  worker_thread+0x4c/0x3e0
        [ 3052.438614]  kthread+0x103/0x140
        [ 3052.460735]  ? process_one_work+0x570/0x570
        [ 3052.460737]  ? kthread_mod_delayed_work+0xc0/0xc0
        [ 3052.460744]  ret_from_fork+0x24/0x30
      
      Fixes: 1a418027 ("btrfs: drop bio_set_dev where not needed")
      Reported-by: default avatarChris Murphy <chris@colorremedies.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      46bcff2b
    • Dennis Zhou's avatar
      btrfs: punt all bios created in btrfs_submit_compressed_write() · 7b62e66c
      Dennis Zhou authored
      Compressed writes happen in the background via kworkers. However, this
      causes bios to be attributed to root bypassing any cgroup limits from
      the actual writer. We tag the first bio with REQ_CGROUP_PUNT, which will
      punt the bio to an appropriate cgroup specific workqueue and attribute
      the IO properly. However, if btrfs_submit_compressed_write() creates a
      new bio, we don't tag it the same way. Add the appropriate tagging for
      subsequent bios.
      
      Fixes: ec39f769 ("Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios")
      Reviewed-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7b62e66c
  3. 13 Dec, 2019 17 commits
    • Anand Jain's avatar
      btrfs: send: remove WARN_ON for readonly mount · fbd54297
      Anand Jain authored
      We log warning if root::orphan_cleanup_state is not set to
      ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is
      mounted as readonly we skip the orphan item cleanup during the lookup
      and root::orphan_cleanup_state remains at the init state 0 instead of
      ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the
      warning as below.
      
        WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE);
      
      WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
      ::
      RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
      ::
      Call Trace:
      ::
      _btrfs_ioctl_send+0x7b/0x110 [btrfs]
      btrfs_ioctl+0x150a/0x2b00 [btrfs]
      ::
      do_vfs_ioctl+0xa9/0x620
      ? __fget+0xac/0xe0
      ksys_ioctl+0x60/0x90
      __x64_sys_ioctl+0x16/0x20
      do_syscall_64+0x49/0x130
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reproducer:
        mkfs.btrfs -fq /dev/sdb
        mount /dev/sdb /btrfs
        btrfs subvolume create /btrfs/sv1
        btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1
        umount /btrfs
        mount -o ro /dev/sdb /btrfs
        btrfs send /btrfs/ss1 -f /tmp/f
      
      The warning exists because having orphan inodes could confuse send and
      cause it to fail or produce incorrect streams.  The two cases that would
      cause such send failures, which are already fixed are:
      
      1) Inodes that were unlinked - these are orphanized and remain with a
         link count of 0. These caused send operations to fail because it
         expected to always find at least one path for an inode. However this
         is no longer a problem since send is now able to deal with such
         inodes since commit 46b2f459 ("Btrfs: fix send failure when root
         has deleted files still open") and treats them as having been
         completely removed (the state after an orphan cleanup is performed).
      
      2) Inodes that were in the process of being truncated. These resulted in
         send not knowing about the truncation and potentially issue write
         operations full of zeroes for the range from the new file size to the
         old file size. This is no longer a problem because we no longer
         create orphan items for truncation since commit f7e9e8fc ("Btrfs:
         stop creating orphan items for truncate").
      
      As such before these commits, the WARN_ON here provided a clue in case
      something went wrong. Instead of being a warning against the
      root::orphan_cleanup_state value, it could have been more accurate by
      checking if there were actually any orphan items, and then issue a
      warning only if any exists, but that would be more expensive to check.
      Since orphanized inodes no longer cause problems for send, just remove
      the warning.
      Reported-by: default avatarChristoph Anton Mitterer <calestyo@scientia.net>
      Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/
      CC: stable@vger.kernel.org # 4.19+
      Suggested-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fbd54297
    • Josef Bacik's avatar
      btrfs: do not leak reloc root if we fail to read the fs root · ca1aa281
      Josef Bacik authored
      If we fail to read the fs root corresponding with a reloc root we'll
      just break out and free the reloc roots.  But we remove our current
      reloc_root from this list higher up, which means we'll leak this
      reloc_root.  Fix this by adding ourselves back to the reloc_roots list
      so we are properly cleaned up.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca1aa281
    • Josef Bacik's avatar
      btrfs: skip log replay on orphaned roots · 9bc574de
      Josef Bacik authored
      My fsstress modifications coupled with generic/475 uncovered a failure
      to mount and replay the log if we hit a orphaned root.  We do not want
      to replay the log for an orphan root, but it's completely legitimate to
      have an orphaned root with a log attached.  Fix this by simply skipping
      replaying the log.  We still need to pin it's root node so that we do
      not overwrite it while replaying other logs, as we re-read the log root
      at every stage of the replay.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9bc574de
    • Josef Bacik's avatar
      btrfs: handle ENOENT in btrfs_uuid_tree_iterate · 714cd3e8
      Josef Bacik authored
      If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the
      uuid tree we'll just continue and do btrfs_next_item().  However we've
      done a btrfs_release_path() at this point and no longer have a valid
      path.  So increment the key and go back and do a normal search.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      714cd3e8
    • Josef Bacik's avatar
      btrfs: abort transaction after failed inode updates in create_subvol · c7e54b51
      Josef Bacik authored
      We can just abort the transaction here, and in fact do that for every
      other failure in this function except these two cases.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c7e54b51
    • Filipe Manana's avatar
      Btrfs: fix hole extent items with a zero size after range cloning · 147271e3
      Filipe Manana authored
      Normally when cloning a file range if we find an implicit hole at the end
      of the range we assume it is because the NO_HOLES feature is enabled.
      However that is not always the case. One well known case [1] is when we
      have a power failure after mixing buffered and direct IO writes against
      the same file.
      
      In such cases we need to punch a hole in the destination file, and if
      the NO_HOLES feature is not enabled, we need to insert explicit file
      extent items to represent the hole. After commit 690a5dbf
      ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning
      extents"), we started to insert file extent items representing the hole
      with an item size of 0, which is invalid and should be 53 bytes (the size
      of a btrfs_file_extent_item structure), resulting in all sorts of
      corruptions and invalid memory accesses. This is detected by the tree
      checker when we attempt to write a leaf to disk.
      
      The problem can be sporadically triggered by test case generic/561 from
      fstests. That test case does not exercise power failure and creates a new
      filesystem when it starts, so it does not use a filesystem created by any
      previous test that tests power failure. However the test does both
      buffered and direct IO writes (through fsstress) and it's precisely that
      which is creating the implicit holes in files. That happens even before
      the commit mentioned earlier. I need to investigate why we get those
      implicit holes to check if there is a real problem or not. For now this
      change fixes the regression of introducing file extent items with an item
      size of 0 bytes.
      
      Fix the issue by calling btrfs_punch_hole_range() without passing a
      btrfs_clone_extent_info structure, which ensures file extent items are
      inserted to represent the hole with a correct item size. We were passing
      a btrfs_clone_extent_info with a value of 0 for its 'item_size' field,
      which was causing the insertion of file extent items with an item size
      of 0.
      
      [1] https://www.spinics.net/lists/linux-btrfs/msg75350.htmlReported-by: default avatarDavid Sterba <dsterba@suse.com>
      Fixes: 690a5dbf ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      147271e3
    • Filipe Manana's avatar
      Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues · 6609fee8
      Filipe Manana authored
      When a tree mod log user no longer needs to use the tree it calls
      btrfs_put_tree_mod_seq() to remove itself from the list of users and
      delete all no longer used elements of the tree's red black tree, which
      should be all elements with a sequence number less then our equals to
      the caller's sequence number. However the logic is broken because it
      can delete and free elements from the red black tree that have a
      sequence number greater then the caller's sequence number:
      
      1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the
         tree mod log;
      
      2) The task which got assigned the sequence number 1 calls
         btrfs_put_tree_mod_seq();
      
      3) Sequence number 1 is deleted from the list of sequence numbers;
      
      4) The current minimum sequence number is computed to be the sequence
         number 2;
      
      5) A task using sequence number 2 is at tree_mod_log_rewind() and gets
         a pointer to one of its elements from the red black tree through
         a call to tree_mod_log_search();
      
      6) The task with sequence number 1 iterates the red black tree of tree
         modification elements and deletes (and frees) all elements with a
         sequence number less then or equals to 2 (the computed minimum sequence
         number) - it ends up only leaving elements with sequence numbers of 3
         and 4;
      
      7) The task with sequence number 2 now uses the pointer to its element,
         already freed by the other task, at __tree_mod_log_rewind(), resulting
         in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces
         a trace like the following:
      
        [16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        [16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G        W         5.4.0-rc8-btrfs-next-51 #1
        [16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [16804.548666] RIP: 0010:rb_next+0x16/0x50
        (...)
        [16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202
        [16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b
        [16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600
        [16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000
        [16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e
        [16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8
        [16804.554399] FS:  00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000
        [16804.555039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0
        [16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [16804.557583] Call Trace:
        [16804.558207]  __tree_mod_log_rewind+0xbf/0x280 [btrfs]
        [16804.558835]  btrfs_search_old_slot+0x105/0xd00 [btrfs]
        [16804.559468]  resolve_indirect_refs+0x1eb/0xc70 [btrfs]
        [16804.560087]  ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs]
        [16804.560700]  find_parent_nodes+0x388/0x1120 [btrfs]
        [16804.561310]  btrfs_check_shared+0x115/0x1c0 [btrfs]
        [16804.561916]  ? extent_fiemap+0x59d/0x6d0 [btrfs]
        [16804.562518]  extent_fiemap+0x59d/0x6d0 [btrfs]
        [16804.563112]  ? __might_fault+0x11/0x90
        [16804.563706]  do_vfs_ioctl+0x45a/0x700
        [16804.564299]  ksys_ioctl+0x70/0x80
        [16804.564885]  ? trace_hardirqs_off_thunk+0x1a/0x20
        [16804.565461]  __x64_sys_ioctl+0x16/0x20
        [16804.566020]  do_syscall_64+0x5c/0x250
        [16804.566580]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [16804.567153] RIP: 0033:0x7f4b1ba2add7
        (...)
        [16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7
        [16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003
        [16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44
        [16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48
        [16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50
        (...)
        [16804.575623] ---[ end trace 87317359aad4ba50 ]---
      
      Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that
      have a sequence number equals to the computed minimum sequence number, and
      not just elements with a sequence number greater then that minimum.
      
      Fixes: bd989ba3 ("Btrfs: add tree modification log functions")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6609fee8
    • Filipe Manana's avatar
      Btrfs: make tree checker detect checksum items with overlapping ranges · ad1d8c43
      Filipe Manana authored
      Having checksum items, either on the checksums tree or in a log tree, that
      represent ranges that overlap each other is a sign of a corruption. Such
      case confuses the checksum lookup code and can result in not being able to
      find checksums or find stale checksums.
      
      So add a check for such case.
      
      This is motivated by a recent fix for a case where a log tree had checksum
      items covering ranges that overlap each other due to extent cloning, and
      resulted in missing checksums after replaying the log tree. It also helps
      detect past issues such as stale and outdated checksums due to overlapping,
      commit 27b9a812 ("Btrfs: fix csum tree corruption, duplicate and
      outdated checksums").
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad1d8c43
    • Filipe Manana's avatar
      Btrfs: fix missing data checksums after replaying a log tree · 40e046ac
      Filipe Manana authored
      When logging a file that has shared extents (reflinked with other files or
      with itself), we can end up logging multiple checksum items that cover
      overlapping ranges. This confuses the search for checksums at log replay
      time causing some checksums to never be added to the fs/subvolume tree.
      
      Consider the following example of a file that shares the same extent at
      offsets 0 and 256Kb:
      
         [ bytenr 13893632, offset 64Kb, len 64Kb  ]
         0                                         64Kb
      
         [ bytenr 13631488, offset 64Kb, len 192Kb ]
         64Kb                                      256Kb
      
         [ bytenr 13893632, offset 0, len 256Kb    ]
         256Kb                                     512Kb
      
      When logging the inode, at tree-log.c:copy_items(), when processing the
      file extent item at offset 0, we log a checksum item covering the range
      13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
      64Kb + 64Kb, respectively.
      
      Later when processing the extent item at offset 256K, we log the checksums
      for the range from 13893632 to 14155776 (which corresponds to 13893632 +
      256Kb). These checksums get merged with the checksum item for the range
      from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
      So after this we get the two following checksum items in the log tree:
      
         (...)
         item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
                 range start 13631488 end 14155776 length 524288
         item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
                 range start 13959168 end 14024704 length 65536
      
      The first one covers the range from the second one, they overlap.
      
      So far this does not cause a problem after replaying the log, because
      when replaying the file extent item for offset 256K, we copy all the
      checksums for the extent 13893632 from the log tree to the fs/subvolume
      tree, since searching for an checksum item for bytenr 13893632 leaves us
      at the first checksum item, which covers the whole range of the extent.
      
      However if we write 64Kb to file offset 256Kb for example, we will
      not be able to find and copy the checksums for the last 128Kb of the
      extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
      
      After writing 64Kb into file offset 256Kb we get the following extent
      layout for our file:
      
         [ bytenr 13893632, offset 64K, len 64Kb   ]
         0                                         64Kb
      
         [ bytenr 13631488, offset 64Kb, len 192Kb ]
         64Kb                                      256Kb
      
         [ bytenr 14155776, offset 0, len 64Kb     ]
         256Kb                                     320Kb
      
         [ bytenr 13893632, offset 64Kb, len 192Kb ]
         320Kb                                     512Kb
      
      After fsync'ing the file, if we have a power failure and then mount
      the filesystem to replay the log, the following happens:
      
      1) When replaying the file extent item for file offset 320Kb, we
         lookup for the checksums for the extent range from 13959168
         (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
         to btrfs_lookup_csums_range();
      
      2) btrfs_lookup_csums_range() finds the checksum item that starts
         precisely at offset 13959168 (item 7 in the log tree, shown before);
      
      3) However that checksum item only covers 64Kb of data, and not 192Kb
         of data;
      
      4) As a result only the checksums for the first 64Kb of data referenced
         by the file extent item are found and copied to the fs/subvolume tree.
         The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
         the corresponding data checksums found and copied to the fs/subvolume
         tree.
      
      5) After replaying the log userspace will not be able to read the file
         range from 384Kb to 512Kb, because the checksums are missing and
         resulting in an -EIO error.
      
      The following steps reproduce this scenario:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
        $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
      
        $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
      
        $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
      
        <power failure>
      
        $ mount /dev/sdc /mnt/sdc
        $ md5sum /mnt/sdc/foobar
        md5sum: /mnt/sdc/foobar: Input/output error
      
        $ dmesg | tail
        [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
        [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
        [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
        [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
        [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
        [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
        [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
        [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
        [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
        [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
      
      Fix this simply by deleting first any checksums, from the log tree, for the
      range of the extent we are logging at copy_items(). This ensures we do not
      get checksum items in the log tree that have overlapping ranges.
      
      This is a long time issue that has been present since we have the clone
      (and deduplication) ioctl, and can happen both when an extent is shared
      between different files and within the same file.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      40e046ac
    • Dan Carpenter's avatar
      btrfs: return error pointer from alloc_test_extent_buffer · b6293c82
      Dan Carpenter authored
      Callers of alloc_test_extent_buffer have not correctly interpreted the
      return value as error pointer, as alloc_test_extent_buffer should behave
      as alloc_extent_buffer. The self-tests were unaffected but
      btrfs_find_create_tree_block could call both functions and that would
      cause problems up in the call chain.
      
      Fixes: faa2dbf0 ("Btrfs: add sanity tests for new qgroup accounting code")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b6293c82
    • David Sterba's avatar
      btrfs: fix devs_max constraints for raid1c3 and raid1c4 · cf93e15e
      David Sterba authored
      The value 0 for devs_max means to spread the allocated chunks over all
      available devices, eg. stripe for RAID0 or RAID5. This got mistakenly
      copied to the RAID1C3/4 profiles. The intention is to have exactly 3 and
      4 copies respectively.
      
      Fixes: 47e6f742 ("btrfs: add support for 3-copy replication (raid1c3)")
      Fixes: 8d6fac00 ("btrfs: add support for 4-copy replication (raid1c4)")
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cf93e15e
    • Andreas Färber's avatar
      btrfs: tree-checker: Fix error format string for size_t · 994bf9cd
      Andreas Färber authored
      Argument BTRFS_FILE_EXTENT_INLINE_DATA_START is defined as offsetof(),
      which returns type size_t, so we need %zu instead of %lu.
      
      This fixes a build warning on 32-bit ARM:
      
        ../fs/btrfs/tree-checker.c: In function 'check_extent_data_item':
        ../fs/btrfs/tree-checker.c:230:43: warning: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'unsigned int' [-Wformat=]
          230 |     "invalid item size, have %u expect [%lu, %u)",
              |                                         ~~^
              |                                           long unsigned int
              |                                         %u
      
      Fixes: 153a6d29 ("btrfs: tree-checker: Check item size before reading file extent type")
      Acked-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarAndreas Färber <afaerber@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      994bf9cd
    • Josef Bacik's avatar
      btrfs: don't double lock the subvol_sem for rename exchange · 943eb3bf
      Josef Bacik authored
      If we're rename exchanging two subvols we'll try to lock this lock
      twice, which is bad.  Just lock once if either of the ino's are subvols.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      943eb3bf
    • Josef Bacik's avatar
      btrfs: handle error in btrfs_cache_block_group · db8fe64f
      Josef Bacik authored
      We have a BUG_ON(ret < 0) in find_free_extent from
      btrfs_cache_block_group.  If we fail to allocate our ctl we'll just
      panic, which is not good.  Instead just go on to another block group.
      If we fail to find a block group we don't want to return ENOSPC, because
      really we got a ENOMEM and that's the root of the problem.  Save our
      return from btrfs_cache_block_group(), and then if we still fail to make
      our allocation return that ret so we get the right error back.
      
      Tested with inject-error.py from bcc.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      db8fe64f
    • Josef Bacik's avatar
      btrfs: do not call synchronize_srcu() in inode_tree_del · f72ff01d
      Josef Bacik authored
      Testing with the new fsstress uncovered a pretty nasty deadlock with
      lookup and snapshot deletion.
      
      Process A
      unlink
       -> final iput
         -> inode_tree_del
           -> synchronize_srcu(subvol_srcu)
      
      Process B
      btrfs_lookup  <- srcu_read_lock() acquired here
        -> btrfs_iget
          -> find inode that has I_FREEING set
            -> __wait_on_freeing_inode()
      
      We're holding the srcu_read_lock() while doing the iget in order to make
      sure our fs root doesn't go away, and then we are waiting for the inode
      to finish freeing.  However because the free'ing process is doing a
      synchronize_srcu() we deadlock.
      
      Fix this by dropping the synchronize_srcu() in inode_tree_del().  We
      don't need people to stop accessing the fs root at this point, we're
      only adding our empty root to the dead roots list.
      
      A larger much more invasive fix is forthcoming to address how we deal
      with fs roots, but this fixes the immediate problem.
      
      Fixes: 76dda93c ("Btrfs: add snapshot/subvolume destroy ioctl")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f72ff01d
    • Filipe Manana's avatar
      Btrfs: fix cloning range with a hole when using the NO_HOLES feature · fcb97058
      Filipe Manana authored
      When using the NO_HOLES feature if we clone a range that contains a hole
      and a temporary ENOSPC happens while dropping extents from the target
      inode's range, we can end up failing and aborting the transaction with
      -EEXIST or with a corrupt file extent item, that has a length greater
      than it should and overlaps with other extents. For example when cloning
      the following range from inode A to inode B:
      
        Inode A:
      
          extent A1                                          extent A2
        [ ----------- ]  [ hole, implicit, 4MB length ]  [ ------------- ]
        0            1MB                                 5MB            6MB
      
        Range to clone: [1MB, 6MB)
      
        Inode B:
      
          extent B1       extent B2        extent B3         extent B4
        [ ---------- ]  [ --------- ]    [ ---------- ]    [ ---------- ]
        0           1MB 1MB        2MB   2MB        5MB    5MB         6MB
      
        Target range: [1MB, 6MB) (same as source, to make it easier to explain)
      
      The following can happen:
      
      1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
      
      2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
         set 'drop_end' to 2MB, meaning it was able to drop only extent B2;
      
      3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB =
         1MB;
      
      4) We then attempt to insert a file extent item at inode B with a file
         offset of 5MB, which is the value of clone_info->file_offset. This
         fails with error -EEXIST because there's already an extent at that
         offset (extent B4);
      
      5) We abort the current transaction with -EEXIST and return that error
         to user space as well.
      
      Another example, for extent corruption:
      
        Inode A:
      
          extent A1                                           extent A2
        [ ----------- ]   [ hole, implicit, 10MB length ]  [ ------------- ]
        0            1MB                                  11MB            12MB
      
        Inode B:
      
          extent B1         extent B2
        [ ----------- ]   [ --------- ]    [ ----------------------------- ]
        0            1MB 1MB         5MB  5MB                             12MB
      
        Target range: [1MB, 12MB) (same as source, to make it easier to explain)
      
      1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
      
      2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
         set 'drop_end' to 5MB, meaning it was able to drop only extent B2;
      
      3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB =
         4MB;
      
      4) We then insert a file extent item at inode B with a file offset of 11MB
         which is the value of clone_info->file_offset, and a length of 4MB (the
         value of 'clone_len'). So we get 2 extents items with ranges that
         overlap and an extent length of 4MB, larger then the extent A2 from
         inode A (1MB length);
      
      5) After that we end the transaction, balance the btree dirty pages and
         then start another or join the previous transaction. It might happen
         that the transaction which inserted the incorrect extent was committed
         by another task so we end up with extent corruption if a power failure
         happens.
      
      So fix this by making sure we attempt to insert the extent to clone at
      the destination inode only if we are past dropping the sub-range that
      corresponds to a hole.
      
      Fixes: 690a5dbf ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fcb97058
    • Nikolay Borisov's avatar
      btrfs: Fix error messages in qgroup_rescan_init · 37d02592
      Nikolay Borisov authored
      The branch of qgroup_rescan_init which is executed from the mount
      path prints wrong errors messages. The textual print out in case
      BTRFS_QGROUP_STATUS_FLAG_RESCAN/BTRFS_QGROUP_STATUS_FLAG_ON are not
      set are transposed. Fix it by exchanging their place.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      37d02592
  4. 18 Nov, 2019 16 commits
    • David Sterba's avatar
      btrfs: drop bdev argument from submit_extent_page · fa17ed06
      David Sterba authored
      After previous patches removing bdev being passed around to set it to
      bio, it has become unused in submit_extent_page. So it now has "only" 13
      parameters.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa17ed06
    • David Sterba's avatar
      btrfs: remove extent_map::bdev · a019e9e1
      David Sterba authored
      We can now remove the bdev from extent_map. Previous patches made sure
      that bio_set_dev is correctly in all places and that we don't need to
      grab it from latest_bdev or pass it around inside the extent map.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a019e9e1
    • David Sterba's avatar
      btrfs: drop bio_set_dev where not needed · 1a418027
      David Sterba authored
      bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
      also changing some state bits if there was a different bdev set before.
      This is one thing that's not needed.
      
      Another thing is that setting a bdev at bio allocation time is too early
      and actually does not work with plain redundancy profiles, where each
      time we submit a bio to a device, the bdev is set correctly.
      
      In many places the bio bdev is set to latest_bdev that seems to serve as
      a stub pointer "just to put something to bio". But we don't have to do
      that.
      
      Where do we know which bdev to set:
      
      * for regular IO: submit_stripe_bio that's called by btrfs_map_bio
      
      * repair IO: repair_io_failure, read or write from specific device
      
      * super block write (using buffer_heads but uses raw bdev) and barriers
      
      * scrub: this does not use all regular IO paths as it needs to reach all
        copies, verify and fixup eventually, and for that all bdev management
        is independent
      
      * raid56: rbio_add_io_page, for the RMW write
      
      * integrity-checker: does it's own low-level block tracking
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1a418027
    • David Sterba's avatar
      btrfs: get bdev directly from fs_devices in submit_extent_page · 429aebc0
      David Sterba authored
      This is preparatory patch to remove @bdev parameter from
      submit_extent_page. It can't be removed completely, because the cgroups
      need it for wbc when initializing the bio
      
      wbc_init_bio
        bio_associate_blkg_from_css
          dereference bdev->bi_disk->queue
      
      The bdev pointer is the same as latest_bdev, thus no functional change.
      We can retrieve it from fs_devices that's reachable through several
      dereferences. The local variable shadows the parameter, but that's only
      temporary.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      429aebc0
    • Josef Bacik's avatar
      btrfs: record all roots for rename exchange on a subvol · 3e174099
      Josef Bacik authored
      Testing with the new fsstress support for subvolumes uncovered a pretty
      bad problem with rename exchange on subvolumes.  We're modifying two
      different subvolumes, but we only start the transaction on one of them,
      so the other one is not added to the dirty root list.  This is caught by
      btrfs_cow_block() with a warning because the root has not been updated,
      however if we do not modify this root again we'll end up pointing at an
      invalid root because the root item is never updated.
      
      Fix this by making sure we add the destination root to the trans list,
      the same as we do with normal renames.  This fixes the corruption.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3e174099
    • Filipe Manana's avatar
      Btrfs: fix block group remaining RO forever after error during device replace · 042528f8
      Filipe Manana authored
      When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
      set the block group to RO mode and then wait for any ongoing writes into
      extents of the block group to complete. While doing that wait we overwrite
      the value of the variable 'ret' and can break out of the loop if an error
      happens without turning the block group back into RW mode. So what happens
      is the following:
      
      1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
         to RO mode (its ->ro field set to 1 or incremented to some value > 1);
      
      2) Then btrfs_wait_ordered_roots() returns a value > 0;
      
      3) Then if either joining or committing the transaction fails, we break
         out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving
         the block group in RO mode forever.
      
      To fix this, just remove the code that waits for ongoing writes to extents
      of the block group, since it's not needed because in the initial setup
      phase of a device replace operation, before starting to find all chunks
      and their extents, we set the target device for replace while holding
      fs_info->dev_replace->rwsem, which ensures that after releasing that
      semaphore, any writes into the source device are made to the target device
      as well (__btrfs_map_block() guarantees that). So while at
      scrub_enumerate_chunks() we only need to worry about finding and copying
      extents (from the source device to the target device) that were written
      before we started the device replace operation.
      
      Fixes: f0e9b7d6 ("Btrfs: fix race setting block group readonly during device replace")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      042528f8
    • Qu Wenruo's avatar
      btrfs: scrub: Don't check free space before marking a block group RO · b12de528
      Qu Wenruo authored
      [BUG]
      When running btrfs/072 with only one online CPU, it has a pretty high
      chance to fail:
      
        btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
        - output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
            --- tests/btrfs/072.out     2019-10-22 15:18:14.008965340 +0800
            +++ /xfstests-dev/results//btrfs/072.out.bad      2019-11-14 15:56:45.877152240 +0800
            @@ -1,2 +1,3 @@
             QA output created by 072
             Silence is golden
            +Scrub find errors in "-m dup -d single" test
            ...
      
      And with the following call trace:
      
        BTRFS info (device dm-5): scrub: started on devid 1
        ------------[ cut here ]------------
        BTRFS: Transaction aborted (error -27)
        WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        CPU: 0 PID: 55087 Comm: btrfs Tainted: G        W  O      5.4.0-rc1-custom+ #13
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        Call Trace:
         __btrfs_end_transaction+0xdb/0x310 [btrfs]
         btrfs_end_transaction+0x10/0x20 [btrfs]
         btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
         scrub_enumerate_chunks+0x264/0x940 [btrfs]
         btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
         btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
         do_vfs_ioctl+0x636/0xaa0
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x43/0x50
         do_syscall_64+0x79/0xe0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        ---[ end trace 166c865cec7688e7 ]---
      
      [CAUSE]
      The error number -27 is -EFBIG, returned from the following call chain:
      btrfs_end_transaction()
      |- __btrfs_end_transaction()
         |- btrfs_create_pending_block_groups()
            |- btrfs_finish_chunk_alloc()
               |- btrfs_add_system_chunk()
      
      This happens because we have used up all space of
      btrfs_super_block::sys_chunk_array.
      
      The root cause is, we have the following bad loop of creating tons of
      system chunks:
      
      1. The only SYSTEM chunk is being scrubbed
         It's very common to have only one SYSTEM chunk.
      2. New SYSTEM bg will be allocated
         As btrfs_inc_block_group_ro() will check if we have enough space
         after marking current bg RO. If not, then allocate a new chunk.
      3. New SYSTEM bg is still empty, will be reclaimed
         During the reclaim, we will mark it RO again.
      4. That newly allocated empty SYSTEM bg get scrubbed
         We go back to step 2, as the bg is already mark RO but still not
         cleaned up yet.
      
      If the cleaner kthread doesn't get executed fast enough (e.g. only one
      CPU), then we will get more and more empty SYSTEM chunks, using up all
      the space of btrfs_super_block::sys_chunk_array.
      
      [FIX]
      Since scrub/dev-replace doesn't always need to allocate new extent,
      especially chunk tree extent, so we don't really need to do chunk
      pre-allocation.
      
      To break above spiral, here we introduce a new parameter to
      btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
      need extra chunk pre-allocation.
      
      For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
      @do_chunk_alloc=false.
      This should keep unnecessary empty chunks from popping up for scrub.
      
      Also, since there are two parameters for btrfs_inc_block_group_ro(),
      add more comment for it.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b12de528
    • Johannes Thumshirn's avatar
      btrfs: change btrfs_fs_devices::rotating to bool · 7f0432d0
      Johannes Thumshirn authored
      struct btrfs_fs_devices::rotating currently is declared as an integer
      variable but only used as a boolean.
      
      Change the variable definition to bool and update to code touching it to
      set 'true' and 'false'.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f0432d0
    • Johannes Thumshirn's avatar
      btrfs: change btrfs_fs_devices::seeding to bool · 0395d84f
      Johannes Thumshirn authored
      struct btrfs_fs_devices::seeding currently is declared as an integer
      variable but only used as a boolean.
      
      Change the variable definition to bool and update to code touching it to
      set 'true' and 'false'.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0395d84f
    • David Sterba's avatar
      btrfs: rename btrfs_block_group_cache · 32da5386
      David Sterba authored
      The type name is misleading, a single entry is named 'cache' while this
      normally means a collection of objects. Rename that everywhere. Also the
      identifier was quite long, making function prototypes harder to format.
      Suggested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      32da5386
    • Qu Wenruo's avatar
      btrfs: block-group: Reuse the item key from caller of read_one_block_group() · d49a2ddb
      Qu Wenruo authored
      For read_one_block_group(), its only caller has already got the item key
      to search next block group item.
      
      So we can use that key directly without doing our own convertion on
      stack.
      
      Also, since that key used in btrfs_read_block_groups() is vital for
      block group item search, add 'const' keyword for that parameter to
      prevent read_one_block_group() to modify it.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d49a2ddb
    • Qu Wenruo's avatar
      btrfs: block-group: Refactor btrfs_read_block_groups() · ffb9e0f0
      Qu Wenruo authored
      Refactor the work inside the loop of btrfs_read_block_groups() into one
      separate function, read_one_block_group().
      
      This allows read_one_block_group to be reused for later BG_TREE feature.
      
      The refactor does the following extra fix:
      - Use btrfs_fs_incompat() to replace open-coded feature check
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffb9e0f0
    • David Sterba's avatar
      btrfs: document extent buffer locking · d4e253bb
      David Sterba authored
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4e253bb
    • David Sterba's avatar
      btrfs: access eb::blocking_writers according to ACCESS_ONCE policies · a4477988
      David Sterba authored
      A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
      once policies can be found here
      https://lwn.net/Articles/799218/#Access-Marking%20Policies .
      
      The locked and unlocked access to eb::blocking_writers should be
      annotated accordingly, following this:
      
      Writes:
      
      - locked write must use ONCE, may use plain read
      - unlocked write must use ONCE
      
      Reads:
      
      - unlocked read must use ONCE
      - locked read may use plain read iff not mixed with unlocked read
      - unlocked read then locked must use ONCE
      
      There's one difference on the assembly level, where
      btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
      value and did not reevaluate it after taking the lock. This could have
      missed some opportunities to take the lock in case blocking writers
      changed between the calls, but the window is just a few instructions
      long. As this is in try-lock, the callers handle that.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a4477988
    • David Sterba's avatar
      btrfs: set blocking_writers directly, no increment or decrement · 40d38f53
      David Sterba authored
      The increment and decrement was inherited from previous version that
      used atomics, switched in commit 06297d8c ("btrfs: switch
      extent_buffer blocking_writers from atomic to int"). The only possible
      values are 0 and 1 so we can set them directly.
      
      The generated assembly (gcc 9.x) did the direct value assignment in
      btrfs_set_lock_blocking_write (asm diff after change in 06297d8c):
      
           5d:   test   %eax,%eax
           5f:   je     62 <btrfs_set_lock_blocking_write+0x22>
           61:   retq
      
        -  62:   lock incl 0x44(%rdi)
        -  66:   add    $0x50,%rdi
        -  6a:   jmpq   6f <btrfs_set_lock_blocking_write+0x2f>
      
        +  62:   movl   $0x1,0x44(%rdi)
        +  69:   add    $0x50,%rdi
        +  6d:   jmpq   72 <btrfs_set_lock_blocking_write+0x32>
      
      The part in btrfs_tree_unlock did a decrement because
      BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
      otherwise the output looks safe:
      
        - lock decl 0x44(%rdi)
      
        + sub    $0x1,%eax
        + mov    %eax,0x44(%rdi)
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      40d38f53
    • David Sterba's avatar
      btrfs: merge blocking_writers branches in btrfs_tree_read_lock · f5c2a525
      David Sterba authored
      There are two ifs that use eb::blocking_writers. As this is a variable
      modified inside and outside of locks, we could minimize number of
      accesses to avoid problems with getting different results at different
      times.
      
      The access here is locked so this can only race with btrfs_tree_unlock
      that sets blocking_writers to 0 without lock and unsets the lock owner.
      
      The first branch is taken only if the same thread already holds the
      lock, the second if checks for blocking writers. Here we'd either unlock
      and wait, or proceed. Both are valid states of the locking protocol.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f5c2a525