1. 28 May, 2019 6 commits
    • Qu Wenruo's avatar
      btrfs: qgroup: Check bg while resuming relocation to avoid NULL pointer dereference · 57949d03
      Qu Wenruo authored
      [BUG]
      When mounting a fs with reloc tree and has qgroup enabled, it can cause
      NULL pointer dereference at mount time:
      
        BUG: kernel NULL pointer dereference, address: 00000000000000a8
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP NOPTI
        RIP: 0010:btrfs_qgroup_add_swapped_blocks+0x186/0x300 [btrfs]
        Call Trace:
         replace_path.isra.23+0x685/0x900 [btrfs]
         merge_reloc_root+0x26e/0x5f0 [btrfs]
         merge_reloc_roots+0x10a/0x1a0 [btrfs]
         btrfs_recover_relocation+0x3cd/0x420 [btrfs]
         open_ctree+0x1bc8/0x1ed0 [btrfs]
         btrfs_mount_root+0x544/0x680 [btrfs]
         legacy_get_tree+0x34/0x60
         vfs_get_tree+0x2d/0xf0
         fc_mount+0x12/0x40
         vfs_kern_mount.part.12+0x61/0xa0
         vfs_kern_mount+0x13/0x20
         btrfs_mount+0x16f/0x860 [btrfs]
         legacy_get_tree+0x34/0x60
         vfs_get_tree+0x2d/0xf0
         do_mount+0x81f/0xac0
         ksys_mount+0xbf/0xe0
         __x64_sys_mount+0x25/0x30
         do_syscall_64+0x65/0x240
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      In btrfs_recover_relocation(), we don't have enough info to determine
      which block group we're relocating, but only to merge existing reloc
      trees.
      
      Thus in btrfs_recover_relocation(), rc->block_group is NULL.
      btrfs_qgroup_add_swapped_blocks() hasn't taken this into consideration,
      and causes a NULL pointer dereference.
      
      The bug is introduced by commit 3d0174f7 ("btrfs: qgroup: Only trace
      data extents in leaves if we're relocating data block group"), and
      later qgroup refactoring still keeps this optimization.
      
      [FIX]
      Thankfully in the context of btrfs_recover_relocation(), there is no
      other progress can modify tree blocks, thus those swapped tree blocks
      pair will never affect qgroup numbers, no matter whatever we set for
      block->trace_leaf.
      
      So we only need to check if @bg is NULL before accessing @bg->flags.
      Reported-by: default avatarJuan Erbes <jerbes@gmail.com>
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1134806
      Fixes: 3d0174f7 ("btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group")
      CC: stable@vger.kernel.org # 4.20+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      57949d03
    • Qu Wenruo's avatar
      btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON() · 30d40577
      Qu Wenruo authored
      [BUG]
      When a fs has orphan reloc tree along with unfinished balance:
        ...
              item 16 key (TREE_RELOC ROOT_ITEM FS_TREE) itemoff 12090 itemsize 439
                      generation 12 root_dirid 256 bytenr 300400640 level 1 refs 0 <<<
                      lastsnap 8 byte_limit 0 bytes_used 1359872 flags 0x0(none)
                      uuid 7c48d938-33a3-4aae-ab19-6e5c9d406e46
              item 17 key (BALANCE TEMPORARY_ITEM 0) itemoff 11642 itemsize 448
                      temporary item objectid BALANCE offset 0
                      balance status flags 14
      
      Then at mount time, we can hit the following kernel BUG_ON():
        BTRFS info (device dm-3): relocating block group 298844160 flags metadata|dup
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/relocation.c:1413!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 1 PID: 897 Comm: btrfs-balance Tainted: G           O      5.2.0-rc1-custom #15
        RIP: 0010:create_reloc_root+0x1eb/0x200 [btrfs]
        Call Trace:
         btrfs_init_reloc_root+0x96/0xb0 [btrfs]
         record_root_in_trans+0xb2/0xe0 [btrfs]
         btrfs_record_root_in_trans+0x55/0x70 [btrfs]
         select_reloc_root+0x7e/0x230 [btrfs]
         do_relocation+0xc4/0x620 [btrfs]
         relocate_tree_blocks+0x592/0x6a0 [btrfs]
         relocate_block_group+0x47b/0x5d0 [btrfs]
         btrfs_relocate_block_group+0x183/0x2f0 [btrfs]
         btrfs_relocate_chunk+0x4e/0xe0 [btrfs]
         btrfs_balance+0x864/0xfa0 [btrfs]
         balance_kthread+0x3b/0x50 [btrfs]
         kthread+0x123/0x140
         ret_from_fork+0x27/0x50
      
      [CAUSE]
      In btrfs, reloc trees are used to record swapped tree blocks during
      balance.
      Reloc tree either get merged (replace old tree blocks of its parent
      subvolume) in next transaction if its ref is 1 (fresh).
      Or is already merged and will be cleaned up if its ref is 0 (orphan).
      
      After commit d2311e69 ("btrfs: relocation: Delay reloc tree deletion
      after merge_reloc_roots"), reloc tree cleanup is delayed until one block
      group is balanced.
      
      Since fresh reloc roots are recorded during merge, as long as there
      is no power loss, those orphan reloc roots converted from fresh ones are
      handled without problem.
      
      However when power loss happens, orphan reloc roots can be recorded
      on-disk, thus at next mount time, we will have orphan reloc roots from
      on-disk data directly, and ignored by clean_dirty_subvols() routine.
      
      Then when background balance starts to balance another block group, and
      needs to create new reloc root for the same root, btrfs_insert_item()
      returns -EEXIST, and trigger that BUG_ON().
      
      [FIX]
      For orphan reloc roots, also queue them to rc->dirty_subvol_roots, so
      all reloc roots no matter orphan or not, can be cleaned up properly and
      avoid above BUG_ON().
      
      And to cooperate with above change, clean_dirty_subvols() will check if
      the queued root is a reloc root or a subvol root.
      For a subvol root, do the old work, and for a orphan reloc root, clean it
      up.
      
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      CC: stable@vger.kernel.org # 5.1
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      30d40577
    • Filipe Manana's avatar
      Btrfs: incremental send, fix emission of invalid clone operations · 3c850b45
      Filipe Manana authored
      When doing an incremental send we can now issue clone operations with a
      source range that ends at the source's file eof and with a destination
      range that ends at an offset smaller then the destination's file eof.
      If the eof of the source file is not aligned to the sector size of the
      filesystem, the receiver will get a -EINVAL error when trying to do the
      operation or, on older kernels, silently corrupt the destination file.
      The corruption happens on kernels without commit ac765f83
      ("Btrfs: fix data corruption due to cloning of eof block"), while the
      failure to clone happens on kernels with that commit.
      
      Example reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt/sdb
      
        $ xfs_io -f -c "pwrite -S 0xb1 0 2M" /mnt/sdb/foo
        $ xfs_io -f -c "pwrite -S 0xc7 0 2M" /mnt/sdb/bar
        $ xfs_io -f -c "pwrite -S 0x4d 0 2M" /mnt/sdb/baz
        $ xfs_io -f -c "pwrite -S 0xe2 0 2M" /mnt/sdb/zoo
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
      
        $ btrfs send -f /tmp/base.send /mnt/sdb/base
      
        $ xfs_io -c "reflink /mnt/sdb/bar 1560K 500K 100K" /mnt/sdb/bar
        $ xfs_io -c "reflink /mnt/sdb/bar 1560K 0 100K" /mnt/sdb/zoo
        $ xfs_io -c "truncate 550K" /mnt/sdb/bar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
      
        $ btrfs send -f /tmp/incr.send -p /mnt/sdb/base /mnt/sdb/incr
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ btrfs receive -f /tmp/base.send /mnt/sdc
        $ btrfs receive -vv -f /tmp/incr.send /mnt/sdc
        (...)
        truncate bar size=563200
        utimes bar
        clone zoo - source=bar source offset=512000 offset=0 length=51200
        ERROR: failed to clone extents to zoo
        Invalid argument
      
      The failure happens because the clone source range ends at the eof of file
      bar, 563200, which is not aligned to the filesystems sector size (4Kb in
      this case), and the destination range ends at offset 0 + 51200, which is
      less then the size of the file zoo (2Mb).
      
      So fix this by detecting such case and instead of issuing a clone
      operation for the whole range, do a clone operation for smaller range
      that is sector size aligned followed by a write operation for the block
      containing the eof. Here we will always be pessimistic and assume the
      destination filesystem of the send stream has the largest possible sector
      size (64Kb), since we have no way of determining it.
      
      This fixes a recent regression introduced in kernel 5.2-rc1.
      
      Fixes: 040ee612 ("Btrfs: send, improve clone range")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c850b45
    • Filipe Manana's avatar
      Btrfs: incremental send, fix file corruption when no-holes feature is enabled · 6b1f72e5
      Filipe Manana authored
      When using the no-holes feature, if we have a file with prealloc extents
      with a start offset beyond the file's eof, doing an incremental send can
      cause corruption of the file due to incorrect hole detection. Such case
      requires that the prealloc extent(s) exist in both the parent and send
      snapshots, and that a hole is punched into the file that covers all its
      extents that do not cross the eof boundary.
      
      Example reproducer:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdb
        $ mount /dev/sdb /mnt/sdb
      
        $ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar
        $ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
      
        $ btrfs send -f /tmp/base.snap /mnt/sdb/base
      
        $ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
      
        $ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr
      
        $ md5sum /mnt/sdb/incr/foobar
        816df6f64deba63b029ca19d880ee10a   /mnt/sdb/incr/foobar
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ btrfs receive -f /tmp/base.snap /mnt/sdc
        $ btrfs receive -f /tmp/incr.snap /mnt/sdc
      
        $ md5sum /mnt/sdc/incr/foobar
        cf2ef71f4a9e90c2f6013ba3b2257ed2   /mnt/sdc/incr/foobar
      
          --> Different checksum, because the prealloc extent beyond the
              file's eof confused the hole detection code and it assumed
              a hole starting at offset 0 and ending at the offset of the
              prealloc extent (1200Kb) instead of ending at the offset
              500Kb (the file's size).
      
      Fix this by ensuring we never cross the file's size when issuing the
      write operations for a hole.
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      CC: stable@vger.kernel.org # 3.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b1f72e5
    • Dennis Zhou's avatar
      btrfs: correct zstd workspace manager lock to use spin_lock_bh() · fee13fe9
      Dennis Zhou authored
      The btrfs zstd workspace manager uses a background timer to reclaim not
      recently used workspaces. I used spin_lock() from this context which
      should have been caught with lockdep, but was not. This deadlock was
      reported in bugzilla. The fix is to switch the zstd wsm lock to use
      spin_lock_bh() from the softirq context.
      
      This happened quite relibably on ppc64, unlike on other architectures.
      
        [  313.402874] ================================
        [  313.402875] WARNING: inconsistent lock state
        [  313.402879] 5.1.0-rc7 #1 Not tainted
        [  313.402880] --------------------------------
        [  313.402882] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        [  313.402885] swapper/5/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
        [  313.402888] 0000000080d1120c (&(&wsm.lock)->rlock){+.?.}, at: .zstd_reclaim_timer_fn+0x40/0x230
        [  313.402895] {SOFTIRQ-ON-W} state was registered at:
        [  313.402899]   .lock_acquire+0xd0/0x240
        [  313.402903]   ._raw_spin_lock+0x34/0x60
        [  313.402906]   .zstd_get_workspace+0xd0/0x360
        [  313.402908]   .end_compressed_bio_read+0x3b8/0x540
        [  313.402911]   .bio_endio+0x174/0x2c0
        [  313.402914]   .end_workqueue_fn+0x4c/0x70
        [  313.402917]   .normal_work_helper+0x138/0x7e0
        [  313.402920]   .process_one_work+0x324/0x790
        [  313.402922]   .worker_thread+0x68/0x570
        [  313.402925]   .kthread+0x19c/0x1b0
        [  313.402928]   .ret_from_kernel_thread+0x58/0x78
        [  313.402930] irq event stamp: 2629216
        [  313.402933] hardirqs last  enabled at (2629216): [<c0000000009da738>] ._raw_spin_unlock_irq+0x38/0x60
        [  313.402936] hardirqs last disabled at (2629215): [<c0000000009da4c4>] ._raw_spin_lock_irq+0x24/0x70
        [  313.402939] softirqs last  enabled at (2629212): [<c0000000000af9fc>] .irq_enter+0x8c/0xd0
        [  313.402942] softirqs last disabled at (2629213): [<c0000000000afb58>] .irq_exit+0x118/0x170
        [  313.402944]
      		 other info that might help us debug this:
        [  313.402945]  Possible unsafe locking scenario:
      
        [  313.402947]        CPU0
        [  313.402948]        ----
        [  313.402949]   lock(&(&wsm.lock)->rlock);
        [  313.402951]   <Interrupt>
        [  313.402952]     lock(&(&wsm.lock)->rlock);
        [  313.402954]
      		  *** DEADLOCK ***
      
        [  313.402957] 1 lock held by swapper/5/0:
        [  313.402958]  #0: 000000004b612042 ((&wsm.timer)){+.-.}, at: .call_timer_fn+0x0/0x3c0
        [  313.402963]
      		 stack backtrace:
        [  313.402967] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.1.0-rc7 #1
        [  313.402968] Call Trace:
        [  313.402972] [c0000007fa262e70] [c0000000009b3294] .dump_stack+0xe0/0x15c (unreliable)
        [  313.402975] [c0000007fa262f10] [c000000000125548] .print_usage_bug+0x348/0x390
        [  313.402978] [c0000007fa262fd0] [c000000000125cb4] .mark_lock+0x724/0x930
        [  313.402981] [c0000007fa263080] [c000000000126c20] .__lock_acquire+0xc90/0x16a0
        [  313.402984] [c0000007fa2631b0] [c000000000128040] .lock_acquire+0xd0/0x240
        [  313.402987] [c0000007fa263280] [c0000000009da2b4] ._raw_spin_lock+0x34/0x60
        [  313.402990] [c0000007fa263300] [c00000000054b0b0] .zstd_reclaim_timer_fn+0x40/0x230
        [  313.402993] [c0000007fa2633d0] [c000000000158b38] .call_timer_fn+0xc8/0x3c0
        [  313.402996] [c0000007fa2634a0] [c000000000158f74] .expire_timers+0x144/0x260
        [  313.402999] [c0000007fa263550] [c000000000159178] .run_timer_softirq+0xe8/0x230
        [  313.403002] [c0000007fa263680] [c0000000009db288] .__do_softirq+0x188/0x5d4
        [  313.403004] [c0000007fa263790] [c0000000000afb58] .irq_exit+0x118/0x170
        [  313.403008] [c0000007fa263800] [c000000000028d88] .timer_interrupt+0x158/0x430
        [  313.403012] [c0000007fa2638b0] [c0000000000091d4] decrementer_common+0x134/0x140
        [  313.403017] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
      		     LR = .arch_local_irq_restore.part.0+0x68/0x80
        [  313.403020] [c0000007fa263bb0] [c00000000001a3ac] .arch_local_irq_restore.part.0+0x2c/0x80 (unreliable)
        [  313.403024] [c0000007fa263c30] [c0000000007bbbcc] .cpuidle_enter_state+0xec/0x670
        [  313.403027] [c0000007fa263d00] [c0000000000f5130] .call_cpuidle+0x40/0x90
        [  313.403031] [c0000007fa263d70] [c0000000000f554c] .do_idle+0x2dc/0x3a0
        [  313.403034] [c0000007fa263e30] [c0000000000f59ac] .cpu_startup_entry+0x2c/0x30
        [  313.403037] [c0000007fa263ea0] [c000000000045674] .start_secondary+0x644/0x650
        [  313.403041] [c0000007fa263f90] [c00000000000ad5c] start_secondary_prolog+0x10/0x14
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203517
      Fixes: 3f93aef5 ("btrfs: add zstd compression level support")
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fee13fe9
    • Nikolay Borisov's avatar
      btrfs: Ensure replaced device doesn't have pending chunk allocation · debd1c06
      Nikolay Borisov authored
      Recent FITRIM work, namely bbbf7243 ("btrfs: combine device update
      operations during transaction commit") combined the way certain
      operations are recoded in a transaction. As a result an ASSERT was added
      in dev_replace_finish to ensure the new code works correctly.
      Unfortunately I got reports that it's possible to trigger the assert,
      meaning that during a device replace it's possible to have an unfinished
      chunk allocation on the source device.
      
      This is supposed to be prevented by the fact that a transaction is
      committed before finishing the replace oepration and alter acquiring the
      chunk mutex. This is not sufficient since by the time the transaction is
      committed and the chunk mutex acquired it's possible to allocate a chunk
      depending on the workload being executed on the replaced device. This
      bug has been present ever since device replace was introduced but there
      was never code which checks for it.
      
      The correct way to fix is to ensure that there is no pending device
      modification operation when the chunk mutex is acquire and if there is
      repeat transaction commit. Unfortunately it's not possible to just
      exclude the source device from btrfs_fs_devices::dev_alloc_list since
      this causes ENOSPC to be hit in transaction commit.
      
      Fixing that in another way would need to add special cases to handle the
      last writes and forbid new ones. The looped transaction fix is more
      obvious, and can be easily backported. The runtime of dev-replace is
      long so there's no noticeable delay caused by that.
      Reported-by: default avatarDavid Sterba <dsterba@suse.com>
      Fixes: 391cd9df ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      debd1c06
  2. 16 May, 2019 6 commits
    • Filipe Manana's avatar
      Btrfs: tree-checker: detect file extent items with overlapping ranges · 4e9845ef
      Filipe Manana authored
      Having file extent items with ranges that overlap each other is a
      serious issue that leads to all sorts of corruptions and crashes (like a
      BUG_ON() during the course of __btrfs_drop_extents() when it traims file
      extent items). Therefore teach the tree checker to detect such cases.
      This is motivated by a recently fixed bug (race between ranged full
      fsync and writeback or adjacent ranges).
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4e9845ef
    • Filipe Manana's avatar
      Btrfs: fix race between ranged fsync and writeback of adjacent ranges · 0c713cba
      Filipe Manana authored
      When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
      inode) that happens to be ranged, which happens during a msync() or writes
      for files opened with O_SYNC for example, we can end up with a corrupt log,
      due to different file extent items representing ranges that overlap with
      each other, or hit some assertion failures.
      
      When doing a ranged fsync we only flush delalloc and wait for ordered
      exents within that range. If while we are logging items from our inode
      ordered extents for adjacent ranges complete, we end up in a race that can
      make us insert the file extent items that overlap with others we logged
      previously and the assertion failures.
      
      For example, if tree-log.c:copy_items() receives a leaf that has the
      following file extents items, all with a length of 4K and therefore there
      is an implicit hole in the range 68K to 72K - 1:
      
        (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
      
      It copies them to the log tree. However due to the need to detect implicit
      holes, it may release the path, in order to look at the previous leaf to
      detect an implicit hole, and then later it will search again in the tree
      for the first file extent item key, with the goal of locking again the
      leaf (which might have changed due to concurrent changes to other inodes).
      
      However when it locks again the leaf containing the first key, the key
      corresponding to the extent at offset 72K may not be there anymore since
      there is an ordered extent for that range that is finishing (that is,
      somewhere in the middle of btrfs_finish_ordered_io()), and it just
      removed the file extent item but has not yet replaced it with a new file
      extent item, so the part of copy_items() that does hole detection will
      decide that there is a hole in the range starting from 68K to 76K - 1,
      and therefore insert a file extent item to represent that hole, having
      a key offset of 68K. After that we now have a log tree with 2 different
      extent items that have overlapping ranges:
      
       1) The file extent item copied before copy_items() released the path,
          which has a key offset of 72K and a length of 4K, representing the
          file range 72K to 76K - 1.
      
       2) And a file extent item representing a hole that has a key offset of
          68K and a length of 8K, representing the range 68K to 76K - 1. This
          item was inserted after releasing the path, and overlaps with the
          extent item inserted before.
      
      The overlapping extent items can cause all sorts of unpredictable and
      incorrect behaviour, either when replayed or if a fast (non full) fsync
      happens later, which can trigger a BUG_ON() when calling
      btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
      trace like the following:
      
        [61666.783269] ------------[ cut here ]------------
        [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
        [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
        (...)
        [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
        [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
        [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
        [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
        [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
        [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
        [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
        [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
        [61666.786253] FS:  00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
        [61666.786253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
        [61666.786253] Call Trace:
        [61666.786253]  __btrfs_drop_extents+0x5e3/0xaad [btrfs]
        [61666.786253]  ? time_hardirqs_on+0x9/0x14
        [61666.786253]  btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
        [61666.786253]  ? release_extent_buffer+0x38/0xb4 [btrfs]
        [61666.786253]  btrfs_log_inode+0xb6e/0xcdc [btrfs]
        [61666.786253]  ? lock_acquire+0x131/0x1c5
        [61666.786253]  ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
        [61666.786253]  btrfs_log_inode_parent+0x223/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? lockref_get_not_zero+0x2c/0x34
        [61666.786253]  ? rcu_read_unlock+0x3e/0x5d
        [61666.786253]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
        [61666.786253]  btrfs_sync_file+0x317/0x42c [btrfs]
        [61666.786253]  vfs_fsync_range+0x8c/0x9e
        [61666.786253]  SyS_msync+0x13c/0x1c9
        [61666.786253]  entry_SYSCALL_64_fastpath+0x18/0xad
      
      A sample of a corrupt log tree leaf with overlapping extents I got from
      running btrfs/072:
      
            item 14 key (295 108 200704) itemoff 2599 itemsize 53
                    extent data disk bytenr 0 nr 0
                    extent data offset 0 nr 458752 ram 458752
            item 15 key (295 108 659456) itemoff 2546 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 606208 nr 163840 ram 770048
            item 16 key (295 108 663552) itemoff 2493 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 610304 nr 155648 ram 770048
            item 17 key (295 108 819200) itemoff 2440 itemsize 53
                    extent data disk bytenr 4334788608 nr 4096
                    extent data offset 0 nr 4096 ram 4096
      
      The file extent item at offset 659456 (item 15) ends at offset 823296
      (659456 + 163840) while the next file extent item (item 16) starts at
      offset 663552.
      
      Another different problem that the race can trigger is a failure in the
      assertions at tree-log.c:copy_items(), which expect that the first file
      extent item key we found before releasing the path exists after we have
      released path and that the last key we found before releasing the path
      also exists after releasing the path:
      
        $ cat -n fs/btrfs/tree-log.c
        4080          if (need_find_last_extent) {
        4081                  /* btrfs_prev_leaf could return 1 without releasing the path */
        4082                  btrfs_release_path(src_path);
        4083                  ret = btrfs_search_slot(NULL, inode->root, &first_key,
        4084                                  src_path, 0, 0);
        4085                  if (ret < 0)
        4086                          return ret;
        4087                  ASSERT(ret == 0);
        (...)
        4103                  if (i >= btrfs_header_nritems(src_path->nodes[0])) {
        4104                          ret = btrfs_next_leaf(inode->root, src_path);
        4105                          if (ret < 0)
        4106                                  return ret;
        4107                          ASSERT(ret == 0);
        4108                          src = src_path->nodes[0];
        4109                          i = 0;
        4110                          need_find_last_extent = true;
        4111                  }
        (...)
      
      The second assertion implicitly expects that the last key before the path
      release still exists, because the surrounding while loop only stops after
      we have found that key. When this assertion fails it produces a stack like
      this:
      
        [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
        [139590.037406] ------------[ cut here ]------------
        [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
        [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G        W         5.0.0-btrfs-next-46 #1
        (...)
        [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
        (...)
        [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
        [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
        [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
        [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
        [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
        [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
        [139590.042501] FS:  00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
        [139590.042847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
        [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [139590.044250] Call Trace:
        [139590.044631]  copy_items+0xa3f/0x1000 [btrfs]
        [139590.045009]  ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
        [139590.045396]  btrfs_log_inode+0x7b3/0xd70 [btrfs]
        [139590.045773]  btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
        [139590.046143]  ? do_raw_spin_unlock+0x49/0xc0
        [139590.046510]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [139590.046872]  btrfs_sync_file+0x3b6/0x440 [btrfs]
        [139590.047243]  btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
        [139590.047592]  __vfs_write+0x129/0x1c0
        [139590.047932]  vfs_write+0xc2/0x1b0
        [139590.048270]  ksys_write+0x55/0xc0
        [139590.048608]  do_syscall_64+0x60/0x1b0
        [139590.048946]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [139590.049287] RIP: 0033:0x7f2efc4be190
        (...)
        [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
        [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
        [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
        [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
        [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
        (...)
        [139590.055128] ---[ end trace 193f35d0215cdeeb ]---
      
      So fix this race between a full ranged fsync and writeback of adjacent
      ranges by flushing all delalloc and waiting for all ordered extents to
      complete before logging the inode. This is the simplest way to solve the
      problem because currently the full fsync path does not deal with ranges
      at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
      to look at adjacent ranges for hole detection. For use cases of ranged
      fsyncs this can make a few fsyncs slower but on the other hand it can
      make some following fsyncs to other ranges do less work or no need to do
      anything at all. A full fsync is rare anyway and happens only once after
      loading/creating an inode and once after less common operations such as a
      shrinking truncate.
      
      This is an issue that exists for a long time, and was often triggered by
      generic/127, because it does mmap'ed writes and msync (which triggers a
      ranged fsync). Adding support for the tree checker to detect overlapping
      extents (next patch in the series) and trigger a WARN() when such cases
      are found, and then calling btrfs_check_leaf_full() at the end of
      btrfs_insert_file_extent() made the issue much easier to detect. Running
      btrfs/072 with that change to the tree checker and making fsstress open
      files always with O_SYNC made it much easier to trigger the issue (as
      triggering it with generic/127 is very rare).
      
      CC: stable@vger.kernel.org # 3.16+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0c713cba
    • Filipe Manana's avatar
      Btrfs: avoid fallback to transaction commit during fsync of files with holes · ebb92906
      Filipe Manana authored
      When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a
      file that has holes and has file extent items spanning two or more leafs,
      we can end up falling to back to a full transaction commit due to a logic
      bug that leads to failure to insert a duplicate file extent item that is
      meant to represent a hole between the last file extent item of a leaf and
      the first file extent item in the next leaf. The failure (EEXIST error)
      leads to a transaction commit (as most errors when logging an inode do).
      
      For example, we have the two following leafs:
      
      Leaf N:
      
        -----------------------------------------------
        | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) |
        -----------------------------------------------
        The file extent item at the end of leaf N has a length of 4Kb,
        representing the file range from 64K to 68K - 1.
      
      Leaf N + 1:
      
        -----------------------------------------------
        | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... |
        -----------------------------------------------
        The file extent item at the first slot of leaf N + 1 has a length of
        4Kb too, representing the file range from 72K to 76K - 1.
      
      During the full fsync path, when we are at tree-log.c:copy_items() with
      leaf N as a parameter, after processing the last file extent item, that
      represents the extent at offset 64K, we take a look at the first file
      extent item at the next leaf (leaf N + 1), and notice there's a 4K hole
      between the two extents, and therefore we insert a file extent item
      representing that hole, starting at file offset 68K and ending at offset
      72K - 1. However we don't update the value of *last_extent, which is used
      to represent the end offset (plus 1, non-inclusive end) of the last file
      extent item inserted in the log, so it stays with a value of 68K and not
      with a value of 72K.
      
      Then, when copy_items() is called for leaf N + 1, because the value of
      *last_extent is smaller then the offset of the first extent item in the
      leaf (68K < 72K), we look at the last file extent item in the previous
      leaf (leaf N) and see it there's a 4K gap between it and our first file
      extent item (again, 68K < 72K), so we decide to insert a file extent item
      representing the hole, starting at file offset 68K and ending at offset
      72K - 1, this insertion will fail with -EEXIST being returned from
      btrfs_insert_file_extent() because we already inserted a file extent item
      representing a hole for this offset (68K) in the previous call to
      copy_items(), when processing leaf N.
      
      The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(),
      which falls back to a full transaction commit.
      
      Fix this by adjusting *last_extent after inserting a hole when we had to
      look at the next leaf.
      
      Fixes: 4ee3fad3 ("Btrfs: fix fsync after hole punching when using no-holes feature")
      Cc: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ebb92906
    • Qu Wenruo's avatar
      btrfs: extent-tree: Fix a bug that btrfs is unable to add pinned bytes · 14ae4ec1
      Qu Wenruo authored
      Commit ddf30cf0 ("btrfs: extent-tree: Use btrfs_ref to refactor
      add_pinned_bytes()") refactored add_pinned_bytes(), but during that
      refactor, there are two callers which add the pinned bytes instead
      of subtracting.
      
      That refactor misses those two caller, causing incorrect pinned bytes
      calculation and resulting unexpected ENOSPC error.
      
      Fix it by adding a new parameter @sign to restore the original behavior.
      Reported-by: default avatarkernel test robot <rong.a.chen@intel.com>
      Fixes: ddf30cf0 ("btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      14ae4ec1
    • Tobin C. Harding's avatar
      btrfs: sysfs: don't leak memory when failing add fsid · e3277335
      Tobin C. Harding authored
      A failed call to kobject_init_and_add() must be followed by a call to
      kobject_put().  Currently in the error path when adding fs_devices we
      are missing this call.  This could be fixed by calling
      btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or
      by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid().
      Here we choose the second option because it prevents the slightly
      unusual error path handling requirements of kobject from leaking out
      into btrfs functions.
      
      Add a call to kobject_put() in the error path of kobject_add_and_init().
      This causes the release method to be called if kobject_init_and_add()
      fails.  open_tree() is the function that calls btrfs_sysfs_add_fsid()
      and the error code in this function is already written with the
      assumption that the release method is called during the error path of
      open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the
      fail_fsdev_sysfs label).
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3277335
    • Tobin C. Harding's avatar
      btrfs: sysfs: Fix error path kobject memory leak · 450ff834
      Tobin C. Harding authored
      If a call to kobject_init_and_add() fails we must call kobject_put()
      otherwise we leak memory.
      
      Calling kobject_put() when kobject_init_and_add() fails drops the
      refcount back to 0 and calls the ktype release method (which in turn
      calls the percpu destroy and kfree).
      
      Add call to kobject_put() in the error path of call to
      kobject_init_and_add().
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      450ff834
  3. 09 May, 2019 2 commits
    • Filipe Manana's avatar
      Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path · 72bd2323
      Filipe Manana authored
      Currently when we fail to COW a path at btrfs_update_root() we end up
      always aborting the transaction. However all the current callers of
      btrfs_update_root() are able to deal with errors returned from it, many do
      end up aborting the transaction themselves (directly or not, such as the
      transaction commit path), other BUG_ON() or just gracefully cancel whatever
      they were doing.
      
      When syncing the fsync log, we call btrfs_update_root() through
      tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log
      sync code does not abort the transaction, instead it gracefully handles
      the error and returns -EAGAIN to the fsync handler, so that it falls back
      to a transaction commit. Any other error different from -ENOSPC, makes the
      log sync code abort the transaction.
      
      So remove the transaction abort from btrfs_update_log() when we fail to
      COW a path to update the root item, so that if an -ENOSPC failure happens
      we avoid aborting the current transaction and have a chance of the fsync
      succeeding after falling back to a transaction commit.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413
      Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling")
      Cc: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      72bd2323
    • Josef Bacik's avatar
      btrfs: use the existing reserved items for our first prop for inheritance · d7400ee1
      Josef Bacik authored
      We're now reserving an extra items worth of space for property
      inheritance.  We only have one property at the moment so this covers us,
      but if we add more in the future this will allow us to not get bitten by
      the extra space reservation.  If we do add more properties in the future
      we should re-visit how we calculate the space reservation needs by the
      callers.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ refreshed on top of prop/xattr cleanups ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d7400ee1
  4. 03 May, 2019 2 commits
    • Josef Bacik's avatar
      btrfs: don't double unlock on error in btrfs_punch_hole · 8fca9550
      Josef Bacik authored
      If we have an error writing out a delalloc range in
      btrfs_punch_hole_lock_range we'll unlock the inode and then goto
      out_only_mutex, where we will again unlock the inode.  This is bad,
      don't do this.
      
      Fixes: f27451f2 ("Btrfs: add support for fallocate's zero range operation")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8fca9550
    • Johnny Chang's avatar
      btrfs: Check the compression level before getting a workspace · 2b90883c
      Johnny Chang authored
      When a file's compression property is set as zlib or zstd but leave
      the compression mount option not be set, that means btrfs will try
      to compress the file with default compression level. But in
      btrfs_compress_pages(), it calls get_workspace() with level = 0.
      This will return a workspace with a wrong compression level.
      For zlib, the compression level in the workspace will be 0
      (that means "store only"). And for zstd, the compression in the
      workspace will be 1, not the default level 3.
      
      How to reproduce:
        mkfs -t btrfs /dev/sdb
        mount /dev/sdb /mnt/
        mkdir /mnt/zlib
        btrfs property set /mnt/zlib/ compression zlib
        dd if=/dev/zero of=/mnt/zlib/compression-friendly-file-10M bs=1M count=10
        sync
        btrfs-debugfs -f /mnt/zlib/compression-friendly-file-10M
      
      btrfs-debugfs output:
      * before:
        ...
        (258 9961472): ram 524288 disk 1106247680 disk_size 524288
        file: ... extents 20 disk size 10485760 logical size 10485760 ratio 1.00
      
      * after:
       ...
       (258 10354688): ram 131072 disk 14217216 disk_size 4096
       file: ... extents 80 disk size 327680 logical size 10485760 ratio 32.00
      
      The steps for zstd are similar, but need to put a debugging message to
      show the level of the return workspace in zstd_get_workspace().
      
      This commit adds a check of the compression level before getting a
      workspace by set_level().
      
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: default avatarJohnny Chang <johnnyc@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2b90883c
  5. 02 May, 2019 9 commits
  6. 29 Apr, 2019 15 commits
    • Josef Bacik's avatar
      btrfs: track DIO bytes in flight · 4297ff84
      Josef Bacik authored
      When diagnosing a slowdown of generic/224 I noticed we were not doing
      anything when calling into shrink_delalloc().  This is because all
      writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes
      counter is 0, which short circuits most of the work inside of
      shrink_delalloc().  However O_DIRECT writes still consume metadata
      resources and generate ordered extents, which we can still wait on.
      
      Fix this by tracking outstanding DIO write bytes, and use this as well
      as the delalloc bytes counter to decide if we need to lookup and wait on
      any ordered extents.  If we have more DIO writes than delalloc bytes
      we'll go ahead and wait on any ordered extents regardless of our flush
      state as flushing delalloc is likely to not gain us anything.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ use dio instead of odirect in identifiers ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4297ff84
    • Anand Jain's avatar
      btrfs: merge calls of btrfs_setxattr and btrfs_setxattr_trans in btrfs_set_prop · da9b6ec8
      Anand Jain authored
      Since now the trans argument is never NULL in btrfs_set_prop we don't
      have to check. So delete it and use btrfs_setxattr that makes use of
      that.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da9b6ec8
    • Anand Jain's avatar
      btrfs: delete unused function btrfs_set_prop_trans · 717ebdc3
      Anand Jain authored
      The last consumer of btrfs_set_prop_trans() was taken away by the patch
      ("btrfs: start transaction in xattr_handler_set_prop") so now this
      function can be deleted.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      717ebdc3
    • Anand Jain's avatar
      btrfs: start transaction in xattr_handler_set_prop · b3f6a4be
      Anand Jain authored
      btrfs specific extended attributes on the inode are set using
      btrfs_xattr_handler_set_prop(), and the required transaction for this
      update is started by btrfs_setxattr(). For better visibility of the
      transaction start and end, do this in btrfs_xattr_handler_set_prop().
      For which this patch copied code of btrfs_setxattr() as it is in the
      original, which needs proper error handling.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3f6a4be
    • Anand Jain's avatar
      btrfs: drop local copy of inode i_mode · 44e5194b
      Anand Jain authored
      There isn't real use of making struct inode::i_mode a local copy, it
      saves a dereference one time, not much. Just use it directly.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      44e5194b
    • Anand Jain's avatar
      btrfs: drop old_fsflags in btrfs_ioctl_setflags · 3c8d8b63
      Anand Jain authored
      btrfs_inode_flags_to_fsflags() is copied into @old_fsflags and used only
      once. Instead used it directly.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c8d8b63
    • Anand Jain's avatar
      btrfs: modify local copy of btrfs_inode flags · d2b8fcfe
      Anand Jain authored
      Instead of updating the binode::flags directly, update a local copy, and
      then at the point of no error, store copy it to the binode::flags.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d2b8fcfe
    • Anand Jain's avatar
      btrfs: drop useless inode i_flags copy and restore · 11d3cd5c
      Anand Jain authored
      The patch ("btrfs: start transaction in btrfs_ioctl_setflags()") used
      btrfs_set_prop() instead of btrfs_set_prop_trans() by which now the
      inode::i_flags update functions such as
      btrfs_sync_inode_flags_to_i_flags() and btrfs_update_inode() is called
      in btrfs_ioctl_setflags() instead of
      btrfs_set_prop_trans()->btrfs_setxattr() as earlier. So the
      inode::i_flags remains unmodified until the thread has checked all the
      conditions. So drop the saved inode::i_flags in out_i_flags.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      11d3cd5c
    • Anand Jain's avatar
      btrfs: start transaction in btrfs_ioctl_setflags() · ff9fef55
      Anand Jain authored
      Inode attribute can be set through the FS_IOC_SETFLAGS ioctl.  This
      flags also includes compression attribute for which we would set/reset
      the compression extended attribute. While doing this there is a bit of
      duplicate code, the following things happens twice:
      
      - start/end_transaction
      - inode_inc_iversion()
      - current_time update to inode->i_ctime
      - and btrfs_update_inode()
      
      These are updated both at btrfs_ioctl_setflags() and btrfs_set_props()
      as well.  This patch merges these two duplicate codes at
      btrfs_ioctl_setflags().
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ff9fef55
    • Anand Jain's avatar
      btrfs: export btrfs_set_prop · cd31af15
      Anand Jain authored
      Make btrfs_set_prop() a non-static function, so that it can be called
      from btrfs_ioctl_setflags(). We need btrfs_set_prop() instead of
      btrfs_set_prop_trans() so that we can use the transaction which is
      already started in the current thread.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd31af15
    • Anand Jain's avatar
      btrfs: refactor btrfs_set_props to validate externally · f22125e5
      Anand Jain authored
      In preparation to merge multiple transactions when setting the
      compression flags, split btrfs_set_props() validation part outside of
      it.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f22125e5
    • Qu Wenruo's avatar
      btrfs: ctree: Dump the leaf before BUG_ON in btrfs_set_item_key_safe · 7c15d410
      Qu Wenruo authored
      We have a long standing problem with reversed keys that's detected by
      btrfs_set_item_key_safe. This is hard to reproduce so we'd like to
      capture more information for later analysis.
      
      Let's dump the leaf content before triggering BUG_ON() so that we can
      have some clue on what's going wrong.  The output of tree locks should
      help us to debug such problem.
      
      Sample stacktrace:
      
       generic/522             [00:07:05]
       [26946.113381] run fstests generic/522 at 2019-04-16 00:07:05
       [27161.474720] kernel BUG at fs/btrfs/ctree.c:3192!
       [27161.475923] invalid opcode: 0000 [#1] PREEMPT SMP
       [27161.477167] CPU: 0 PID: 15676 Comm: fsx Tainted: G        W         5.1.0-rc5-default+ #562
       [27161.478932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
       [27161.481099] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs]
       [27161.485369] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286
       [27161.486464] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000
       [27161.487929] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7
       [27161.489289] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000
       [27161.490807] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70
       [27161.492305] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000
       [27161.493845] FS:  00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000
       [27161.495608] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [27161.496717] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0
       [27161.498100] Call Trace:
       [27161.498771]  __btrfs_drop_extents+0x6ec/0xdf0 [btrfs]
       [27161.499872]  btrfs_log_changed_extents.isra.26+0x3a2/0x9e0 [btrfs]
       [27161.501114]  btrfs_log_inode+0x7ff/0xdc0 [btrfs]
       [27161.502114]  ? __mutex_unlock_slowpath+0x4b/0x2b0
       [27161.503172]  btrfs_log_inode_parent+0x237/0x9c0 [btrfs]
       [27161.504348]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
       [27161.505374]  btrfs_sync_file+0x1b7/0x480 [btrfs]
       [27161.506371]  __x64_sys_msync+0x180/0x210
       [27161.507208]  do_syscall_64+0x54/0x180
       [27161.507932]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [27161.508839] RIP: 0033:0x7f8ea5aa9c61
       [27161.512616] RSP: 002b:00007ffea2a06498 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
       [27161.514161] RAX: ffffffffffffffda RBX: 000000000002a938 RCX: 00007f8ea5aa9c61
       [27161.515376] RDX: 0000000000000004 RSI: 000000000001c9b2 RDI: 00007f8ea578d000
       [27161.516572] RBP: 000000000001c07a R08: fffffffffffffff8 R09: 000000000002a000
       [27161.517883] R10: 00007f8ea57a99b2 R11: 0000000000000246 R12: 0000000000000938
       [27161.519080] R13: 00007f8ea578d000 R14: 000000000001c9b2 R15: 0000000000000000
       [27161.520281] Modules linked in: btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq loop [last unloaded: scsi_debug]
       [27161.522272] ---[ end trace d5afec7ccac6a252 ]---
       [27161.523111] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs]
       [27161.527253] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286
       [27161.528192] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000
       [27161.529392] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7
       [27161.530607] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000
       [27161.531802] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70
       [27161.533018] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000
       [27161.534405] FS:  00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000
       [27161.536048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [27161.537210] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7c15d410
    • Qu Wenruo's avatar
      btrfs: tree-checker: Allow error injection for tree-checker · 02529d7a
      Qu Wenruo authored
      Allowing error injection for btrfs_check_leaf_full() and
      btrfs_check_node() is useful to test the failure path of btrfs write
      time tree check.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      02529d7a
    • Nikolay Borisov's avatar
      btrfs: Document btrfs_csum_one_bio · 51d470ae
      Nikolay Borisov authored
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      51d470ae
    • Filipe Manana's avatar
      Btrfs: improve performance on fsync of files with multiple hardlinks · b8aa330d
      Filipe Manana authored
      Commit 41bd6067 ("Btrfs: fix fsync of files with multiple hard links
      in new directories") introduced a path that makes fsync fallback to a full
      transaction commit in order to avoid losing hard links and new ancestors
      of the fsynced inode. That path is triggered only when the inode has more
      than one hard link and either has a new hard link created in the current
      transaction or the inode was evicted and reloaded in the current
      transaction.
      
      That path ends up getting triggered very often (hundreds of times) during
      the course of pgbench benchmarks, resulting in performance drops of about
      20%.
      
      This change restores the performance by not triggering the full transaction
      commit in those cases, and instead iterate the fs/subvolume tree in search
      of all possible new ancestors, for all hard links, to log them.
      Reported-by: default avatarZhao Yuhu <zyuhu@suse.com>
      Tested-by: default avatarJames Wang <jnwang@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b8aa330d