1. 28 May, 2020 7 commits
  2. 25 May, 2020 33 commits
    • iomap: remove lockdep_assert_held() · 3ad99bec
      Goldwyn Rodrigues authored
      Filesystems such as btrfs can perform direct I/O without holding
      inode->i_rwsem in some cases, such as writes within i_size.  So,
      remove the lockdep_assert_held() check in iomap_dio_rw().
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • iomap: add a filesystem hook for direct I/O bio submission · 8cecd0ba
      Goldwyn Rodrigues authored
      This helps filesystems to perform tasks on the bio while submitting for
      I/O. This could be post-write operations such as data CRC or data
      replication for fs-handled RAID.
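
      A minimal user-space sketch of the hook pattern described above, not the
      kernel interface itself. All names (demo_bio, demo_dio_ops, submit_io and
      the helper functions) are hypothetical, and the "checksum" is a
      placeholder computation:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio; only what the sketch needs. */
struct demo_bio {
	unsigned int len;
	unsigned int csum;	/* filled in by the filesystem hook, if any */
	int submitted;		/* counts submissions to the simulated block layer */
};

/* Hypothetical ops table; a NULL submit_io means "use the default path". */
struct demo_dio_ops {
	int (*submit_io)(struct demo_bio *bio);
};

/* Default path: hand the bio straight to the simulated block layer. */
int demo_submit_bio(struct demo_bio *bio)
{
	assert(bio != NULL);
	bio->submitted++;
	return 0;
}

/* Example filesystem hook: do per-bio work first, then submit. */
int demo_fs_submit_io(struct demo_bio *bio)
{
	bio->csum = bio->len * 31u;	/* placeholder for a real data checksum */
	return demo_submit_bio(bio);
}

/* Generic code: call the hook only when the filesystem provides one. */
int demo_dio_submit(const struct demo_dio_ops *ops, struct demo_bio *bio)
{
	if (ops && ops->submit_io)
		return ops->submit_io(bio);
	return demo_submit_bio(bio);
}
```

      Passing NULL ops (or an ops table without submit_io) takes the default
      path, so filesystems that do not need the hook are unaffected.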
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • fs: export generic_file_buffered_read() · d85dc2e1
      Goldwyn Rodrigues authored
      Export generic_file_buffered_read() to be used to supplement incomplete
      direct reads.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: turn space cache writeout failure messages into debug messages · bbcd1f4d
      Filipe Manana authored
      Since commit 1afb648e ("btrfs: use standard debug config option to
      enable free-space-cache debug prints"), we started to log error messages
      that were never logged before since there was no DEBUG macro defined
      anywhere. This made test case btrfs/187 fail very often,
      as it greps for any btrfs error messages in dmesg/syslog and fails if
      any is found:
      
      (...)
      btrfs/186 1s ...  2s
      btrfs/187       - output mismatch (see .../results//btrfs/187.out.bad)
          --- tests/btrfs/187.out     2019-05-17 12:48:32.537340749 +0100
          +++ /home/fdmanana/git/hub/xfstests/results//btrfs/187.out.bad ...
          @@ -1,3 +1,8 @@
           QA output created by 187
           Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/snap1'
           Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/snap2'
          +[268364.139958] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          +[268380.156503] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          +[268380.161703] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          +[268380.253180] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          ...
          (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/187.out ...
      btrfs/188 4s ...  2s
      (...)
      
      The space cache write failures happen due to ENOSPC when attempting to
      update the free space cache items in the root tree. This happens because
      when starting or joining a transaction we don't know how many block
      groups we will end up changing (due to extent allocation or release) and
      therefore never reserve space for updating free space cache items.
      More often than not, the free space cache writeout succeeds since the
      metadata space info is not yet full nor very close to being full, but
      when it is, the space cache writeout fails with ENOSPC.
      
      Occasional failures to write space caches are not considered critical
      since they can be rebuilt when mounting the filesystem or the next
      attempt to write a free space cache in the next transaction commit might
      succeed, so we used to hide those error messages with a preprocessor
      check for the existence of the DEBUG macro that was never enabled
      anywhere.
      
      A few other generic test cases also trigger the error messages due to
      ENOSPC failure when writing free space caches as well, however they don't
      fail since they don't grep dmesg/syslog for any btrfs specific error
      messages.
      
      So change the messages from 'error' level to 'debug' level, as it doesn't
      make much sense to have error messages triggered only if the debug macro
      is enabled plus, more importantly, the error is not serious nor highly
      unexpected.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: include error on messages about failure to write space/inode caches · 2e69a7a6
      Filipe Manana authored
      Currently the error messages logged when we fail to write a free space
      cache or an inode cache are not very useful as they don't say what
      the error was. So include the error number in the messages.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks() · 918cdf44
      Filipe Manana authored
      The label 'fail_unlock' is pointless, all it does is to jump to the label
      'out', so just remove it.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums · 7e4a3f7e
      Filipe Manana authored
      We are currently treating any non-zero return value from btrfs_next_leaf()
      the same way, by going to the code that inserts a new checksum item in the
      tree. However if btrfs_next_leaf() returns an error (a value < 0), we
      should just stop and return the error, and not behave as if nothing has
      happened, since in that case we do not have a way to know if there is a
      next leaf or we are currently at the last leaf already.
      
      So fix that by returning the error from btrfs_next_leaf().
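
      The three-way handling of the return value can be sketched as below. This
      is an illustration under the usual convention that a negative value is an
      error, 0 means a next leaf was found and a positive value means there is
      no next leaf; the enum and function names are hypothetical, not the
      kernel code:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum demo_action {
	DEMO_CHECK_NEXT_LEAF,	/* a next leaf exists, inspect its first item */
	DEMO_INSERT_NEW_ITEM,	/* already at the last leaf, insert a new item */
};

/*
 * ret < 0: error, ret == 0: next leaf found, ret > 0: no next leaf.
 * Returns 0 and sets *action, or returns the negative error unchanged.
 */
int demo_handle_next_leaf(int ret, enum demo_action *action)
{
	assert(action != NULL);
	if (ret < 0)
		return ret;	/* propagate the error instead of guessing */
	*action = (ret > 0) ? DEMO_INSERT_NEW_ITEM : DEMO_CHECK_NEXT_LEAF;
	return 0;
}
```

      The point is only that the error case gets its own branch and never
      reaches the insertion path.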
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make checksum item extension more efficient · cc14600c
      Filipe Manana authored
      When we want to add checksums into the checksums tree, or a log tree, we
      try whenever possible to extend existing checksum items, as this helps
      reduce the amount of metadata space used, since adding a new item uses extra
      metadata space for a btrfs_item structure (25 bytes).
      
      However we have two inefficiencies in the current approach:
      
      1) After finding a checksum item that covers a range with an end offset
         that matches the start offset of the checksum range we want to insert,
         we release the search path populated by btrfs_lookup_csum() and then
         do another COW search on the tree with the goal of getting additional
         space for at least one checksum. Doing this path release and then
         searching again is a waste of time because very often the leaf already
         has enough free space for at least one more checksum;
      
      2) After the COW search that guarantees we get free space in the leaf for
         at least one more checksum, we end up not doing the extension of the
         previous checksum item, and fall back to insertion of a new checksum
         item, if the leaf doesn't have an amount of free space larger than the
         space required for 2 checksums plus one btrfs_item structure - this is
         pointless for two reasons:
      
         a) We want to extend an existing item, so we don't need to account for
            a btrfs_item structure (25 bytes);
      
         b) We made the COW search with an insertion size for 1 single checksum,
            so if the leaf ends up with a free space amount smaller than 2
            checksums plus the size of a btrfs_item structure, we give up on the
            extension of the existing item and jump to the 'insert' label, where
            we end up releasing the path and then doing yet another search to
            insert a new checksum item for a single checksum.
      
      Fix these inefficiencies by doing the following:
      
      - For case 1), before releasing the path just check if the leaf already
        has enough space for at least 1 more checksum, and if it does, jump
        directly to the item extension code, without releasing our current path,
        which was already COWed by btrfs_lookup_csum();
      
      - For case 2), fix the logic so that for item extension we require only
        that the leaf has enough free space for 1 checksum, and not a minimum
        of 2 checksums plus space for a btrfs_item structure.
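
      The difference between the old and new free space requirement for item
      extension can be sketched as follows. The function names are hypothetical
      and the 25 byte header size is taken from the text above; this models
      only the comparison, not the tree search:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ITEM_HEADER_SIZE 25u	/* size of a btrfs_item, per the text above */

/*
 * Old rule as described above: give up on extension when the leaf has less
 * free space than 2 checksums plus one item header.
 */
bool demo_can_extend_old(size_t leaf_free, size_t csum_size)
{
	assert(csum_size > 0);
	return leaf_free >= 2 * csum_size + DEMO_ITEM_HEADER_SIZE;
}

/* New rule: room for one more checksum is enough to extend an item. */
bool demo_can_extend_new(size_t leaf_free, size_t csum_size)
{
	assert(csum_size > 0);
	return leaf_free >= csum_size;
}
```

      With 4 byte checksums and 20 bytes free in the leaf, the old rule rejects
      the extension (20 < 33) while the new rule allows it.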
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents · e289f03e
      Filipe Manana authored
      When we have extents shared amongst different inodes in the same subvolume,
      if we fsync them in parallel we can end up with checksum items in the log
      tree that represent ranges which overlap.
      
      For example, consider we have inodes A and B, both sharing an extent that
      covers the logical range from X to X + 64KiB:
      
      1) Task A starts an fsync on inode A;
      
      2) Task B starts an fsync on inode B;
      
      3) Task A calls btrfs_csum_file_blocks(), and the first search in the
         log tree, through btrfs_lookup_csum(), returns -EFBIG because it
         finds an existing checksum item that covers the range from X - 64KiB
         to X;
      
      4) Task A checks that the checksum item has not reached the maximum
         possible size (MAX_CSUM_ITEMS) and then releases the search path
         before it does another path search for insertion (through a direct
         call to btrfs_search_slot());
      
      5) As soon as task A releases the path and before it does the search
         for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
         too, because there is an existing checksum item that has an end
         offset that matches the start offset (X) of the checksum range we want
         to log;
      
      6) Task B releases the path;
      
      7) Task A does the path search for insertion (through btrfs_search_slot())
         and then verifies that the checksum item that ends at offset X still
         exists and extends its size to insert the checksums for the range from
         X to X + 64KiB;
      
      8) Task A releases the path and returns from btrfs_csum_file_blocks(),
         having inserted the checksums into an existing checksum item that got
         its size extended. At this point we have one checksum item in the log
         tree that covers the logical range from X - 64KiB to X + 64KiB;
      
      9) Task B now does a search for insertion using btrfs_search_slot() too,
         but it finds that the previous checksum item no longer ends at the
         offset X, it now ends at offset X + 64KiB, so it leaves that item
         untouched.
      
         Then it releases the path and calls btrfs_insert_empty_item()
         that inserts a checksum item with a key offset corresponding to X and
         a size for inserting a single checksum (4 bytes in case of crc32c).
         Subsequent iterations end up extending this new checksum item so that
         it contains the checksums for the range from X to X + 64KiB.
      
         So after task B returns from btrfs_csum_file_blocks() we end up with
         two checksum items in the log tree that have overlapping ranges, one
         for the range from X - 64KiB to X + 64KiB, and another for the range
         from X to X + 64KiB.
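
      The end state of the steps above can be checked with a small overlap test.
      This is a simplified sketch: the struct and function names are
      hypothetical, and a 4KiB sector size is assumed alongside the 4 byte
      crc32c checksum mentioned above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a checksum item. */
struct demo_csum_item {
	uint64_t start;		/* key offset: logical address of the first block */
	uint32_t item_size;	/* bytes of checksum data stored in the item */
};

/* Exclusive end of the logical range covered by the item. */
uint64_t demo_csum_end(const struct demo_csum_item *it,
		       uint32_t csum_size, uint32_t sectorsize)
{
	assert(csum_size > 0 && it->item_size % csum_size == 0);
	return it->start + (uint64_t)(it->item_size / csum_size) * sectorsize;
}

/* Two items overlap when each one starts before the other one ends. */
bool demo_csum_overlap(const struct demo_csum_item *a,
		       const struct demo_csum_item *b,
		       uint32_t csum_size, uint32_t sectorsize)
{
	return a->start < demo_csum_end(b, csum_size, sectorsize) &&
	       b->start < demo_csum_end(a, csum_size, sectorsize);
}
```

      Taking X = 1MiB as an arbitrary example, the item extended by task A
      starts at X - 64KiB with 128 bytes of checksums (32 blocks), and the item
      inserted by task B starts at X with 64 bytes (16 blocks), so the two
      ranges overlap on X to X + 64KiB.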
      
      Having checksum items that represent ranges which overlap, regardless of
      being in the log tree or in the checksums tree, can lead to problems where
      checksums for a file range end up not being found. This type of problem
      has happened a few times in the past and the following commits fixed them
      and explain in detail why having checksum items with overlapping ranges is
      problematic:
      
        27b9a812 "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
        b84b8390 "Btrfs: fix file read corruption after extent cloning and fsync"
        40e046ac "Btrfs: fix missing data checksums after replaying a log tree"
      
      Since this specific instance of the problem can only happen when logging
      inodes, because it is the only case where concurrent attempts to insert
      checksums for the same range can happen, fix the issue by using an extent
      io tree as a range lock to serialize checksum insertion during inode
      logging.
      
      This issue could often be reproduced by the test case generic/457 from
      fstests. When it happens it produces the following trace:
      
       BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
       BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
       BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
            item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
            item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
            item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
            item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
            item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
            item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
       (...)
       BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
       Modules linked in: btrfs dm_thin_pool ...
       CPU: 1 PID: 15884 Comm: fsx Tainted: G        W         5.6.0-rc7-btrfs-next-58 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
       Code: c7 c7 ...
       RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
       RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
       RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
       R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
       R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
       FS:  00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btree_submit_bio_hook+0x67/0xc0 [btrfs]
        submit_one_bio+0x31/0x50 [btrfs]
        btree_write_cache_pages+0x2db/0x4b0 [btrfs]
        ? __filemap_fdatawrite_range+0xb1/0x110
        do_writepages+0x23/0x80
        __filemap_fdatawrite_range+0xd2/0x110
        btrfs_write_marked_extents+0x15e/0x180 [btrfs]
        btrfs_sync_log+0x206/0x10a0 [btrfs]
        ? kmem_cache_free+0x315/0x3b0
        ? btrfs_log_inode+0x1e8/0xf90 [btrfs]
        ? __mutex_unlock_slowpath+0x45/0x2a0
        ? lockref_put_or_lock+0x9/0x30
        ? dput+0x2d/0x580
        ? dput+0xb5/0x580
        ? btrfs_sync_file+0x464/0x4d0 [btrfs]
        btrfs_sync_file+0x464/0x4d0 [btrfs]
        do_fsync+0x38/0x60
        __x64_sys_fsync+0x10/0x20
        do_syscall_64+0x5c/0x280
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7fb41953a6d0
       Code: 48 3d ...
       RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
       RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
       RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
       RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
       R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
       R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace d543fc76f5ad7fd8 ]---
      
      In that trace the tree checker detected the overlapping checksum items at
      the time when we triggered writeback for the log tree when syncing the
      log.
      
      Another trace that can happen is due to BUG_ON() when deleting checksum
      items while logging an inode:
      
       BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
       BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
       BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
        item 0 key (257 1 0) itemoff 16123 itemsize 160
                inode generation 7 size 262144 mode 100600
        item 1 key (257 12 256) itemoff 16103 itemsize 20
        item 2 key (257 108 0) itemoff 16050 itemsize 53
                extent data disk bytenr 13631488 nr 4096
                extent data offset 0 nr 131072 ram 131072
       (...)
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ctree.c:3153!
       invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
       CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
       Code: 0f b6 ...
       RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
       RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
       RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
       R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
       R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
       FS:  00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_del_csums+0x2f4/0x540 [btrfs]
        copy_items+0x4b5/0x560 [btrfs]
        btrfs_log_inode+0x910/0xf90 [btrfs]
        btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
        ? dget_parent+0x5/0x370
        btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        btrfs_sync_file+0x42b/0x4d0 [btrfs]
        __x64_sys_msync+0x199/0x200
        do_syscall_64+0x5c/0x280
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7fe586c65760
       Code: 00 f7 ...
       RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
       RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
       RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
       RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
       R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
       R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
       Modules linked in: dm_log_writes ...
       ---[ end trace c92a7f447a8515f5 ]---
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: unexport btrfs_compress_set_level() · adbab642
      Anand Jain authored
      btrfs_compress_set_level() can be a static function in the file
      compression.c.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify iget helpers · 0202e83f
      David Sterba authored
      The inode lookup starting at btrfs_iget takes the full location key,
      while only the objectid is used to match the inode, because the lookup
      happens inside the given root thus the inode number is unique.
      The entire location key is properly set up in btrfs_init_locked_inode.
      
      Simplify the helpers and pass only inode number, renaming it to 'ino'
      instead of 'objectid'. This allows removing the temporary key variables,
      saving some stack space.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: open code read_fs_root · a820feb5
      David Sterba authored
      After the update to btrfs_get_fs_root, read_fs_root has become a trivial
      wrapper that can be open coded.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify root lookup by id · 56e9357a
      David Sterba authored
      The main function to look up a root by its id, btrfs_get_fs_root, takes the
      whole key, while only using the objectid. The value of offset is preset
      to (u64)-1 but not actually used until btrfs_find_root that does the
      actual search.
      
      Switch btrfs_get_fs_root to use only objectid and remove all local
      variables that existed just for the lookup. The actual key for search is
      set up in btrfs_get_fs_root, reusing another key variable.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reloc: clear DEAD_RELOC_TREE bit for orphan roots to prevent runaway balance · 1dae7e0e
      Qu Wenruo authored
      [BUG]
      There are several reports of runaway balance, where balance floods the
      log with "found X extents" and X never changes.
      
      [CAUSE]
      Commit d2311e69 ("btrfs: relocation: Delay reloc tree deletion after
      merge_reloc_roots") introduced BTRFS_ROOT_DEAD_RELOC_TREE bit to
      indicate that one subvolume has finished its tree blocks swap with its
      reloc tree.
      
      However if balance is canceled or hits ENOSPC halfway, we didn't clear
      the BTRFS_ROOT_DEAD_RELOC_TREE bit, leaving that bit hanging forever
      until unmount.
      
      Any subvolume root with that bit would cause the backref cache to skip this
      tree block, as it has finished its tree block swap.  This would cause
      all tree blocks of that root to be ignored by balance, leading to runaway
      balance.
      
      [FIX]
      Fix the problem by also clearing the BTRFS_ROOT_DEAD_RELOC_TREE bit for
      the original subvolume of orphan reloc root.
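
      A toy model of the stale bit, assuming a plain bitmask for the root state.
      The bit position and all names are hypothetical; the kernel uses its own
      bit helpers on root->state:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ROOT_DEAD_RELOC_TREE (1UL << 0)	/* hypothetical bit position */

struct demo_root {
	unsigned long state;
};

/* Balance skips tree blocks of roots that still carry the bit. */
bool demo_balance_skips_root(const struct demo_root *root)
{
	assert(root != NULL);
	return (root->state & DEMO_ROOT_DEAD_RELOC_TREE) != 0;
}

/*
 * The fix in miniature: cleaning up an orphan reloc root also clears the
 * bit on the subvolume root it belonged to.
 */
void demo_cleanup_orphan_reloc_root(struct demo_root *subvol)
{
	assert(subvol != NULL);
	subvol->state &= ~DEMO_ROOT_DEAD_RELOC_TREE;
}
```

      Without the clearing step, a root that was marked before balance was
      canceled keeps being skipped by every later balance until unmount.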
      
      Add a check at umount for the stale bit still being set.
      
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reloc: fix reloc root leak and NULL pointer dereference · 51415b6c
      Qu Wenruo authored
      [BUG]
      When balance is canceled, there is a pretty high chance that unmounting
      the fs leads to a NULL pointer dereference:
      
        BTRFS warning (device dm-3): page private not zero on page 223158272
        ...
        BTRFS warning (device dm-3): page private not zero on page 223162368
        BTRFS error (device dm-3): leaked root 18446744073709551608-304 refcount 1
        BUG: kernel NULL pointer dereference, address: 0000000000000168
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 2 PID: 5793 Comm: umount Tainted: G           O      5.7.0-rc5-custom+ #53
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:__lock_acquire+0x5dc/0x24c0
        Call Trace:
         lock_acquire+0xab/0x390
         _raw_spin_lock+0x39/0x80
         btrfs_release_extent_buffer_pages+0xd7/0x200 [btrfs]
         release_extent_buffer+0xb2/0x170 [btrfs]
         free_extent_buffer+0x66/0xb0 [btrfs]
         btrfs_put_root+0x8e/0x130 [btrfs]
         btrfs_check_leaked_roots.cold+0x5/0x5d [btrfs]
         btrfs_free_fs_info+0xe5/0x120 [btrfs]
         btrfs_kill_super+0x1f/0x30 [btrfs]
         deactivate_locked_super+0x3b/0x80
         deactivate_super+0x3e/0x50
         cleanup_mnt+0x109/0x160
         __cleanup_mnt+0x12/0x20
         task_work_run+0x67/0xa0
         exit_to_usermode_loop+0xc5/0xd0
         syscall_return_slowpath+0x205/0x360
         do_syscall_64+0x6e/0xb0
         entry_SYSCALL_64_after_hwframe+0x49/0xb3
        RIP: 0033:0x7fd028ef740b
      
      [CAUSE]
      When balance is canceled, all reloc roots are marked as orphan, and
      orphan reloc roots are going to be cleaned up.
      
      However, the lifespans of orphan reloc roots and merged reloc roots
      are quite different:
      
      	Merged reloc roots	|	Orphan reloc roots by cancel
      --------------------------------------------------------------------
      create_reloc_root()		| create_reloc_root()
      |- refs == 1			| |- refs == 1
      				|
      btrfs_grab_root(reloc_root);	| btrfs_grab_root(reloc_root);
      |- refs == 2			| |- refs == 2
      				|
      root->reloc_root = reloc_root;	| root->reloc_root = reloc_root;
      		>>> No difference so far <<<
      				|
      prepare_to_merge()		| prepare_to_merge()
      |- btrfs_set_root_refs(item, 1);| |- if (!err) (err == -EINTR)
      				|
      merge_reloc_roots()		| merge_reloc_roots()
      |- merge_reloc_root()		| |- Doing nothing to put reloc root
         |- insert_dirty_subvol()	| |- refs == 2
            |- __del_reloc_root()	|
               |- btrfs_put_root()	|
                  |- refs == 1	|
      		>>> Now orphan reloc roots still have refs 2 <<<
      				|
      clean_dirty_subvols()		| clean_dirty_subvols()
      |- btrfs_drop_snapshot()	| |- btrfs_drop_snapshot()
         |- reloc_root get freed	|    |- reloc_root still has refs 2
      				|	related ebs get freed, but
      				|	reloc_root still recorded in
      				|	allocated_roots
      btrfs_check_leaked_roots()	| btrfs_check_leaked_roots()
      |- No leaked roots		| |- Leaked reloc_roots detected
      				| |- btrfs_put_root()
      				|    |- free_extent_buffer(root->node);
      				|       |- eb already freed, caused NULL
      				|	   pointer dereference
      
      [FIX]
      The fix is to clear fs_root->reloc_root and put it at
      merge_reloc_roots() time, so that we won't leak reloc roots.
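
      The reference counts in the table can be replayed with a toy model. This
      is only a sketch of the counting, with hypothetical names; it does not
      model extent buffers or the real reloc root structures:

```c
#include <assert.h>

struct demo_reloc_root {
	int refs;
};

void demo_grab(struct demo_reloc_root *r)
{
	r->refs++;
}

void demo_put(struct demo_reloc_root *r)
{
	assert(r->refs > 0);
	r->refs--;
}

/*
 * Replays the table up to clean_dirty_subvols() and returns the reference
 * count at that point. A result of 1 means the final drop frees the root,
 * a result of 2 means one reference is left over and the root leaks.
 */
int demo_refs_before_drop(int canceled, int with_fix)
{
	struct demo_reloc_root r = { .refs = 1 };	/* create_reloc_root() */

	demo_grab(&r);		/* reference held via root->reloc_root */
	if (!canceled || with_fix)
		demo_put(&r);	/* merge path, or the fix for the cancel path */
	return r.refs;
}
```

      A canceled balance without the fix leaves 2 references, matching the
      "leaked root ... refcount 1" message once the drop takes one of them.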
      
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      CC: stable@vger.kernel.org # 5.1+
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reduce lock contention when creating snapshot · c11fbb6e
      Robbie Ko authored
      When creating a snapshot, ordered extents need to be flushed and this
      can take a long time.
      
      In create_snapshot there are two locks held when this happens:
      
        1. Destination directory inode lock
        2. Global subvolume semaphore
      
      This will unnecessarily block other operations like subvolume destroy,
      create, or setflag until the snapshot is created.
      
      We can fix that by moving the flush outside the locked section as this
      does not depend on the aforementioned locks.  The code factors out the
      snapshot related work from create_snapshot to btrfs_mksnapshot.
      
      __btrfs_ioctl_snap_create
        btrfs_mksubvol
          create_subvol
        btrfs_mksnapshot
          <flush>
          btrfs_mksubvol
            create_snapshot
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't set SHAREABLE flag for data reloc tree · aeb935a4
      Qu Wenruo authored
      The SHAREABLE flag is set for subvolumes because users can create snapshots
      of subvolumes, which then share tree blocks with them.
      
      But data reloc tree is not exposed to user space, as it's only an
      internal tree for data relocation, thus it doesn't need the full path
      replacement handling at all.
      
      This patch will make data reloc tree a non-shareable tree, and add
      btrfs_fs_info::data_reloc_root for data reloc tree, so relocation code
      can grab it from fs_info directly.
      
      This would slightly improve tree relocation, as the data reloc tree can
      now go through the regular COW routine to get relocated, without involving
      the complex tree reloc tree routine.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: inode: cleanup the log-tree exceptions in btrfs_truncate_inode_items() · 82028e0a
      Qu Wenruo authored
      There are a lot of root owner checks in btrfs_truncate_inode_items()
      like:
      
      	if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) ||
      	    root == fs_info->tree_root)
      
      But consider that only these trees can have INODE_ITEMs:
      
      - tree root (for v1 space cache)
      - subvolume trees
      - tree reloc trees
      - data reloc tree
      - log trees
      
      Since subvolume/tree reloc/data reloc trees all have the SHAREABLE bit,
      and we're checking tree root manually, the above check is just
      excluding log trees.
      
      This patch will replace two such checks with a simpler one:
      
      	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
      
      This would merge the btrfs_drop_extent_cache() and lock_extent_bits()
      calls into the same if branch.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      82028e0a
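
The equivalence argued in this commit message can be checked with a small standalone sketch. Everything here is a hypothetical stand-in (`sketch_root`, `sketch_kind`, and the placeholder `SKETCH_TREE_LOG_OBJECTID` value are not the kernel definitions); the flags follow the message as written, with subvolume, tree reloc and data reloc trees marked shareable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the five tree kinds that can hold INODE_ITEMs. */
enum sketch_kind {
	K_TREE_ROOT, K_SUBVOLUME, K_TREE_RELOC, K_DATA_RELOC, K_LOG, K_COUNT
};

struct sketch_root {
	bool shareable;     /* models BTRFS_ROOT_SHAREABLE in root->state */
	bool is_tree_root;  /* models root == fs_info->tree_root */
	uint64_t objectid;  /* models root->root_key.objectid */
};

/* Placeholder value for the sketch, not the kernel constant. */
#define SKETCH_TREE_LOG_OBJECTID 1000u

static inline struct sketch_root sketch_make_root(enum sketch_kind kind)
{
	struct sketch_root r = { false, false, (uint64_t)kind };

	assert(kind < K_COUNT);
	switch (kind) {
	case K_TREE_ROOT:
		r.is_tree_root = true;
		break;
	case K_SUBVOLUME:
	case K_TREE_RELOC:
	case K_DATA_RELOC:
		r.shareable = true;
		break;
	case K_LOG:
		r.objectid = SKETCH_TREE_LOG_OBJECTID;
		break;
	default:
		break;
	}
	return r;
}

/* The check being replaced: shareable root or the tree root. */
static inline bool sketch_old_check(const struct sketch_root *r)
{
	return r->shareable || r->is_tree_root;
}

/* The simpler replacement: anything that is not a log tree. */
static inline bool sketch_new_check(const struct sketch_root *r)
{
	return r->objectid != SKETCH_TREE_LOG_OBJECTID;
}

/* True when both checks give the same answer for every modelled kind. */
static inline bool sketch_checks_agree(void)
{
	for (int k = 0; k < K_COUNT; k++) {
		struct sketch_root r = sketch_make_root((enum sketch_kind)k);

		if (sketch_old_check(&r) != sketch_new_check(&r))
			return false;
	}
	return true;
}
```

The two checks only agree because the set of trees is closed; a tree kind outside the five listed would need to be re-examined.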
    • Qu Wenruo's avatar
      btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE · 92a7cc42
      Qu Wenruo authored
      The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.
      
      In fact, that bit can only be set on these trees:
      
      - Subvolume roots
      - Data reloc root
      - Reloc roots for above roots
      
      All other trees won't get this bit set.  So just by the result, it is
      obvious that roots with this bit set can have tree blocks shared with
      other trees, either by snapshots or by reloc roots (a special
      snapshot created by relocation).
      
      This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
      make it easier to understand, and update all comments mentioning
      "reference counted" to follow the rename.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      92a7cc42
    • Anand Jain's avatar
      btrfs: drop stale reference to volume_mutex · ae3e715f
      Anand Jain authored
      Commit dccdb07b ("btrfs: kill btrfs_fs_info::volume_mutex") removed
      the last use of the volume_mutex, forgetting to update the comment.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ae3e715f
    • David Sterba's avatar
      583e4a23
    • David Sterba's avatar
      btrfs: optimize split page write in btrfs_set_token_##bits · f472d3c2
      David Sterba authored
      The fallback path calls the helper write_extent_buffer to write data
      spanning two extent buffer pages. As the size is known, we can do
      the write directly in two steps.  This removes one function call, and
      the compiler can optimize memcpy as the sizes are known at compile
      time. The cached token address is set to the second page.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f472d3c2
    • David Sterba's avatar
      btrfs: optimize split page write in btrfs_set_##bits · f4ca8c51
      David Sterba authored
      The helper write_extent_buffer is called to write data spanning two
      extent buffer pages. As the size is known, we can do the write
      directly in two steps.  This removes one function call, and the
      compiler can optimize memcpy as the sizes are known at compile time.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f4ca8c51
    • David Sterba's avatar
      btrfs: optimize split page read in btrfs_get_token_##bits · ba8a9a05
      David Sterba authored
      The fallback path calls the helper read_extent_buffer to read data
      spanning two extent buffer pages. As the size is known, we can do the
      read directly in two steps.  This removes one function call, and the
      compiler can optimize memcpy as the sizes are known at compile time.
      The cached token address is set to the second page.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ba8a9a05
    • David Sterba's avatar
      btrfs: optimize split page read in btrfs_get_##bits · 84da071f
      David Sterba authored
      The helper read_extent_buffer is called to read data spanning two
      extent buffer pages. As the size is known, we can do the read directly
      in two steps.  This removes one function call, and the compiler can
      optimize memcpy as the sizes are known at compile time.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      84da071f
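
A minimal sketch of the two-step split read, generated per width with token pasting in the spirit of the `##bits` naming. The names (`sketch_eb`, `sketch_get_##bits`, `SKETCH_PAGE_SIZE`, `sketch_roundtrip_32`) are assumptions for illustration, not the btrfs code; the sketch returns values in host byte order and leaves out any endianness handling.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for the sketch */
#define SKETCH_MAX_PAGES 4u

/* Hypothetical stand-in for an extent buffer backed by separate pages. */
struct sketch_eb {
	uint8_t *pages[SKETCH_MAX_PAGES];
	size_t len;
};

/*
 * Generate sketch_get_16/32/64.  A value that fits in one page is copied
 * in one step; a value crossing a page boundary is copied in two steps,
 * one per page, with sizes the compiler can reason about.
 */
#define DEFINE_SKETCH_GET(bits)						\
static inline uint##bits##_t sketch_get_##bits(const struct sketch_eb *eb, \
					       size_t off)		\
{									\
	const size_t size = sizeof(uint##bits##_t);			\
	const size_t idx = off / SKETCH_PAGE_SIZE;			\
	const size_t oip = off % SKETCH_PAGE_SIZE;			\
	const size_t part = SKETCH_PAGE_SIZE - oip;			\
	uint##bits##_t v;						\
									\
	assert(off + size <= eb->len);	/* simple bounds check */	\
	if (oip + size <= SKETCH_PAGE_SIZE) {				\
		memcpy(&v, eb->pages[idx] + oip, size);			\
		return v;						\
	}								\
	memcpy(&v, eb->pages[idx] + oip, part);				\
	memcpy((uint8_t *)&v + part, eb->pages[idx + 1], size - part);	\
	return v;							\
}

DEFINE_SKETCH_GET(16)
DEFINE_SKETCH_GET(32)
DEFINE_SKETCH_GET(64)

/* Store v byte by byte at off in a two-page buffer, then read it back. */
static inline uint32_t sketch_roundtrip_32(size_t off, uint32_t v)
{
	static uint8_t p0[SKETCH_PAGE_SIZE], p1[SKETCH_PAGE_SIZE];
	struct sketch_eb eb = { .pages = { p0, p1 },
				.len = 2 * SKETCH_PAGE_SIZE };
	const uint8_t *src = (const uint8_t *)&v;

	for (size_t i = 0; i < sizeof(v); i++) {
		size_t pos = off + i;

		eb.pages[pos / SKETCH_PAGE_SIZE][pos % SKETCH_PAGE_SIZE] = src[i];
	}
	return sketch_get_32(&eb, off);
}
```

Calling `sketch_roundtrip_32(SKETCH_PAGE_SIZE - 1, x)` places one byte on the first page and three on the second, which exercises the split path.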
    • David Sterba's avatar
      btrfs: drop unnecessary offset_in_page in extent buffer helpers · c60ac0ff
      David Sterba authored
      Helpers that iterate over extent buffer pages set up several variables,
      one of which holds the offset of the extent buffer start within a
      page. Right now extent buffers are aligned to page size, so this is
      effectively storing zero. This makes the code harder to follow and can
      be simplified.
      
      The same change is done in all the helpers:
      
      * remove: size_t start_offset = offset_in_page(eb->start);
      * simplify code using start_offset
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c60ac0ff
    • David Sterba's avatar
      btrfs: constify extent_buffer in the API functions · 2b48966a
      David Sterba authored
      There are many helpers around extent buffers, found in extent_io.h and
      ctree.h. Most of them can be converted to take constified eb as there
      are no changes to the extent buffer structure itself but rather the
      pages.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2b48966a
    • David Sterba's avatar
      btrfs: remove unused map_private_extent_buffer · db3756c8
      David Sterba authored
      All uses of map_private_extent_buffer have been replaced by a more
      efficient approach. The set/get helpers have their own bounds checker.
      The function name was confusing since the non-private helper was removed
      in a6591715 ("Btrfs: stop using highmem for extent_buffers") many
      years ago.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      db3756c8
    • David Sterba's avatar
      btrfs: speed up and simplify generic_bin_search · 5cd17f34
      David Sterba authored
      The bin search jumps over the extent buffer item keys, comparing
      directly the bytes if the key is in one page, or storing it in a
      temporary buffer in case it spans two pages.
      
      The mapping start and length are obtained from map_private_extent_buffer,
      which is heavy weight compared to what we need. We know the key size and
      can find out the eb page in a simple way.  For keys spanning two pages
      the fallback read_extent_buffer is used.
      
      The temporary variables are reduced and moved to the scope of use.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5cd17f34
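
The search described in this message can be sketched as a binary search over fixed-size items stored in a paged buffer. This is an illustration under assumptions, not the kernel function: keys are plain `uint64_t` values at the start of each item, the page size is shrunk to 64 bytes so that keys straddle pages, and all `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 64u  /* tiny page so that keys cross boundaries */
#define SKETCH_NR_PAGES 4u

struct sketch_eb {
	uint8_t *pages[SKETCH_NR_PAGES];
	size_t len;
};

/* Generic copy that may span pages; stands in for the fallback read. */
static inline void sketch_read(const struct sketch_eb *eb, void *dst,
			       size_t off, size_t len)
{
	uint8_t *out = dst;

	assert(off + len <= eb->len);
	while (len > 0) {
		size_t oip = off % SKETCH_PAGE_SIZE;
		size_t cur = SKETCH_PAGE_SIZE - oip;

		if (cur > len)
			cur = len;
		memcpy(out, eb->pages[off / SKETCH_PAGE_SIZE] + oip, cur);
		out += cur;
		off += cur;
		len -= cur;
	}
}

/* Compare a stored (possibly unaligned) key with the search key. */
static inline int sketch_comp_key(const void *stored, uint64_t key)
{
	uint64_t k;

	memcpy(&k, stored, sizeof(k));
	return (k > key) - (k < key);
}

/* Returns 0 and the slot if found, 1 and the insertion slot otherwise. */
static inline int sketch_bin_search(const struct sketch_eb *eb, size_t p,
				    size_t item_size, uint64_t key,
				    int max, int *slot)
{
	int low = 0, high = max;

	while (low < high) {
		int mid = low + (high - low) / 2;
		size_t off = p + (size_t)mid * item_size;
		size_t oip = off % SKETCH_PAGE_SIZE;
		uint64_t tmp;
		const void *stored;
		int ret;

		if (oip + sizeof(key) <= SKETCH_PAGE_SIZE) {
			/* Key lies within one page: compare it in place. */
			stored = eb->pages[off / SKETCH_PAGE_SIZE] + oip;
		} else {
			/* Key spans two pages: gather it into a temporary. */
			sketch_read(eb, &tmp, off, sizeof(tmp));
			stored = &tmp;
		}
		ret = sketch_comp_key(stored, key);
		if (ret < 0) {
			low = mid + 1;
		} else if (ret > 0) {
			high = mid;
		} else {
			*slot = mid;
			return 0;
		}
	}
	*slot = low;
	return 1;
}

/* Demo: 21 items of 12 bytes with keys 0, 10, ..., 200; -1 if not found. */
static inline int sketch_demo_search(uint64_t key)
{
	static uint8_t mem[SKETCH_NR_PAGES][SKETCH_PAGE_SIZE];
	struct sketch_eb eb = { .len = sizeof(mem) };
	int slot;

	for (size_t i = 0; i < SKETCH_NR_PAGES; i++)
		eb.pages[i] = mem[i];
	for (size_t i = 0; i < 21; i++) {
		uint64_t k = 10 * i;

		memcpy((uint8_t *)mem + i * 12, &k, sizeof(k));
	}
	return sketch_bin_search(&eb, 0, 12, key, 21, &slot) == 0 ? slot : -1;
}
```

In the demo layout, item 5 occupies bytes 60 to 67 and therefore crosses the first page boundary, so looking up key 50 goes through the temporary-buffer path.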
    • David Sterba's avatar
      btrfs: speed up btrfs_set_token_##bits helpers · ce7afe87
      David Sterba authored
      The set/get token helpers either use the cached address in the token or
      unconditionally call map_private_extent_buffer to get the address of
      the page containing the requested offset plus the mapping start and
      length. Depending on the return value, the fast path uses an unaligned
      put to write data within a page, or falls back to write_extent_buffer,
      which can handle writes spanning more pages.
      
      This is all wasteful. We know the number of bytes to write, 1/2/4/8, and
      can find out the page. Then simply check if it's contained or the
      fallback is needed. The token address is updated to the page, or the
      one at the next index, expecting that the next write will use that.
      
      This saves one function call to map_private_extent_buffer and several
      unnecessary temporary variables.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ce7afe87
    • David Sterba's avatar
      btrfs: speed up btrfs_set_##bits helpers · 029e4a42
      David Sterba authored
      The helpers unconditionally call map_private_extent_buffer to get the
      address of the page containing the requested offset plus the mapping
      start and length. Depending on the return value, the fast path uses an
      unaligned put to write data within a page, or falls back to
      write_extent_buffer, which can handle writes spanning more pages.
      
      This is all wasteful. We know the number of bytes to write, 1/2/4/8 and
      can find out the page. Then simply check if it's contained or the
      fallback is needed.
      
      This saves one function call to map_private_extent_buffer and several
      unnecessary temporary variables.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      029e4a42
    • David Sterba's avatar
      btrfs: speed up btrfs_get_token_##bits helpers · 8f9da810
      David Sterba authored
      The set/get token helpers either use the cached address in the token or
      unconditionally call map_private_extent_buffer to get the address of
      the page containing the requested offset plus the mapping start and
      length. Depending on the return value, the fast path uses an unaligned
      read to get data within a page, or falls back to read_extent_buffer,
      which can handle reads spanning more pages.
      
      This is all wasteful. We know the number of bytes to read, 1/2/4/8, and
      can find out the page. Then simply check if it's contained or the
      fallback is needed. The token address is updated to the page, or the
      one at the next index, expecting that the next read will use that.
      
      This saves one function call to map_private_extent_buffer and several
      unnecessary temporary variables.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8f9da810
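
The token caching described above can be sketched as follows. The structure and function names (`sketch_token`, `sketch_get_token_32`) and the `hits` counter are assumptions added for illustration; the split case is gathered with two plain copies standing in for the fallback read, and values are returned in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for the sketch */
#define SKETCH_MAX_PAGES 4u

struct sketch_eb {
	uint8_t *pages[SKETCH_MAX_PAGES];
	size_t len;
};

/* Hypothetical token: remembers the page used by the previous access. */
struct sketch_token {
	const struct sketch_eb *eb;
	const uint8_t *kaddr;  /* cached page address, NULL until first use */
	size_t offset;         /* buffer offset where the cached page starts */
	unsigned int hits;     /* sketch-only counter of cache hits */
};

static inline uint32_t sketch_get_token_32(struct sketch_token *t, size_t off)
{
	const size_t size = sizeof(uint32_t);
	const size_t idx = off / SKETCH_PAGE_SIZE;
	const size_t oip = off % SKETCH_PAGE_SIZE;
	uint32_t v;

	assert(off + size <= t->eb->len);

	/* Fast path: the value lies entirely inside the cached page. */
	if (t->kaddr && t->offset <= off &&
	    off + size <= t->offset + SKETCH_PAGE_SIZE) {
		t->hits++;
		memcpy(&v, t->kaddr + oip, size);
		return v;
	}

	/* Miss: cache the page that contains the start of the value. */
	t->kaddr = t->eb->pages[idx];
	t->offset = (size_t)idx * SKETCH_PAGE_SIZE;
	if (oip + size <= SKETCH_PAGE_SIZE) {
		memcpy(&v, t->kaddr + oip, size);
		return v;
	}

	/* Split value: gather both parts, then cache the next page. */
	const size_t part = SKETCH_PAGE_SIZE - oip;

	memcpy(&v, t->kaddr + oip, part);
	t->kaddr = t->eb->pages[idx + 1];
	t->offset = (size_t)(idx + 1) * SKETCH_PAGE_SIZE;
	memcpy((uint8_t *)&v + part, t->kaddr, size - part);
	return v;
}

/* Demo access pattern over two pages; returns the number of cache hits. */
static inline unsigned int sketch_demo_token_hits(void)
{
	static uint8_t p0[SKETCH_PAGE_SIZE], p1[SKETCH_PAGE_SIZE];
	struct sketch_eb eb = { .pages = { p0, p1 },
				.len = 2 * SKETCH_PAGE_SIZE };
	struct sketch_token t = { .eb = &eb };

	(void)sketch_get_token_32(&t, 0);                     /* miss */
	(void)sketch_get_token_32(&t, 4);                     /* hit */
	(void)sketch_get_token_32(&t, 8);                     /* hit */
	(void)sketch_get_token_32(&t, SKETCH_PAGE_SIZE - 2);  /* split */
	(void)sketch_get_token_32(&t, SKETCH_PAGE_SIZE);      /* hit */
	return t.hits;
}
```

Moving the cached address to the second page after a split read is what lets the following access at the start of that page take the fast path, matching the expectation stated in the message that the next read will use it.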
    • David Sterba's avatar
      btrfs: speed up btrfs_get_##bits helpers · 1441ed9b
      David Sterba authored
      The helpers unconditionally call map_private_extent_buffer to get the
      address of the page containing the requested offset plus the mapping
      start and length. Depending on the return value, the fast path uses an
      unaligned read to get data within a page, or falls back to
      read_extent_buffer, which can handle reads spanning more pages.
      
      This is all wasteful. We know the number of bytes to read, 1/2/4/8 and
      can find out the page. Then simply check if it's contained or the
      fallback is needed.
      
      This saves one function call to map_private_extent_buffer and several
      unnecessary temporary variables.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1441ed9b