1. 17 May, 2018 5 commits
    • btrfs: Fix delalloc inodes invalidation during transaction abort · fe816d0f
      Nikolay Borisov authored
      When a transaction is aborted, btrfs_cleanup_transaction is called to
      clean up all the various in-flight bits and pieces which might be
      active. One of those is delalloc inodes - inodes which have dirty
      pages which haven't been persisted yet. Currently the process of
      freeing such delalloc inodes in exceptional circumstances such as
      transaction abort boils down to calling btrfs_invalidate_inodes, whose
      sole job is to invalidate the dentries for all inodes related to a
      root. This is in fact wrong and insufficient since such delalloc inodes
      will likely have pending pages or ordered-extents and will be linked to
      the sb->s_inode_list. This means that unmounting a btrfs instance with
      an aborted transaction could potentially leave inodes and their pages
      visible to the system long after their superblock has been freed. This
      in turn leads to a "use-after-free" situation once page shrink is
      triggered. This situation could be simulated by running generic/019,
      which would cause such inodes to be left hanging, followed by
      generic/176, which causes memory pressure and page eviction, which leads
      to touching the freed super block instance. This situation is
      additionally detected by the unmount code of VFS with the following
      message:
      
      "VFS: Busy inodes after unmount of Self-destruct in 5 seconds.  Have a nice day..."
      
      Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      in free_fs_root for the same reason.
      
      This patch aims to rectify the situation by doing the following:
      
      1. Change btrfs_destroy_delalloc_inodes so that it calls
      invalidate_inode_pages2 for every inode on the delalloc list. This
      ensures that all the pages of the inode are released. This function
      boils down to calling btrfs_releasepage. During testing I observed cases
      where inodes on the delalloc list had an i_count of 0, so this
      necessitates using igrab to be sure we are working on a non-freed inode.
      
      2. Since calling btrfs_releasepage might queue delayed iputs, move the
      call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
      calling run_delayed_iputs for the last time. This is necessary to ensure
      that delayed iputs are run.
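The ordering of the two steps above can be illustrated with a toy user-space model. This is only a sketch in Python, not kernel code: `Inode`, `igrab`, `destroy_delalloc_inodes` and `run_delayed_iputs` are illustrative stand-ins, and the assumption that releasing pages queues exactly one delayed iput per inode is a simplification made for the example.

```python
class Inode:
    def __init__(self, ino, dirty_pages, freeing=False):
        self.ino = ino
        self.dirty_pages = dirty_pages
        self.freeing = freeing   # models an inode that is already being torn down
        self.refs = 0            # models i_count, which may be 0 on the delalloc list


def igrab(inode):
    """Toy igrab(): take a reference unless the inode is being freed."""
    if inode.freeing:
        return None
    inode.refs += 1
    return inode


def invalidate_pages(inode, delayed_iputs):
    """Stand-in for invalidate_inode_pages2(); assumed to queue one delayed iput."""
    inode.dirty_pages = 0
    inode.refs += 1              # the reference the delayed iput will drop later
    delayed_iputs.append(inode)


def destroy_delalloc_inodes(delalloc_list, delayed_iputs):
    """Step 1: release the pages of every live inode on the delalloc list."""
    while delalloc_list:
        inode = delalloc_list.pop(0)
        grabbed = igrab(inode)
        if grabbed is None:
            continue             # inode is going away, leave it alone
        invalidate_pages(grabbed, delayed_iputs)
        grabbed.refs -= 1        # iput() matching the igrab() above


def run_delayed_iputs(delayed_iputs):
    """Step 2: must run after step 1 so the queued references are dropped."""
    while delayed_iputs:
        delayed_iputs.pop().refs -= 1


delalloc = [Inode(257, 8), Inode(258, 3, freeing=True), Inode(259, 1)]
inodes = list(delalloc)
iputs = []
destroy_delalloc_inodes(delalloc, iputs)
run_delayed_iputs(iputs)
```

If `run_delayed_iputs` ran before `destroy_delalloc_inodes` in this model, the queued references would never be dropped, which is the reason given above for moving the call.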
      
      Note: this patch is tagged for 4.14 stable. The fix applies to older
      versions too, but it needs to be backported manually due to conflicts.
      
      CC: stable@vger.kernel.org # 4.14.x: 2b877331: btrfs: Split btrfs_del_delalloc_inode into 2 functions
      CC: stable@vger.kernel.org # 4.14.x
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comment to igrab ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Split btrfs_del_delalloc_inode into 2 functions · 2b877331
      Nikolay Borisov authored
      This is in preparation of fixing delalloc inodes leakage on transaction
      abort. Also export the new function.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix reading stale metadata blocks after degraded raid1 mounts · 02a3307a
      Liu Bo authored
      If a btree block, aka. extent buffer, is not available in the extent
      buffer cache, it'll be read out from the disk instead, i.e.
      
      btrfs_search_slot()
        read_block_for_search()  # hold parent and its lock, go to read child
          btrfs_release_path()
          read_tree_block()  # read child
      
      Unfortunately, the parent lock got released before reading the child, so
      commit 5bdd3536 ("Btrfs: Fix block generation verification race") had
      used 0 as the parent transid to read the child block.  This forces
      read_tree_block() not to check whether the parent transid differs from
      the generation id of the child that it reads from disk.
      
      A simple PoC is included in btrfs/124:
      
      0. Start with a two-disk raid1 btrfs.
      
      1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
      
      2. Mount this filesystem and put it in use. After a while, the device
         tree's root gets COWed, but block A hasn't been allocated/overwritten yet.
      
      3. Umount it and reload the btrfs module to remove both disks from the
         global @fs_devices list.
      
      4. mount -odegraded dev1 and write some data, so now block A is allocated
         to be a leaf in the checksum tree.  Note that only dev1 has the latest
         metadata of this filesystem.
      
      5. Umount it and mount it again normally (with both disks). Since raid1
         can pick a disk based on the writer task's pid, if btrfs_search_slot()
         needs to read block A, dev2, which does NOT have the latest metadata,
         might be read for block A, and we get a stale block A.
      
      6. As the parent transid is not checked, block A is marked as uptodate and
         put into the extent buffer cache, so future searches won't bother
         to read the disk again, which means they will make changes to this
         stale block, mark it dirty and flush it to disk.
      
      To avoid the problem, parent transid needs to be passed to
      read_tree_block().
      
      In order to get a valid parent transid, we need to hold the parent's
      lock until finishing reading child.
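The effect of passing 0 versus a real parent transid can be sketched with a small model. This is a simplified illustration in Python, not the kernel implementation: the generation numbers (7 and 12), the dictionary layout and the mirror ordering are assumptions invented for the example, and this `read_tree_block` only mimics the idea of skipping the check when the transid is 0.

```python
def read_tree_block(mirrors, parent_transid):
    """Return (mirror index, block) for the first acceptable copy.

    A parent_transid of 0 disables the generation check, as in the buggy path.
    """
    for idx, block in enumerate(mirrors):
        if parent_transid and block["generation"] != parent_transid:
            continue  # stale copy, try the next mirror
        return idx, block
    raise IOError("no mirror has the expected generation")


# Block A as seen on each device after the degraded mount wrote only to dev1.
dev1_copy = {"generation": 12, "owner": "checksum tree"}
dev2_copy = {"generation": 7, "owner": "device tree"}  # stale contents

# Assume dev2 happens to be picked first for this read.
mirrors_as_tried = [dev2_copy, dev1_copy]

buggy = read_tree_block(mirrors_as_tried, 0)    # check skipped
fixed = read_tree_block(mirrors_as_tried, 12)   # generation taken from the parent
```

In this model the unchecked read hands back the stale dev2 copy, while the checked read rejects it and falls through to the copy on dev1.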
      
      This patch needs to be slightly adapted for stable kernels, the
      &first_key parameter added to read_tree_block() is from 4.16+
      (581c1760). The fix is to replace 0 by 'gen'.
      
      Fixes: 5bdd3536 ("Btrfs: Fix block generation verification race")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: property: Set incompat flag if lzo/zstd compression is set · 1a63c198
      Misono Tomohiro authored
      Incompat flag of LZO/ZSTD compression should be set at:
      
       1. mount time (-o compress/compress-force)
       2. when defrag is done
       3. when property is set
      
      Currently 3. is missing and this commit adds this.
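The three entry points listed above can be pictured as three callers of one flag-setting helper. The following Python sketch is a toy model only: the `FsInfo` class and the three functions are hypothetical names, and the bit positions used for the LZO and ZSTD incompat flags are assumptions that should be checked against the kernel headers.

```python
# Assumed bit positions for the incompat feature flags (illustrative only).
INCOMPAT_COMPRESS_LZO = 1 << 3
INCOMPAT_COMPRESS_ZSTD = 1 << 4

# zlib needs no incompat flag, so it is simply absent from this table.
FLAG_FOR_ALGO = {"lzo": INCOMPAT_COMPRESS_LZO, "zstd": INCOMPAT_COMPRESS_ZSTD}


class FsInfo:
    def __init__(self):
        self.incompat_flags = 0

    def set_incompat_for(self, algo):
        self.incompat_flags |= FLAG_FOR_ALGO.get(algo, 0)


def mount_with_compress(fs, algo):
    fs.set_incompat_for(algo)            # case 1: mount option


def defrag_with_compress(fs, algo):
    fs.set_incompat_for(algo)            # case 2: defrag


def set_compression_property(fs, inode_props, algo):
    inode_props["btrfs.compression"] = algo
    fs.set_incompat_for(algo)            # case 3: the step this commit adds


fs = FsInfo()
props = {}
set_compression_property(fs, props, "zstd")
```

Without the last line of `set_compression_property`, the model would end up with a ZSTD property on a file while `incompat_flags` stays 0, which is the mismatch described in this message.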
      
      This could lead to a filesystem that uses ZSTD but is not marked as
      such. If a kernel without ZSTD support encounters a ZSTD compressed
      extent, it will handle that but this could be confusing to the user.
      
      Typically the filesystem is mounted with the ZSTD option, but the
      discrepancy can arise when a filesystem is never mounted with ZSTD and
      then the property on some file is set (and some new extents are
      written). A simple mount with -o compress=zstd will fix that up on an
      unpatched kernel.
      
      Same goes for LZO, but this has been around for a very long time
      (2.6.37) so it's unlikely that a pre-LZO kernel would be used.
      
      Fixes: 5c1aab1d ("btrfs: Add zstd support")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add user visible impact ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix duplicate extents after fsync of file with prealloc extents · 31d11b83
      Filipe Manana authored
      In commit 471d557a ("Btrfs: fix loss of prealloc extents past i_size
      after fsync log replay"), on fsync, we started to always log all prealloc
      extents beyond an inode's i_size in order to avoid losing them after a
      power failure. However in some cases this can lead the log replay
      code to create duplicate extent items, with different lengths, in the
      extent tree. That happens because, as of that commit, we can now log
      extent items based on extent maps that are not on the "modified" list
      of extent maps of the inode's extent map tree. Logging extent items based
      on extent maps is used during the fast fsync path to save time, and for
      this to work reliably it requires that the extent maps are not merged
      with other adjacent extent maps - having the extent maps in the list
      of modified extents gives such a guarantee.
      
      Consider the following example, captured during a long run of fsstress,
      which illustrates this problem.
      
      We have inode 271, in the filesystem tree (root 5), to which all of the
      following operations and discussion apply.
      
      A buffered write starts at offset 312391 with a length of 933471 bytes
      (end offset at 1245862). At this point we have, for this inode, the
      following extent maps with their field values:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 376832, block_start 1106399232,
            block_len 376832, orig_block_len 376832
      em C, start 417792, orig_start 417792, len 782336, block_start
            18446744073709551613, block_len 0, orig_block_len 0
      em D, start 1200128, orig_start 1200128, len 835584, block_start
            1106776064, block_len 835584, orig_block_len 835584
      em E, start 2035712, orig_start 2035712, len 245760, block_start
            1107611648, block_len 245760, orig_block_len 245760
      
      Extent map A corresponds to a hole and extent maps D and E correspond to
      preallocated extents.
      
      Extent map D ends where extent map E begins (1106776064 + 835584 =
      1107611648), but these extent maps were not merged because they are in
      the inode's list of modified extent maps.
      
      An fsync against this inode is made, which triggers the fast path
      (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback
      of the data previously written using buffered IO, and when the respective
      ordered extent finishes, btrfs_drop_extents() is called against the
      (aligned) range 311296..1249279. This causes a split of extent map D at
      btrfs_drop_extent_cache(), replacing extent map D with a new extent map
      D', also added to the list of modified extents,  with the following
      values:
      
      em D', start 1249280, orig_start of 1200128,
             block_start 1106825216 (= 1106776064 + 1249280 - 1200128),
             orig_block_len 835584,
             block_len 786432 (835584 - (1249280 - 1200128))
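The numbers for D' follow directly from trimming the front of extent map D. A minimal sketch in Python, using plain dictionaries rather than the kernel's extent map structure (the helper name `split_tail` is made up for this example):

```python
def split_tail(em, new_start):
    """Return the extent map left after dropping [em['start'], new_start)."""
    delta = new_start - em["start"]
    return {
        "start": new_start,
        "orig_start": em["orig_start"],           # unchanged by the split
        "len": em["len"] - delta,
        "block_start": em["block_start"] + delta,
        "block_len": em["block_len"] - delta,
        "orig_block_len": em["orig_block_len"],   # unchanged by the split
    }


em_d = {
    "start": 1200128,
    "orig_start": 1200128,
    "len": 835584,
    "block_start": 1106776064,
    "block_len": 835584,
    "orig_block_len": 835584,
}

# The dropped range ends at 1249279, so the remaining part starts at 1249280.
em_d_prime = split_tail(em_d, 1249280)
```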
      
      Then, during the fast fsync, btrfs_log_changed_extents() is called and
      extent maps D' and E are removed from the list of modified extents. The
      flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged,
      clear_em_logging() is called on each of them, and that causes extent map E
      to be merged with extent map D' (try_merge_map()), resulting in D' being
      deleted and E adjusted to:
      
      em E, start 1249280, orig_start 1200128, len 1032192,
            block_start 1106825216, block_len 1032192,
            orig_block_len 245760
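The merged values can be checked with a simplified model of the merge. This Python sketch only tests file and disk adjacency and reproduces the fields shown in the listing above. The real try_merge_map() applies further conditions that are not modelled here, and the helper names are invented for the example.

```python
def can_merge(prev, nxt):
    """Two maps are mergeable here if they are adjacent in the file and on disk."""
    file_adjacent = prev["start"] + prev["len"] == nxt["start"]
    disk_adjacent = prev["block_start"] + prev["block_len"] == nxt["block_start"]
    return file_adjacent and disk_adjacent


def merge_into_next(prev, nxt):
    """Fold prev into nxt, mirroring the values shown for the merged em E."""
    return {
        "start": prev["start"],
        "orig_start": prev["orig_start"],
        "len": prev["len"] + nxt["len"],
        "block_start": prev["block_start"],
        "block_len": prev["block_len"] + nxt["block_len"],
        "orig_block_len": nxt["orig_block_len"],  # kept as in the listing
    }


em_d_prime = {"start": 1249280, "orig_start": 1200128, "len": 786432,
              "block_start": 1106825216, "block_len": 786432,
              "orig_block_len": 835584}
em_e = {"start": 2035712, "orig_start": 2035712, "len": 245760,
        "block_start": 1107611648, "block_len": 245760,
        "orig_block_len": 245760}

mergeable = can_merge(em_d_prime, em_e)
merged_e = merge_into_next(em_d_prime, em_e)
```

Note that the merged map keeps an orig_start of 1200128 while its orig_block_len stays at 245760, so the two fields no longer describe the same on-disk extent.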
      
      A direct IO write at offset 1847296 and length of 360448 bytes (end offset
      at 2207744) starts, and at that moment the following extent maps exist for
      our inode:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
            block_len 270336, orig_block_len 376832
      em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
            block_len 937984, orig_block_len 937984
      em E (prealloc), start 1249280, orig_start 1200128, len 1032192,
            block_start 1106825216, block_len 1032192, orig_block_len 245760
      
      The dio write results in drop_extent_cache() being called twice. The first
      time for a range that starts at offset 1847296 and ends at offset 2035711
      (length of 188416), which results in a double split of extent map E,
      replacing it with two new extent maps:
      
      em F, start 1249280, orig_start 1200128, block_start 1106825216,
            block_len 598016, orig_block_len 598016
      em G, start 2035712, orig_start 1200128, block_start 1107611648,
            block_len 245760, orig_block_len 1032192
      
      It also creates a new extent map that represents a part of the requested
      IO (through create_io_em()):
      
      em H, start 1847296, len 188416, block_start 1107423232, block_len 188416
      
      The second call to drop_extent_cache() has a range with a start offset of
      2035712 and end offset of 2207743 (length of 172032). This leads to
      replacing extent map G with a new extent map I with the following values:
      
      em I, start 2207744, orig_start 1200128, block_start 1107783680,
            block_len 73728, orig_block_len 1032192
      
      It also creates a new extent map that represents the second part of the
      requested IO (through create_io_em()):
      
      em J, start 2035712, len 172032, block_start 1107611648, block_len 172032
      
      The dio write set the inode's i_size to 2207744 bytes.
      
      After the dio write the inode has the following extent maps:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
            block_len 270336, orig_block_len 376832
      em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
            block_len 937984, orig_block_len 937984
      em F, start 1249280, orig_start 1200128, len 598016,
            block_start 1106825216, block_len 598016, orig_block_len 598016
      em H, start 1847296, orig_start 1200128, len 188416,
            block_start 1107423232, block_len 188416, orig_block_len 835584
      em J, start 2035712, orig_start 2035712, len 172032,
            block_start 1107611648, block_len 172032, orig_block_len 245760
      em I, start 2207744, orig_start 1200128, len 73728,
            block_start 1107783680, block_len 73728, orig_block_len 1032192
      
      Now make some change to the file, for example adding an xattr, and then
      fsync it again. This triggers a fast fsync path, and as of commit
      471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync
      log replay"), we use the extent map I to log a file extent item because
      it's a prealloc extent and it starts at an offset matching the inode's
      i_size. However when we log it, we create a file extent item with a value
      for the disk byte location that is wrong, as can be seen from the
      following output of "btrfs inspect-internal dump-tree":
      
       item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53
           generation 22 type 2 (prealloc)
           prealloc data disk byte 1106776064 nr 1032192
           prealloc data offset 1007616 nr 73728
      
      Here the disk byte value corresponds to a calculation based on some fields
      from the extent map I:
      
        1106776064 = block_start (1107783680) - 1007616 (extent_offset)
        extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616
      
      The disk byte value of 1106776064 clashes with disk byte values of the
      file extent items at offsets 1249280 and 1847296 in the fs tree:
      
              item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53
                      generation 20 type 2 (prealloc)
                      prealloc data disk byte 1106776064 nr 835584
                      prealloc data offset 49152 nr 598016
              item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53
                      generation 20 type 1 (regular)
                      extent data disk byte 1106776064 nr 835584
                      extent data offset 647168 nr 188416 ram 835584
                      extent compression 0 (none)
              item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53
                      generation 20 type 1 (regular)
                      extent data disk byte 1107611648 nr 245760
                      extent data offset 0 nr 172032 ram 245760
                      extent compression 0 (none)
              item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53
                      generation 20 type 2 (prealloc)
                      prealloc data disk byte 1107611648 nr 245760
                      prealloc data offset 172032 nr 73728
      
      Instead of the disk byte value of 1106776064, the value of 1107611648
      should have been logged. Also the data offset value should have been
      172032 and not 1007616.
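The wrong and the expected values can be reproduced from the numbers given above. This is a simplified Python sketch of how the logged fields are derived from an extent map; the helper name and dictionary keys are made up for the example, and only the arithmetic is taken from this message.

```python
def logged_extent_fields(em):
    """File extent item fields derived from an extent map (simplified)."""
    extent_offset = em["start"] - em["orig_start"]
    return {
        "disk_bytenr": em["block_start"] - extent_offset,
        "disk_num_bytes": em["orig_block_len"],
        "offset": extent_offset,
        "num_bytes": em["len"],
    }


# Extent map I, as it exists after the merges and splits described above.
em_i = {"start": 2207744, "orig_start": 1200128, "len": 73728,
        "block_start": 1107783680, "orig_block_len": 1032192}
logged = logged_extent_fields(em_i)

# Values of the file extent item already present in the fs tree (item 9).
expected = {"disk_bytenr": 1107611648, "disk_num_bytes": 245760,
            "offset": 172032, "num_bytes": 73728}
```

Both sets of values point at the same byte on disk (1106776064 + 1007616 equals 1107611648 + 172032), but the logged item claims a different extent start and length, which is what produces the second extent item after replay.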

      After a log replay we end up getting two extent items in the extent tree
      with different lengths, one of 835584, which is correct and existed
      before the log replay, and another one of 1032192 which is wrong and is
      based on the logged file extent item:
      
       item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53
          refs 2 gen 15 flags DATA
          extent data backref root 5 objectid 271 offset 1200128 count 2
       item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53
          refs 1 gen 22 flags DATA
          extent data backref root 5 objectid 271 offset 1200128 count 1
      
      Obviously this leads to many problems and a filesystem check reports many
      errors:
      
       (...)
       checking extents
       Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1
       extent item 1106776064 has multiple extent items
       ref mismatch on [1106776064 835584] extent item 2, found 3
       Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680
       Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree
       Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70
       Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192
       backpointer mismatch on [1106776064 835584]
       checking free space cache
       block group 1103101952 has wrong amount of free space
       failed to load free space cache for block group 1103101952
       checking fs roots
       (...)
      
      So fix this by logging the prealloc extents beyond the inode's i_size
      based on searches in the subvolume tree instead of the extent maps.
      
      Fixes: 471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 14 May, 2018 2 commits
    • Btrfs: fix xattr loss after power failure · 9a8fca62
      Filipe Manana authored
      If a file has xattrs and we fsync it (which clears the flags
      BTRFS_INODE_NEEDS_FULL_SYNC and BTRFS_INODE_COPY_EVERYTHING from its
      inode), then the current transaction commits, and then we fsync it again
      (without either of those bits being set in its inode), we end up not
      logging all its xattrs. This results in deleting all xattrs when replaying
      the log after a power failure.
      
      Trivial reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ touch /mnt/foobar
        $ setfattr -n user.xa -v qwerty /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xab 0 64K" /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ getfattr --absolute-names --dump /mnt/foobar
        <empty output>
        $
      
      So fix this by making sure all xattrs are logged if we log a file's inode
      item and neither the flags BTRFS_INODE_NEEDS_FULL_SYNC nor
      BTRFS_INODE_COPY_EVERYTHING were set in the inode.
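The failure and the fix can be pictured with a toy model of logging and replay. This Python sketch is an illustration under stated assumptions, not the kernel logic: the `fsync`, `replay` and `after_power_failure` helpers and the dictionary layout are invented here, and replay is modelled as simply replacing the inode's xattr set with whatever was logged.

```python
def fsync(inode, log, full_sync_flags_set, log_all_xattrs_in_fast_path):
    """Record in the log what an fsync would write (toy model)."""
    entry = {"size": inode["size"]}
    if full_sync_flags_set or log_all_xattrs_in_fast_path:
        entry["xattrs"] = dict(inode["xattrs"])
    else:
        entry["xattrs"] = {}     # fast path before the fix: xattrs not logged
    log[inode["name"]] = entry


def replay(committed, log):
    """Replay treats the logged xattr set as authoritative for logged inodes."""
    result = {name: dict(state) for name, state in committed.items()}
    for name, entry in log.items():
        result[name] = {"size": entry["size"], "xattrs": dict(entry["xattrs"])}
    return result


def after_power_failure(fixed):
    inode = {"name": "foobar", "size": 0, "xattrs": {"user.xa": "qwerty"}}
    # State on disk after the first fsync and the `sync` (transaction commit).
    committed = {"foobar": {"size": 0, "xattrs": {"user.xa": "qwerty"}}}
    log = {}                     # the log starts empty in the new transaction
    inode["size"] = 65536        # the 64K pwrite
    fsync(inode, log, full_sync_flags_set=False,
          log_all_xattrs_in_fast_path=fixed)
    return replay(committed, log)["foobar"]


before_fix = after_power_failure(fixed=False)
after_fix = after_power_failure(fixed=True)
```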
      
      Fixes: 36283bf7 ("Btrfs: fix fsync xattr loss in the fast fsync path")
      Cc: <stable@vger.kernel.org> # 4.2+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: send, fix invalid access to commit roots due to concurrent snapshotting · 6f2f0b39
      Robbie Ko authored
      [BUG]
      A BUG is hit during a btrfs incremental send when creating a snapshot of
      a snapshot that is being used by send.
      
      [REASON]
      The problem can happen if, while we are doing a send, one of the snapshots
      used (parent or send) is snapshotted, because snapshotting implies COWing
      the root of the source subvolume/snapshot.
      
      1. When doing an incremental send, the send process will get the commit
         roots from the parent and send snapshots, and add references to them
         through extent_buffer_get().
      
      2. When a snapshot/subvolume is snapshotted, its root node is COWed
         (transaction.c:create_pending_snapshot()).
      
      3. COWing releases the space used by the node immediately, through:
      
         __btrfs_cow_block()
         --btrfs_free_tree_block()
         ----btrfs_add_free_space(bytenr of node)
      
      4. Because send doesn't hold a transaction open, it's possible that
         the transaction used to create the snapshot commits, switches the
         commit root and the old space used by the previous root node gets
         assigned to some other node allocation. Allocation of a new node will
         use the existing extent buffer found in memory, to which we previously
         got a reference through extent_buffer_get(), and allow the extent
         buffer's content (pages) to be modified:
      
         btrfs_alloc_tree_block
         --btrfs_reserve_extent
         ----find_free_extent (get bytenr of old node)
         --btrfs_init_new_buffer (use bytenr of old node)
         ----btrfs_find_create_tree_block
         ------alloc_extent_buffer
         --------find_extent_buffer (get old node)
      
      5. So send can access invalid memory content and have unpredictable
         behaviour.
      
      [FIX]
      So we fix the problem by copying the commit roots of the send and
      parent snapshots and use those copies.
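Why holding a reference is not enough, while a private copy is, can be shown with a toy buffer cache. This is a Python sketch under loud assumptions: `ExtentBuffer`, `buffer_cache` and `alloc_tree_block` are simplified stand-ins, and the bytenr 30408704 and the key strings are invented values.

```python
class ExtentBuffer:
    def __init__(self, bytenr, keys):
        self.bytenr = bytenr
        self.keys = list(keys)

    def clone(self):
        """Private copy that later reuse of the bytenr cannot touch."""
        return ExtentBuffer(self.bytenr, self.keys)


buffer_cache = {}


def find_or_create_buffer(bytenr):
    """Reuse the in-memory buffer for this bytenr if one already exists."""
    if bytenr not in buffer_cache:
        buffer_cache[bytenr] = ExtentBuffer(bytenr, [])
    return buffer_cache[bytenr]


def alloc_tree_block(bytenr, keys):
    eb = find_or_create_buffer(bytenr)
    eb.keys = list(keys)         # the new owner overwrites the content
    return eb


commit_root = alloc_tree_block(30408704, ["inode 256", "inode 257"])
send_ref = commit_root           # before the fix: send shares the buffer
send_copy = commit_root.clone()  # after the fix: send works on a copy

# Snapshot creation COWs the root, the transaction commits, and the freed
# bytenr is handed out again for an unrelated node.
alloc_tree_block(30408704, ["unrelated item"])
```

After the reallocation, `send_ref` sees the unrelated content, while `send_copy` still holds the keys of the original commit root.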
      
      The call trace looks like this:
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ctree.c:1861!
       invalid opcode: 0000 [#1] SMP
       CPU: 6 PID: 24235 Comm: btrfs Tainted: P           O 3.10.105 #23721
       ffff88046652d680 ti: ffff88041b720000 task.ti: ffff88041b720000
       RIP: 0010:[<ffffffffa08dd0e8>] read_node_slot+0x108/0x110 [btrfs]
       RSP: 0018:ffff88041b723b68  EFLAGS: 00010246
       RAX: ffff88043ca6b000 RBX: ffff88041b723c50 RCX: ffff880000000000
       RDX: 000000000000004c RSI: ffff880314b133f8 RDI: ffff880458b24000
       RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88041b723c66
       R10: 0000000000000001 R11: 0000000000001000 R12: ffff8803f3e48890
       R13: ffff8803f3e48880 R14: ffff880466351800 R15: 0000000000000001
       FS:  00007f8c321dc8c0(0000) GS:ffff88047fcc0000(0000)
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007efd1006d000 CR3: 0000000213a24000 CR4: 00000000003407e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Stack:
       ffff88041b723c50 ffff8803f3e48880 ffff8803f3e48890 ffff8803f3e48880
       ffff880466351800 0000000000000001 ffffffffa08dd9d7 ffff88041b723c50
       ffff8803f3e48880 ffff88041b723c66 ffffffffa08dde85 a9ff88042d2c4400
       Call Trace:
       [<ffffffffa08dd9d7>] ? tree_move_down.isra.33+0x27/0x50 [btrfs]
       [<ffffffffa08dde85>] ? tree_advance+0xb5/0xc0 [btrfs]
       [<ffffffffa08e83d4>] ? btrfs_compare_trees+0x2d4/0x760 [btrfs]
       [<ffffffffa0982050>] ? finish_inode_if_needed+0x870/0x870 [btrfs]
       [<ffffffffa09841ea>] ? btrfs_ioctl_send+0xeda/0x1050 [btrfs]
       [<ffffffffa094bd3d>] ? btrfs_ioctl+0x1e3d/0x33f0 [btrfs]
       [<ffffffff81111133>] ? handle_pte_fault+0x373/0x990
       [<ffffffff8153a096>] ? atomic_notifier_call_chain+0x16/0x20
       [<ffffffff81063256>] ? set_task_cpu+0xb6/0x1d0
       [<ffffffff811122c3>] ? handle_mm_fault+0x143/0x2a0
       [<ffffffff81539cc0>] ? __do_page_fault+0x1d0/0x500
       [<ffffffff81062f07>] ? check_preempt_curr+0x57/0x90
       [<ffffffff8115075a>] ? do_vfs_ioctl+0x4aa/0x990
       [<ffffffff81034f83>] ? do_fork+0x113/0x3b0
       [<ffffffff812dd7d7>] ? trace_hardirqs_off_thunk+0x3a/0x6c
       [<ffffffff81150cc8>] ? SyS_ioctl+0x88/0xa0
       [<ffffffff8153e422>] ? system_call_fastpath+0x16/0x1b
       ---[ end trace 29576629ee80b2e1 ]---
      
      Fixes: 7069830a ("Btrfs: add btrfs_compare_trees function")
      CC: stable@vger.kernel.org # 3.6+
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 02 May, 2018 2 commits
    • Btrfs: send, fix missing truncate for inode with prealloc extent past eof · a6aa10c7
      Filipe Manana authored
      An incremental send operation can miss a truncate operation when an inode
      has an increased size in the send snapshot and a prealloc extent beyond
      its size.
      
      Consider the following scenario where a necessary truncate operation is
      missing in the incremental send stream:
      
      1) In the parent snapshot an inode has a size of 1282957 bytes and it has
         no prealloc extents beyond its size;
      
      2) In the send snapshot it has a size of 5738496 bytes and has a new
         extent at offset 1884160 (length of 106496 bytes) and a prealloc
         extent beyond eof at offset 6729728 (and a length of 339968 bytes);
      
      3) When processing the prealloc extent, at offset 6729728, we end up at
         send.c:send_write_or_clone() and set the @len variable to a value of
         18446744073708560384 because @offset plus the original @len value is
         larger than the inode's size (6729728 + 339968 > 5738496). We then
         call send_extent_data(), with that @offset and @len, which in turn
         calls send_write(), and then the latter calls fill_read_buf(). Because
         the offset passed to fill_read_buf() is greater than the inode's i_size,
         this function returns 0 immediately, which makes send_write() and
         send_extent_data() do nothing and return immediately as well. When
         we get back to send.c:send_write_or_clone() we adjust the value
         of sctx->cur_inode_next_write_offset to @offset plus @len, which
         corresponds to 6729728 + 18446744073708560384 = 5738496 (modulo 2^64),
         which is precisely the size of the inode in the send snapshot;
      
      4) Later when at send.c:finish_inode_if_needed() we determine that
         we don't need to issue a truncate operation because the value of
         sctx->cur_inode_next_write_offset corresponds to the inode's new
         size, 5738496 bytes. This is wrong because the last write operation
         that was issued started at offset 1884160 with a length of 106496
         bytes, so the correct value for sctx->cur_inode_next_write_offset
         should be 1990656 (1884160 + 106496), so that a truncate operation
         with a value of 5738496 bytes would have been sent to insert a
         trailing hole at the destination.
      
      So fix the issue by making send.c:send_write_or_clone() not attempt
      to send write or clone operations for extents that start beyond the
      inode's size, since such attempts do nothing but waste time by
      calling helper functions and allocating path structures, and send
      currently has no fallocate command in order to create prealloc extents
      at the destination (either beyond a file's eof or not).
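The wrap-around arithmetic in the scenario above, and the effect of the early return, can be checked with a short model. This Python sketch emulates 64-bit unsigned arithmetic with a mask. The function names echo the ones in this message, but the bodies are simplified assumptions that keep only the offset bookkeeping.

```python
U64_MASK = (1 << 64) - 1


def clamped_len(offset, length, i_size):
    """Length after clamping to i_size, using wrapping u64 subtraction."""
    if offset + length > i_size:
        return (i_size - offset) & U64_MASK
    return length


def send_write_or_clone(offset, length, i_size, next_write_offset, fixed):
    """Return the updated next-write offset after processing one extent."""
    if fixed and offset >= i_size:
        return next_write_offset      # extent starts past eof: nothing to send
    length = clamped_len(offset, length, i_size)
    # Past eof nothing is actually written, yet the bookkeeping still advances.
    return (offset + length) & U64_MASK


def needs_truncate(next_write_offset, i_size):
    return next_write_offset != i_size


I_SIZE = 5738496


def run(fixed):
    nwo = send_write_or_clone(1884160, 106496, I_SIZE, 0, fixed)     # new extent
    nwo = send_write_or_clone(6729728, 339968, I_SIZE, nwo, fixed)   # prealloc
    return nwo


buggy_offset = run(fixed=False)
fixed_offset = run(fixed=True)
```

With the early return, the next-write offset stays at 1990656, so the comparison against the inode size reports that a truncate is needed.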
      
      The issue was found running the test btrfs/007 from fstests using a seed
      value of 1524346151 for fsstress.
      Reported-by: Gu, Jinxiang <gujx@cn.fujitsu.com>
      Fixes: ffa7c429 ("Btrfs: send, do not issue unnecessary truncate operations")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Take trans lock before access running trans in check_delayed_ref · 998ac6d2
      ethanwu authored
      In the previous patch:
      Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
      we avoided starting a btrfs transaction and got this information from
      fs_info->running_transaction directly.
      
      When accessing running_transaction in check_delayed_ref, there's a
      chance that the current transaction will be freed by the transaction
      commit after the NULL pointer check of running_transaction has passed.
      
      Looking at all the other places that use fs_info->running_transaction,
      they are either protected by trans_lock or hold the transaction.
      
      Fix this by using trans_lock and increasing the use_count.
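The pattern of checking the pointer and taking a reference under the same lock can be sketched as follows. This is a user-space Python illustration, not the kernel code: the `Transaction` and `FsInfo` classes and the `get_running_transaction` and `commit` helpers are hypothetical, and the reference count is a plain integer rather than an atomic refcount.

```python
import threading


class Transaction:
    def __init__(self, transid):
        self.transid = transid
        self.use_count = 1       # reference held while the transaction runs
        self.freed = False

    def put(self):
        self.use_count -= 1
        if self.use_count == 0:
            self.freed = True    # stands in for freeing the structure


class FsInfo:
    def __init__(self):
        self.trans_lock = threading.Lock()
        self.running_transaction = None


def get_running_transaction(fs_info):
    """Check the pointer and take a reference while holding trans_lock."""
    with fs_info.trans_lock:
        trans = fs_info.running_transaction
        if trans is not None:
            trans.use_count += 1
        return trans


def commit(fs_info):
    """Detach the running transaction and drop the reference it held."""
    with fs_info.trans_lock:
        trans = fs_info.running_transaction
        fs_info.running_transaction = None
    trans.put()


fs = FsInfo()
fs.running_transaction = Transaction(42)
held = get_running_transaction(fs)
commit(fs)                          # commit finishes while we still hold a ref
still_valid = not held.freed
held.put()                          # our reference was the last one
```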
      
      Fixes: e4c3b2dc ("Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: ethanwu <ethanwu@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 26 Apr, 2018 1 commit
    • btrfs: Fix wrong first_key parameter in replace_path · 17515f1b
      Qu Wenruo authored
      Commit 581c1760 ("btrfs: Validate child tree block's level and first
      key") introduced a new @first_key parameter for read_tree_block(), however
      the caller in replace_path() is passing the wrong key to read_tree_block().
      
      It should use the parameter @first_key rather than @key.
      
      Normally this doesn't expose a problem as @key is normally initialized to
      the same value as the @first_key we expect.
      However in the relocation recovery case, @key can be set to (0, 0, 0), and
      since no valid key in the relocation tree can be (0, 0, 0), it will cause
      read_tree_block() to return -EUCLEAN and interrupt relocation recovery.
      
      Fix it by setting @first_key correctly.
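The difference between passing @key and passing the key stored in the parent's node pointer can be shown with a toy check. This Python sketch is an assumption-laden model: the node layout, the key tuples and the `replace_path_read_child` helper are invented for illustration, and EUCLEAN is hard-coded to its Linux value of 117.

```python
EUCLEAN = 117   # Linux errno value, hard-coded so the sketch runs anywhere


def read_tree_block(block, first_key):
    """Return 0 if the block's first key matches first_key, else -EUCLEAN."""
    if block["keys"][0] != first_key:
        return -EUCLEAN
    return 0


def replace_path_read_child(parent, slot, key, fixed):
    child = parent["children"][slot]
    first_key = parent["keys"][slot]      # key recorded in the node pointer
    return read_tree_block(child, first_key if fixed else key)


parent = {
    "keys": [(256, 1, 0), (300, 108, 0)],
    "children": [
        {"keys": [(256, 1, 0), (257, 1, 0)]},
        {"keys": [(300, 108, 0), (301, 1, 0)]},
    ],
}

recovery_key = (0, 0, 0)   # what @key can be during relocation recovery

buggy_ret = replace_path_read_child(parent, 1, recovery_key, fixed=False)
fixed_ret = replace_path_read_child(parent, 1, recovery_key, fixed=True)
normal_ret = replace_path_read_child(parent, 1, (300, 108, 0), fixed=False)
```

The last call models the normal case, where @key happens to equal the expected first key and the old code works by coincidence.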
      
      Fixes: 581c1760 ("btrfs: Validate child tree block's level and first key")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 20 Apr, 2018 2 commits
    • btrfs: print-tree: debugging output enhancement · c0872323
      Qu Wenruo authored
      This patch enhances the following things:
      
      - tree block header
        * add generation and owner output for node and leaf
      - node pointer generation output
      - allow btrfs_print_tree() to not follow nodes
        * just like btrfs-progs
      
      Please note that, although function btrfs_print_tree() is not called by
      anyone right now, it's still a pretty useful function to debug kernel.
      So that function is still kept for later use.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c0872323
    • Nikolay Borisov's avatar
      btrfs: Fix race condition between delayed refs and blockgroup removal · 5e388e95
      Nikolay Borisov authored
      When the delayed refs for a head are all run, eventually
      cleanup_ref_head is called which (in case of deletion) obtains a
      reference for the relevant btrfs_space_info struct by querying the bg
      for the range. This is problematic because when the last extent of a
      bg is deleted a race window emerges between removal of that bg and the
      subsequent invocation of cleanup_ref_head. This can result in cache being null
      and either a null pointer dereference or assertion failure.
      
      	task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
      	RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
      	RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
      	RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
      	RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
      	RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
      	R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
      	R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
      	FS:  00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
      	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      	Call Trace:
      	 __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
      	 btrfs_run_delayed_refs+0x68/0x250 [btrfs]
      	 btrfs_should_end_transaction+0x42/0x60 [btrfs]
      	 btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
      	 btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
      	 evict+0xc6/0x190
      	 do_unlinkat+0x19c/0x300
      	 do_syscall_64+0x74/0x140
      	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      	RIP: 0033:0x7fbf589c57a7
      
      To fix this, introduce a new flag "is_system" to head_ref structs,
      which is populated at insertion time. This decouples querying for
      the spaceinfo from querying the possibly deleted bg.
      
      Fixes: d7eae340 ("Btrfs: rework delayed ref total_bytes_pinned accounting")
      CC: stable@vger.kernel.org # 4.14+
      Suggested-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5e388e95
  6. 18 Apr, 2018 5 commits
    • David Sterba's avatar
      btrfs: fix unaligned access in readdir · 92d32170
      David Sterba authored
      The last update to readdir introduced a temporary buffer to store the
      emitted readdir data, but as there are file names of variable length,
      there's a lot of unaligned access.
      
      This was observed on a sparc64 machine:
      
        Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 [btrfs]
      
      Fixes: 23b5ec74 ("btrfs: fix readdir deadlock with pagefault")
      CC: stable@vger.kernel.org # 4.14+
      Reported-and-tested-by: René Rebe <rene@exactcode.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      92d32170
    • Qu Wenruo's avatar
      btrfs: Fix wrong btrfs_delalloc_release_extents parameter · 336a8bb8
      Qu Wenruo authored
      Commit 43b18595 ("btrfs: qgroup: Use separate meta reservation type
      for delalloc") as merged into mainline is not the latest version
      submitted to the mailing list in Dec 2017.
      
      It passes a badly wrong @qgroup_free parameter, which keeps increasing the
      qgroup metadata pertrans reserved space and causes a lot of early EDQUOT.
      
      Fix it by applying the correct diff on top of current branch.
      
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      336a8bb8
    • Qu Wenruo's avatar
      btrfs: delayed-inode: Remove wrong qgroup meta reservation calls · f218ea6c
      Qu Wenruo authored
      Commit 4f5427cc ("btrfs: delayed-inode: Use new qgroup meta rsv for
      delayed inode and item") as merged into mainline was not the latest
      version submitted to the mailing list in Dec 2017.
      
      It lacks the following fixes:
      
      1) Remove btrfs_qgroup_convert_reserved_meta() call in
         btrfs_delayed_item_release_metadata()
      2) Remove btrfs_qgroup_reserve_meta_prealloc() call in
         btrfs_delayed_inode_reserve_metadata()
      
      Those fixes will resolve unexpected EDQUOT problems.
      
      Fixes: 4f5427cc ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f218ea6c
    • Qu Wenruo's avatar
      btrfs: qgroup: Use independent and accurate per inode qgroup rsv · ff6bc37e
      Qu Wenruo authored
      Unlike reservation calculation used in inode rsv for metadata, qgroup
      doesn't really need to care about things like csum size or extent usage
      for the whole tree COW.
      
      Qgroups care more about the net change in extent usage.
      That is to say, if we're going to insert one file extent, it will mostly
      find its place in an already COWed tree block, leaving extent usage unchanged.
      Or it causes a leaf split, resulting in one new net extent and increasing
      the qgroup number by nodesize.
      Or, in an even rarer case, it increases the tree level, increasing the
      qgroup number by 2 * nodesize.
      
      So qgroup doesn't need the complicated, over-estimated reservation
      calculation used for the extent allocator, which cares more about
      accuracy and avoiding errors.
      
      This patch maintains 2 new members in the btrfs_block_rsv structure for
      qgroup, using a much smaller calculation for qgroup rsv and reducing false
      EDQUOT.
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      ff6bc37e
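
The three cases listed in the message above can be written down as a small function. This is only an illustrative sketch of the net-change reasoning, not the code from the patch; the enum and function names are hypothetical.

```c
#include <stdint.h>

/* Outcomes of inserting one item, as listed in the commit message. */
enum insert_outcome {
    FITS_IN_COWED_BLOCK, /* item lands in an already COWed tree block */
    LEAF_SPLIT,          /* one new tree block is allocated */
    TREE_LEVEL_INCREASE, /* rarer case: the tree grows by one level */
};

/* Net qgroup usage change, in bytes, for one insertion. */
static uint64_t qgroup_net_change(enum insert_outcome outcome, uint64_t nodesize)
{
    switch (outcome) {
    case FITS_IN_COWED_BLOCK:
        return 0;
    case LEAF_SPLIT:
        return nodesize;
    case TREE_LEVEL_INCREASE:
        return 2 * nodesize;
    }
    return 0; /* unknown outcome: treat as no change */
}
```

With an assumed nodesize of 16384 bytes, a leaf split accounts for 16384 bytes and a tree level increase for 32768 bytes, which is far less than a reservation sized for COWing a whole tree path.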
    • Qu Wenruo's avatar
      btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT · a514d638
      Qu Wenruo authored
      Unlike the previous method, which tried to commit the transaction inside
      qgroup_reserve(), this time we try to commit the transaction using
      fs_info->transaction_kthread, which avoids nested transactions and any
      need to worry about the locking context.
      
      Since it's an asynchronous function call and we won't wait for the
      transaction commit, unlike the previous method, we must call it before we
      hit the qgroup limit.
      
      So this patch uses the ratio and size of the qgroup meta_pertrans
      reservation as an indicator to check if we should trigger a transaction
      commit.  (meta_prealloc won't be cleaned in transaction commit, it's
      useless anyway)
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a514d638
  7. 13 Apr, 2018 1 commit
    • Qu Wenruo's avatar
      btrfs: Only check first key for committed tree blocks · 5d41be6f
      Qu Wenruo authored
      When looping btrfs/074 with many cpus (>= 8), it's possible to trigger
      kernel warning due to first key verification:
      
      [ 4239.523446] WARNING: CPU: 5 PID: 2381 at fs/btrfs/disk-io.c:460 btree_read_extent_buffer_pages+0x1ad/0x210
      [ 4239.523830] Modules linked in:
      [ 4239.524630] RIP: 0010:btree_read_extent_buffer_pages+0x1ad/0x210
      [ 4239.527101] Call Trace:
      [ 4239.527251]  read_tree_block+0x42/0x70
      [ 4239.527434]  read_node_slot+0xd2/0x110
      [ 4239.527632]  push_leaf_right+0xad/0x1b0
      [ 4239.527809]  split_leaf+0x4ea/0x700
      [ 4239.527988]  ? leaf_space_used+0xbc/0xe0
      [ 4239.528192]  ? btrfs_set_lock_blocking_rw+0x99/0xb0
      [ 4239.528416]  btrfs_search_slot+0x8cc/0xa40
      [ 4239.528605]  btrfs_insert_empty_items+0x71/0xc0
      [ 4239.528798]  __btrfs_run_delayed_refs+0xa98/0x1680
      [ 4239.529013]  btrfs_run_delayed_refs+0x10b/0x1b0
      [ 4239.529205]  btrfs_commit_transaction+0x33/0xaf0
      [ 4239.529445]  ? start_transaction+0xa8/0x4f0
      [ 4239.529630]  btrfs_alloc_data_chunk_ondemand+0x1b0/0x4e0
      [ 4239.529833]  btrfs_check_data_free_space+0x54/0xa0
      [ 4239.530045]  btrfs_delalloc_reserve_space+0x25/0x70
      [ 4239.531907]  btrfs_direct_IO+0x233/0x3d0
      [ 4239.532098]  generic_file_direct_write+0xcb/0x170
      [ 4239.532296]  btrfs_file_write_iter+0x2bb/0x5f4
      [ 4239.532491]  aio_write+0xe2/0x180
      [ 4239.532669]  ? lock_acquire+0xac/0x1e0
      [ 4239.532839]  ? __might_fault+0x3e/0x90
      [ 4239.533032]  do_io_submit+0x594/0x860
      [ 4239.533223]  ? do_io_submit+0x594/0x860
      [ 4239.533398]  SyS_io_submit+0x10/0x20
      [ 4239.533560]  ? SyS_io_submit+0x10/0x20
      [ 4239.533729]  do_syscall_64+0x75/0x1d0
      [ 4239.533979]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      [ 4239.534182] RIP: 0033:0x7f8519741697
      
      The problem here is that at btree_read_extent_buffer_pages() we haven't
      acquired a read/write lock on that extent buffer, so only basic info
      like level/bytenr is reliable.
      
      So a race condition leads to this false alert.
      
      However, at the current call site it's impossible to acquire the proper
      lock without a race window.
      To fix the problem, we only verify first key for committed tree blocks
      (whose generation is no larger than fs_info->last_trans_committed), so
      the content of such tree blocks will not change and there is no need to
      get read/write lock.
      Reported-by: Nikolay Borisov <nborisov@suse.com>
      Fixes: 581c1760 ("btrfs: Validate child tree block's level and first key")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5d41be6f
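
The rule described above reduces to a single comparison on the block's generation. A minimal sketch, assuming both values are plain 64-bit transaction ids; the function name is hypothetical and this is not the kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether the first key of a tree block may be verified without
 * holding its lock. A block whose generation is no larger than the last
 * committed transaction can no longer change, so its content is stable.
 */
static bool should_verify_first_key(uint64_t block_generation,
                                    uint64_t last_trans_committed)
{
    return block_generation <= last_trans_committed;
}
```

Blocks from the still-running transaction are skipped, which trades some coverage for the absence of false alerts.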
  8. 12 Apr, 2018 5 commits
    • David Sterba's avatar
      btrfs: add SPDX header to Kconfig · 852eb3ae
      David Sterba authored
      Signed-off-by: David Sterba <dsterba@suse.com>
      852eb3ae
    • David Sterba's avatar
      btrfs: replace GPL boilerplate by SPDX -- sources · c1d7c514
      David Sterba authored
      Remove GPL boilerplate text (long, short, one-line) and keep the rest,
      ie. personal, company or original source copyright statements. Add the
      SPDX header.
      Signed-off-by: David Sterba <dsterba@suse.com>
      c1d7c514
    • David Sterba's avatar
      btrfs: replace GPL boilerplate by SPDX -- headers · 9888c340
      David Sterba authored
      Remove GPL boilerplate text (long, short, one-line) and keep the rest,
      ie. personal, company or original source copyright statements. Add the
      SPDX header.
      
      Unify the include protection macros to match the file names.
      Signed-off-by: David Sterba <dsterba@suse.com>
      9888c340
    • Filipe Manana's avatar
      Btrfs: fix loss of prealloc extents past i_size after fsync log replay · 471d557a
      Filipe Manana authored
      Currently if we allocate extents beyond an inode's i_size (through the
      fallocate system call) and then fsync the file, we log the extents but
      after a power failure we replay them and then immediately drop them.
      This behaviour has existed since about 2009, commit c71bf099 ("Btrfs:
      Avoid orphan inodes cleanup while replaying log"), because it marks
      the inode as an orphan instead of dropping any extents beyond i_size
      before replaying logged extents, so after the log replay, and while
      the mount operation is still ongoing, we find the inode marked as an
      orphan and then perform a truncation (drop extents beyond the inode's
      i_size). Because the processing of orphan inodes is still done
      right after replaying the log and before the mount operation finishes,
      the intention of that commit does not make any sense (at least as
      of today). However reverting that behaviour is not enough, because
      we can not simply discard all extents beyond i_size and then replay
      logged extents, because we risk dropping extents beyond i_size created
      in past transactions, for example:
      
        add prealloc extent beyond i_size
        fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
        transaction commit
        add another prealloc extent beyond i_size
        fsync - triggers the fast fsync path
        power failure
      
      In that scenario, we would drop the first extent and then replay the
      second one. To fix this just make sure that all prealloc extents
      beyond i_size are logged, and if we find too many (which is far from
      a common case), fallback to a full transaction commit (like we do when
      logging regular extents in the fast fsync path).
      
      Trivial reproducer:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
       $ sync
       $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
       $ xfs_io -c "fsync" /mnt/foo
       <power failure>
      
       # mount to replay log
       $ mount /dev/sdb /mnt
       # at this point the file only has one extent, at offset 0, size 256K
      
      A test case for fstests follows soon, covering multiple scenarios that
      involve adding prealloc extents with previous shrinking truncates and
      without such truncates.
      
      Fixes: c71bf099 ("Btrfs: Avoid orphan inodes cleanup while replaying log")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      471d557a
    • Liu Bo's avatar
      Btrfs: clean up resources during umount after trans is aborted · af722733
      Liu Bo authored
      Currently if some fatal errors occur, like all IO get -EIO, resources
      would be cleaned up when
      a) transaction is being committed or
      b) BTRFS_FS_STATE_ERROR is set
      
      However, in some rare cases, resources may be left alone after transaction
      gets aborted and umount may run into some ASSERT(), e.g.
      ASSERT(list_empty(&block_group->dirty_list));
      
      For case a), in btrfs_commit_transaction(), there are several places at the
      beginning where we just call btrfs_end_transaction() without cleaning up
      resources.  For case b), it is possible that the trans handle doesn't have
      any dirty stuff, then only the trans handle is marked as aborted while
      BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.
      
      This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
      all resources won't stay in memory after umount.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      af722733
  9. 05 Apr, 2018 3 commits
  10. 31 Mar, 2018 14 commits
    • David Sterba's avatar
      btrfs: lift errors from add_extent_changeset to the callers · 57599c7e
      David Sterba authored
      The missing error handling in add_extent_changeset was hidden, so make
      it at least visible in the callers.
      Signed-off-by: David Sterba <dsterba@suse.com>
      57599c7e
    • Liu Bo's avatar
      Btrfs: print error messages when failing to read trees · f50f4353
      Liu Bo authored
      When mount fails to read trees like the fs tree, checksum tree, extent
      tree, etc, there is not enough information about what went wrong.
      
      With this, messages like
      
      "BTRFS warning (device sdf): failed to read root (objectid=7): -5"
      
      would help us a bit.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f50f4353
    • David Sterba's avatar
      btrfs: use proper type for btrfs_mask_flags flags · 38e82de8
      David Sterba authored
      All users pass a local unsigned int and not the __uXX types that are
      supposed to be used for userspace interfaces.
      Signed-off-by: David Sterba <dsterba@suse.com>
      38e82de8
    • David Sterba's avatar
      btrfs: split dev-replace locking helpers for read and write · 7e79cb86
      David Sterba authored
      The current calls are unclear in what way btrfs_dev_replace_lock takes
      the locks, so drop the argument, split the helpers and use similar
      naming as for read and write locks.
      Signed-off-by: David Sterba <dsterba@suse.com>
      7e79cb86
    • David Sterba's avatar
      btrfs: remove stale comments about fs_mutex · e7ab0af6
      David Sterba authored
      The fs_mutex was killed in 2008 by a2135011 ("Btrfs: Replace
      the big fs_mutex with a collection of other locks"), but is still
      mentioned in some comments.
      
      We don't have any extra needs for locking in the ACL handlers.
      Signed-off-by: David Sterba <dsterba@suse.com>
      e7ab0af6
    • David Sterba's avatar
      btrfs: use RCU in btrfs_show_devname for device list traversal · 88c14590
      David Sterba authored
      The show_devname callback is used to print device name in
      /proc/self/mounts, we need to traverse the device list consistently and
      read the name that's copied to a seq buffer so we don't need further
      locking.
      
      If the first device is being deleted at the same time, the RCU will
      allow us to read the device name, though it will become stale right
      after the RCU protection ends. This is unavoidable and the user can
      expect that the device will disappear from the filesystem's list at some
      point.
      
      The device_list_mutex was pretty heavy as it is used eg. for writing
      superblock and a few other IO related contexts. This can stall any
      application that reads the proc file for no reason.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      88c14590
    • David Sterba's avatar
      btrfs: update barrier in should_cow_block · d1980131
      David Sterba authored
      Once there was a simple int force_cow that was used with the plain
      barriers, and then converted to a bit, so we should use the appropriate
      barrier helper.
      
      Other variables in the complex if condition do not depend on a barrier,
      so we should be fine in case the atomic barrier becomes a no-op.
      Signed-off-by: David Sterba <dsterba@suse.com>
      d1980131
    • David Sterba's avatar
      btrfs: use lockdep_assert_held for mutexes · a32bf9a3
      David Sterba authored
      Using lockdep_assert_held is preferred, replace mutex_is_locked.
      Signed-off-by: David Sterba <dsterba@suse.com>
      a32bf9a3
    • David Sterba's avatar
      btrfs: use lockdep_assert_held for spinlocks · a4666e68
      David Sterba authored
      Using lockdep_assert_held is preferred, replace assert_spin_locked.
      Signed-off-by: David Sterba <dsterba@suse.com>
      a4666e68
    • Qu Wenruo's avatar
      btrfs: Validate child tree block's level and first key · 581c1760
      Qu Wenruo authored
      We have several reports about node pointers pointing to incorrect child
      tree blocks, which can even have the wrong owner and level while still
      having a valid generation and checksum.
      
      Although btrfs check can handle it and print an error message like:
      leaf parent key incorrect 60670574592
      
      the kernel doesn't have enough checks for this type of corruption.
      At least add such a check to read_tree_block() and btrfs_read_buffer(),
      where we need two new parameters @level and @first_key to verify the
      child tree block.
      
      The new @level check is mandatory and all call sites are already
      modified to extract expected level from its call chain.
      
      While @first_key is optional, the following call sites are skipping such
      check:
      1) Root node/leaf
         As ROOT_ITEM doesn't contain the first key, skip @first_key check.
      2) Direct backref
         Only parent bytenr and level is known and we need to resolve the key
         all by ourselves, skip @first_key check.
      
      Another note of this verification is, it needs extra info from nodeptr
      or ROOT_ITEM, so it can't fit into current tree-checker framework, which
      is limited to node/leaf boundary.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      581c1760
    • Qu Wenruo's avatar
      btrfs: tests/qgroup: Fix wrong tree backref level · 3c0efdf0
      Qu Wenruo authored
      The extent tree of the test fs is like the following:
      
       BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
        item 0 key (4096 168 4096) itemoff 3944 itemsize 51
                extent refs 1 gen 1 flags 2
                tree block key (68719476736 0 0) level 1
                                                 ^^^^^^^
                ref#0: tree block backref root 5
      
      And it's using an empty tree for fs tree, so there is no way that its
      level can be 1.
      
      For REAL (created by mkfs) fs tree backref with no skinny metadata, the
      result should look like:
      
       item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
               refs 1 gen 4 flags TREE_BLOCK
               tree block key (256 INODE_ITEM 0) level 0
                                                 ^^^^^^^
               tree block backref root 5
      
      Fix the level to 0, so it won't break later tree level checker.
      
      Fixes: faa2dbf0 ("Btrfs: add sanity tests for new qgroup accounting code")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3c0efdf0
    • Filipe Manana's avatar
      Btrfs: fix copy_items() return value when logging an inode · 8434ec46
      Filipe Manana authored
      When logging an inode, at tree-log.c:copy_items(), if we call
      btrfs_next_leaf() at the loop which checks for the need to log holes, we
      need to make sure copy_items() returns the value 1 to its caller and
      not 0 (on success). This is because the path the caller passed was
      released and is now different from what it was before, and the caller
      expects a return value of 0 to mean both success and that the path
      has not changed, while a return value of 1 means both success and
      signals the caller that it can not reuse the path, it has to perform
      another tree search.
      
      Even though this case should not be triggered under normal
      circumstances, or should at least be very rare, its consequences can be
      very unpredictable (especially when replaying a log tree).
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8434ec46
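
The return-value convention described above can be sketched as follows. This is an illustration of the contract only; the helper names are hypothetical and the real copy_items() does far more work.

```c
#include <stdbool.h>

/*
 * Compute the value a copy step should return to its caller:
 *   < 0  error
 *     0  success, and the caller's path is unchanged
 *     1  success, but the path was released (for example after moving
 *        to the next leaf), so the caller must not reuse it
 */
static int copy_step_return_value(int err, bool path_released)
{
    if (err < 0)
        return err;
    return path_released ? 1 : 0;
}

/* Caller side: a return value of 1 means another tree search is needed. */
static bool caller_must_search_again(int ret)
{
    return ret == 1;
}
```

The bug fixed by the commit corresponds to returning 0 in a case where the path had in fact been released.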
    • Filipe Manana's avatar
      Btrfs: fix fsync after hole punching when using no-holes feature · 4ee3fad3
      Filipe Manana authored
      When we have the no-holes mode enabled and fsync a file after punching a
      hole in it, we can end up not logging the whole hole range in the log tree.
      This happens if the file has extent items that span more than one leaf and
      we punch a hole that covers a range that starts in a leaf but does not go
      beyond the offset of the first extent in the next leaf.
      
      Example:
      
        $ mkfs.btrfs -f -O no-holes -n 65536 /dev/sdb
        $ mount /dev/sdb /mnt
        $ for ((i = 0; i <= 831; i++)); do
      	offset=$((i * 2 * 256 * 1024))
      	xfs_io -f -c "pwrite -S 0xab -b 256K $offset 256K" \
      		/mnt/foobar >/dev/null
          done
        $ sync
      
        # We now have 2 leafs in our filesystem fs tree, the first leaf has an
        # item corresponding to the extent at file offset 216530944 and the second
        # leaf has a first item corresponding to the extent at offset 217055232.
        # Now we punch a hole that partially covers the range of the extent at
        # offset 216530944 but does not go beyond the offset 217055232.
      
        $ xfs_io -c "fpunch $((216530944 + 128 * 1024 - 4000)) 256K" /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
      
        <power fail>
      
        # mount to replay the log
        $ mount /dev/sdb /mnt
      
        # Before this patch, only the subrange [216658016, 216662016[ (length of
        # 4000 bytes) was logged, leaving an incorrect file layout after log
        # replay.
      
      Fix this by checking if there is a hole between the last extent item that
      we processed and the first extent item in the next leaf, and if there is
      one, log an explicit hole extent item.
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4ee3fad3
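
The offsets in the reproducer above can be checked with a little arithmetic. The sketch below only recomputes numbers that appear in the example (the write loop stride, the fpunch arguments and the two extent offsets); the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define KiB 1024ULL

/* File offset written by iteration i of the reproducer's loop. */
static uint64_t extent_offset(uint64_t i)
{
    return i * 2 * 256 * KiB;
}

/* Start of the punched range: $((216530944 + 128 * 1024 - 4000)). */
static uint64_t punch_start(void)
{
    return 216530944ULL + 128 * KiB - 4000;
}

/* End (exclusive) of the punched range, 256K after its start. */
static uint64_t punch_end(void)
{
    return punch_start() + 256 * KiB;
}

/* True when the hole ends before the first extent of the second leaf. */
static bool punch_stays_before_next_leaf(void)
{
    return punch_end() <= 217055232ULL;
}
```

The hole starts at 216658016 and ends at 216920160, inside the gap before the extent at 217055232, and the 4000 bytes that were logged before the patch correspond to [216658016, 216662016[.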
    • David Sterba's avatar
      btrfs: use helper to set ulist aux from a qgroup · a1840b50
      David Sterba authored
      We have a nice helper to do proper casting of a qgroup to a ulist aux
      value. And several places that could make use of it.
      Signed-off-by: David Sterba <dsterba@suse.com>
      a1840b50