    Btrfs: fix csum tree corruption, duplicate and outdated checksums · 27b9a812
    Filipe Manana authored
    Under rare circumstances we can end up leaving 2 versions of a checksum
    for the same file extent range.
    
    The reason for this is that after calling btrfs_next_leaf we process
    slot 0 of the leaf it returns, instead of processing the slot set in
    path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
    btrfs_next_leaf() releases the path and before it searches for the next
    leaf, another task might cause a split of the next leaf, which migrates
    some of its keys to the leaf we were processing before calling
    btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
    same leaf but with path->slots[0] having a slot number corresponding
    to the first new key it got, that is, a slot number that didn't exist
    before calling btrfs_next_leaf(), as the leaf now has more keys than
    it had before. So we must really process the returned leaf starting at
    path->slots[0] always, as it isn't always 0, and the key at slot 0 can
    have an offset much lower than our search offset/bytenr.
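
    The difference between the two ways of reading the returned leaf can be
    sketched in plain userspace C. This is only an illustration: toy_leaf,
    toy_path and both helper functions are hypothetical stand-ins, not the
    real btrfs structures or code, and the key offsets are borrowed from the
    example given further down in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a csum tree leaf. */
struct toy_leaf {
	const uint64_t *key_offsets; /* key offsets of the items, ascending */
	size_t nritems;              /* number of items in the leaf */
};

/* Hypothetical stand-in for the part of a btrfs path used here. */
struct toy_path {
	const struct toy_leaf *leaf; /* plays the role of path->nodes[0] */
	size_t slot;                 /* plays the role of path->slots[0] */
};

/* Buggy variant: always reads the key at slot 0 of the returned leaf. */
static uint64_t next_offset_buggy(const struct toy_path *path)
{
	assert(path->leaf->nritems > 0);
	return path->leaf->key_offsets[0];
}

/* Fixed variant: reads the key at the slot the path actually points to. */
static uint64_t next_offset_fixed(const struct toy_path *path)
{
	assert(path->slot < path->leaf->nritems);
	return path->leaf->key_offsets[path->slot];
}
```

    The two variants agree whenever the slot is 0 and diverge as soon as the
    returned path points to any other slot.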
    
    For example, consider the following scenario, where we have:
    
    sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
    four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472
    
      Leaf N:
    
        slot = 0                           slot = btrfs_header_nritems() - 1
      |-------------------------------------------------------------------|
      | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
      |-------------------------------------------------------------------|
    
      Leaf N + 1:
    
          slot = 0                          slot = btrfs_header_nritems() - 1
      |--------------------------------------------------------------------|
      | [(CSUM CSUM 40161280), size 32] ... [(CSUM CSUM 40615936), size 8] |
      |--------------------------------------------------------------------|
    
    Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
    find the next highest key, which releases the current path and then searches
    for that next key. However after releasing the path and before finding that
    next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
    to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
    btrfs_next_leaf() will return us a path again with leaf N but with the slot
    pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
    is then:
    
        slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
      |----------------------------------------------------------------------------------------------------|
      | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
      |----------------------------------------------------------------------------------------------------|
    
    And incorrectly using slot 0 makes us set next_offset to 39239680 and we jump
    into the "insert:" label, which will set tmp to:
    
        tmp = min((sums->len - total_bytes) >> blocksize_bits,
            (next_offset - file_key.offset) >> blocksize_bits) =
        min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
        min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4
    
    and
    
       ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
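
    The unsigned wraparound in the computation above can be reproduced with a
    small standalone C function. This is a sketch only: csum_count() is a
    hypothetical helper that mirrors the formula quoted above, not the kernel
    code itself, and uint64_t stands in for the kernel's u64.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of checksums that go into the new item, following the formula
 * quoted in this message:
 *   min((sums->len - total_bytes) >> blocksize_bits,
 *       (next_offset - file_key.offset) >> blocksize_bits)
 */
static uint64_t csum_count(uint64_t sums_len, uint64_t total_bytes,
			   uint64_t next_offset, uint64_t file_key_offset,
			   unsigned int blocksize_bits)
{
	assert(blocksize_bits < 64);

	/* Blocks of our data that still need a checksum. */
	uint64_t remaining = (sums_len - total_bytes) >> blocksize_bits;

	/*
	 * Blocks that fit before the next key. The subtraction is unsigned,
	 * so it wraps to a huge value when next_offset < file_key_offset.
	 */
	uint64_t room = (next_offset - file_key_offset) >> blocksize_bits;

	return remaining < room ? remaining : room;
}
```

    With next_offset = 39239680 (the key at slot 0) this yields 4, matching
    the computation above. By the same formula, next_offset = 40161280 (the
    key at path->slots[0]) yields 1, that is, a 4 byte item for a csum_size
    of 4.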
    
    In other words, we insert a new csum item in the tree with key
    (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
    for all the data (4 blocks of 4096 bytes each = sums->len). This is wrong
    because the item with key (CSUM CSUM 40161280) (the one that was moved from
    leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
    bytes of our data, and those old checksums never get removed.
    
    So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
    and breaks the logical rule:
    
       Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
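
    The rule can be expressed as a small check over consecutive items. The
    sketch below is a userspace illustration under stated assumptions:
    toy_csum_item, covered_end() and csum_items_disjoint() are hypothetical
    names, and each item is assumed to hold item_size / csum_size checksums,
    one per block of blocksize bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a csum item: key offset plus item size. */
struct toy_csum_item {
	uint64_t offset;    /* key offset: logical start of the covered data */
	uint32_t item_size; /* bytes of checksums stored in the item */
};

/* First byte past the data range that the checksums of this item cover. */
static uint64_t covered_end(const struct toy_csum_item *item,
			    uint32_t csum_size, uint32_t blocksize)
{
	assert(csum_size != 0);
	return item->offset + (uint64_t)(item->item_size / csum_size) * blocksize;
}

/* True if every key starts at or after the end of its predecessor's range. */
static bool csum_items_disjoint(const struct toy_csum_item *items, size_t n,
				uint32_t csum_size, uint32_t blocksize)
{
	for (size_t i = 0; i + 1 < n; i++) {
		if (items[i + 1].offset < covered_end(&items[i], csum_size, blocksize))
			return false;
	}
	return true;
}
```

    For the scenario above, the sequence (40116224, size 4), (40157184,
    size 16), (40161280, size 32) fails the check, because the second item
    covers data up to 40173568, which is past the third key's offset.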
    
    An obvious bad effect of this is that a subsequent csum tree lookup to get
    the checksum of any of the blocks with logical offset of 40161280, 40165376
    or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
file-item.c 26.2 KB