    btrfs: only copy dir index keys when logging a directory · 339d0354
    Filipe Manana authored
    Currently, when logging a directory, we copy both dir items and dir index
    items from the fs/subvolume tree to the log tree. Both items have exactly
    the same data (same struct btrfs_dir_item); the difference lies in the key
    values. A dir index key contains the index number of a directory entry
    and is used for sorting entries when listing a directory, while a dir
    item key does not contain it, as it's used for doing fast lookups of an
    entry by name.
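
    As an illustration of the two key flavors described above, here is a
    minimal sketch of the keys for one directory entry. The key type values
    are the ones from the btrfs on-disk format; the helper names and the use
    of cksum are assumptions made for this example only (btrfs hashes names
    with crc32c, cksum is just a placeholder for it).

```shell
#!/bin/sh
# Key type values as defined by the btrfs on-disk format.
BTRFS_DIR_ITEM_KEY=84
BTRFS_DIR_INDEX_KEY=96

# Placeholder for the name hash. btrfs uses crc32c; cksum is NOT the same
# function and only stands in for "some hash of the name".
name_hash() {
    printf '%s' "$1" | cksum | cut -d' ' -f1
}

# A key is (objectid, type, offset); objectid is the directory's inode number.
dir_item_key() {    # args: dir_ino name
    echo "($1 $BTRFS_DIR_ITEM_KEY $(name_hash "$2"))"
}

dir_index_key() {   # args: dir_ino index
    echo "($1 $BTRFS_DIR_INDEX_KEY $2)"
}

# Entry "foobar" with index number 2 in the directory with inode 256.
dir_item_key 256 foobar
dir_index_key 256 2
```

    Only the key offset differs in meaning: a name hash for the dir item key,
    the index number for the dir index key. The item data is the same.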
    
    We can exploit that and log only the dir index items, since they contain
    all the information needed to correctly add, replace and delete directory
    entries when replaying a log tree. Logging only the dir index items is
    also backward and forward compatible: an unpatched kernel (without this
    change) can correctly replay a log tree generated by a patched kernel
    (with this patch), and a patched kernel can correctly replay a log tree
    generated by an unpatched kernel.
    
    The backward compatibility is ensured because:
    
    1) For inserting a new dentry: a dentry is only inserted when we find a
       new dir index key - we can only insert if we know the dir index offset,
       which is encoded in the dir index key's offset;
    
    2) For deleting dentries: during log replay, before adding or replacing
       dentries, we first replay dentry deletions. Whenever we find a dir item
       key or a dir index key in the subvolume/fs tree that is not logged in
       a range for which the log tree is authoritative, we do the unlink of
       the dentry, which removes both the existing dir item key and the dir
       index key. Therefore logging just dir index keys is enough to ensure
       dentry deletions are correctly replayed;
    
    3) For dentry replacements: they work when we log only dir index keys
       and this is mostly due to a combination of 1) and 2). If we replace a
       dentry named "foobar" so that it points to inode B instead of inode A,
       then we know the dir index key for the new dentry is different from
       the old one, as it has an index number (key offset) larger than the
       old one.
       This results in replaying a deletion, through replay_dir_deletes(),
       that causes the old dentry to be removed, both the dir item key and
       the dir index key, as mentioned at 2). Then when processing the new
       dir index key, we add the new dentry, adding both a new dir item key
       and a new index key pointing to inode B, as stated in 1).
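
    The three cases above can be sketched with a toy model. This is not the
    kernel's replay code: a directory is reduced to "index name inode"
    records, one per dir index key, and replay_dir() is a hypothetical helper
    written for this example. Unlinking or adding a record stands for
    removing or adding both the dir item key and the dir index key.

```shell
#!/bin/sh
# replay_dir SUBVOL_ENTRIES LOG_ENTRIES LO HI
# Entries are newline-separated "index name inode" records. LO..HI is the
# index range for which the log is authoritative.
replay_dir() {
    # The log is fed first only so awk knows which indexes are logged.
    { printf '%s\n' "$2"; echo "--"; printf '%s\n' "$1"; } |
    awk -v lo="$3" -v hi="$4" '
        $1 == "--" { in_subvol = 1; next }
        NF == 0    { next }
        !in_subvol { logged[$1] = $0; next }
        # 2) deletions: in the authoritative range but not logged => unlink
        !($1 in logged) && $1 >= lo && $1 <= hi { next }
        # entries the log says nothing about are left untouched
        !($1 in logged) { print }
        # 1) additions: every logged dir index key yields a dentry
        END { for (i in logged) print logged[i] }
    ' | sort -n
}

# 3) replacement: "foobar" pointed to inode A at index 2 and now points to
# inode B at the larger index 4.
subvol=$(printf '%s\n' '2 foobar A' '3 other C')
log=$(printf '%s\n' '3 other C' '4 foobar B')
replay_dir "$subvol" "$log" 2 4
```

    The old "foobar" record at index 2 is dropped by the deletion rule and
    the new one at index 4 is added from the log, without any dir item key
    having been logged.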
    
    The forward compatibility, the ability for a patched kernel to replay a
    log created by an older, unpatched kernel, comes from the changes required
    for making sure we are able to replay a log that only contains dir index
    keys - we simply ignore every dir item key we find.
    
    So modify directory logging to log only dir index items, and modify the
    log replay process to ignore dir item keys from log trees created by an
    unpatched kernel and process only dir index keys. This reduces the
    amount of logged metadata by about half, and therefore the time spent
    logging or fsyncing large directories (less CPU time and less IO).
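
    A rough sketch of the replay-side filtering and of where the "about half"
    comes from. The record format ("TYPE ..." lines) and the keys_to_replay()
    helper are made up for this example and do not reflect the on-disk
    layout.

```shell
#!/bin/sh
# Keep only the dir index records of a log; dir item records are ignored.
keys_to_replay() {
    awk '$1 == "DIR_INDEX"'
}

# A log written by an unpatched kernel carries two records per entry.
old_log=$(printf '%s\n' \
    'DIR_ITEM foobar 257' 'DIR_INDEX 2 foobar 257' \
    'DIR_ITEM other 258'  'DIR_INDEX 3 other 258')

# A log written by a patched kernel carries only the dir index records.
new_log=$(printf '%s\n' 'DIR_INDEX 2 foobar 257' 'DIR_INDEX 3 other 258')

echo "records in old log: $(printf '%s\n' "$old_log" | wc -l | tr -d ' ')"
echo "records in new log: $(printf '%s\n' "$new_log" | wc -l | tr -d ' ')"

# Both logs yield the same set of records to replay.
printf '%s\n' "$old_log" | keys_to_replay
```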
    
    The following test script was used to measure this change:
    
       #!/bin/bash
    
       DEV=/dev/nvme0n1
       MNT=/mnt/nvme0n1
    
       NUM_NEW_FILES=1000000
       NUM_FILE_DELETES=10000
    
       mkfs.btrfs -f $DEV
       mount -o ssd $DEV $MNT
    
       mkdir $MNT/testdir
    
       for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
               echo -n > $MNT/testdir/file_$i
       done
    
       start=$(date +%s%N)
       xfs_io -c "fsync" $MNT/testdir
       end=$(date +%s%N)
    
       dur=$(( (end - start) / 1000000 ))
       echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
    
       # sync to force a transaction commit and wipe out the log.
       sync
    
       del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
       for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
               rm -f $MNT/testdir/file_$i
       done
    
       start=$(date +%s%N)
       xfs_io -c "fsync" $MNT/testdir
       end=$(date +%s%N)
    
       dur=$(( (end - start) / 1000000 ))
       echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
       echo
    
       umount $MNT
    
    The tests were run on a physical machine, with a non-debug kernel (Debian's
    default kernel config), for different values of $NUM_NEW_FILES and
    $NUM_FILE_DELETES, and the results were the following:
    
    ** Before patch, NUM_NEW_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
    
    dir fsync took 8412 ms after adding 1000000 files
    dir fsync took 500 ms after deleting 10000 files
    
    ** After patch, NUM_NEW_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
    
    dir fsync took 4252 ms after adding 1000000 files   (-49.5%)
    dir fsync took 269 ms after deleting 10000 files    (-46.2%)
    
    ** Before patch, NUM_NEW_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
    
    dir fsync took 745 ms after adding 100000 files
    dir fsync took 59 ms after deleting 1000 files
    
    ** After patch, NUM_NEW_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
    
    dir fsync took 404 ms after adding 100000 files   (-45.8%)
    dir fsync took 31 ms after deleting 1000 files    (-47.5%)
    
    ** Before patch, NUM_NEW_FILES = 10 000, NUM_FILE_DELETES = 1 000 **
    
    dir fsync took 67 ms after adding 10000 files
    dir fsync took 9 ms after deleting 1000 files
    
    ** After patch, NUM_NEW_FILES = 10 000, NUM_FILE_DELETES = 1 000 **
    
    dir fsync took 36 ms after adding 10000 files   (-46.3%)
    dir fsync took 5 ms after deleting 1000 files   (-44.4%)
    
    ** Before patch, NUM_NEW_FILES = 1 000, NUM_FILE_DELETES = 100 **
    
    dir fsync took 9 ms after adding 1000 files
    dir fsync took 4 ms after deleting 100 files
    
    ** After patch, NUM_NEW_FILES = 1 000, NUM_FILE_DELETES = 100 **
    
    dir fsync took 7 ms after adding 1000 files     (-22.2%)
    dir fsync took 3 ms after deleting 100 files    (-25.0%)
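
    For reference, the percentages in parentheses are the relative change of
    the "after" time against the "before" time. A small sketch of that
    arithmetic, where pct_change() is a helper name chosen for this example:

```shell
#!/bin/sh
# pct_change BEFORE_MS AFTER_MS
# Prints the relative change in percent, rounded to one decimal place.
pct_change() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (a - b) * 100 / b }'
}

pct_change 8412 4252   # 1 000 000 files added
pct_change 500 269     # 10 000 files deleted
```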

    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
tree-log.c 184 KB