• Filipe Manana's avatar
    btrfs: fix race leading to unnecessary transaction commit when logging inode · 639bd575
    Filipe Manana authored
    When logging an inode we may often have to fallback to a full transaction
    commit, either because a new block group was allocated, there is some case
    we can not deal with without a transaction commit or some error like an
    ENOMEM happened. However after we fallback to a transaction commit, we
    have a time window where we can make the next attempt to log any inode
    commit the next transaction unnecessarily, adding additional overhead and
    increasing latency.
    
    A sequence of steps that leads to this issue is the following:
    
    1) The current open transaction has a generation of 1000;
    
    2) A new block group is allocated, and as a consequence we must make sure
       any attempts to commit a log fallback to a transaction commit, so
       btrfs_set_log_full_commit() is called from btrfs_make_block_group().
       This sets fs_info->last_trans_log_full_commit to 1000;
    
    3) Task A is holding a handle on transaction 1000 and tries to log inode X.
       Once it gets to start_log_trans(), it calls btrfs_need_log_full_commit()
       which returns true, since fs_info->last_trans_log_full_commit has a
       value of 1000. So we end up returning EAGAIN and propagating it up to
       btrfs_sync_file(), where we commit transaction 1000;
    
    4) The transaction commit task (task A) sets the transaction state to
       unblocked (TRANS_STATE_UNBLOCKED);
    
    5) Some other task, task B, starts a new transaction with a generation of
       1001;
    
    6) Some stuff is done with transaction 1001, some btree blocks COWed, etc;
    
    7) Transaction 1000 has not fully committed yet, we are still writing all
       the extent buffers it created;
    
    8) Some new task, task C, starts an fsync of inode Y, gets a handle for
       transaction 1001, and it gets to btrfs_log_inode_parent() which does
       the following check:
    
         if (fs_info->last_trans_log_full_commit > last_committed) {
             ret = 1;
             goto end_no_trans;
         }
    
       At that point last_trans_log_full_commit has a value of 1000 and
       last_committed (value of fs_info->last_trans_committed) has a value of
       999, since transaction 1000 has not yet committed - it is either still
       writing out dirty extent buffers, its super blocks or unpinning
       extents.
    
       As a consequence we return 1, which gets propagated up to
       btrfs_sync_file(), which will then call btrfs_commit_transaction()
       for transaction 1001.
    
       As a consequence we have an unnecessary second transaction commit, we
       previously committed transaction 1000 and now commit transaction 1001
       as well, resulting in more overhead and increased latency.
    
    So fix this double transaction commit issue simply by removing that check,
    because all we need to do is wait for the previous transaction to finish
    its commit, which we already do later when starting the log transaction at
    start_log_trans(), because there we acquire the tree_log_mutex lock, which
    is held by a transaction commit and only released after the transaction
    commits its super blocks.
    
    Another issue that check has is that it reads last_trans_log_full_commit
    without using READ_ONCE(), which is incorrect since that member of
    struct btrfs_fs_info is always updated with WRITE_ONCE() through the
    helper btrfs_set_log_full_commit().
    
    This double transaction commit issue can actually be triggered quite often
    in long runs of dbench, since besides the creation of new block groups
    that force inode logging to fallback to a transaction commit, there are
    cases where dbench asks to fsync a directory which had files in it that
    were previously renamed or subdirectories that were removed, resulting in
    the inode logging to fallback to a full transaction commit.
    
    This patch belongs to a patch set that is comprised of the following
    patches:
    
      btrfs: fix race causing unnecessary inode logging during link and rename
      btrfs: fix race that results in logging old extents during a fast fsync
      btrfs: fix race that causes unnecessary logging of ancestor inodes
      btrfs: fix race that makes inode logging fallback to transaction commit
      btrfs: fix race leading to unnecessary transaction commit when logging inode
      btrfs: do not block inode logging for so long during transaction commit
    
    Performance results are mentioned in the change log of the last patch.
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    639bd575
tree-log.c 172 KB