    btrfs: do not set the full sync flag on the inode during page release

    When removing an extent map at try_release_extent_mapping(), called through
    the page release callback (btrfs_releasepage()), we always set the full
    sync flag on the inode, which forces the next fsync to use a slower code
    path.
    
    This hurts performance for workloads that dirty an amount of data that
    is close to or exceeds the system's RAM and that do frequent fsync
    operations (as database servers often do, for example). In particular,
    if there are concurrent fsyncs against different files, by falling back
    to a full fsync we do a lot more checksum lookups in the checksums
    btree, since we do them for all the extents created in the current
    transaction instead of only the ones created since the last fsync.
    These checksum lookups not only take some time but, more importantly,
    they also cause contention on the checksums btree locks due to the
    concurrency with checksum insertions in the btree done by ordered
    extents from other inodes.
    
    We actually don't need to set the full sync flag on the inode, because
    we only remove extent maps that are in the list of modified extents if
    they were created in a past transaction, in which case an fsync skips
    them, as it is pointless to log them. So stop setting the full sync
    flag on the inode whenever we remove an extent map.
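
    The following is a minimal sketch of the idea, not the verbatim diff:
    the helper name release_em_sketch() is made up for illustration, while
    try_release_extent_mapping(), remove_extent_mapping(), free_extent_map()
    and the BTRFS_INODE_NEEDS_FULL_SYNC runtime flag are the existing names
    in the btrfs code this patch touches. The surrounding loop, locking and
    the checks done by the other patches in this series are omitted.

      /* Illustrative sketch only, not the actual change in extent_io.c. */
      static void release_em_sketch(struct btrfs_inode *inode,
                                    struct extent_map_tree *em_tree,
                                    struct extent_map *em)
      {
              /*
               * Before this patch the removal path also did:
               *
               *   set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
               *           &inode->runtime_flags);
               *
               * forcing the next fsync into the slow (full) path. That is
               * not needed, because we only get here for extent maps
               * created in a past transaction, which a fast fsync skips
               * anyway.
               */
              remove_extent_mapping(em_tree, em);
              /* Once for the rb tree; the caller drops its own reference. */
              free_extent_map(em);
      }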
    
    This patch is part of a patchset that consists of 3 patches, which have
    the following subjects:
    
    1/3 btrfs: fix race between page release and a fast fsync
    2/3 btrfs: release old extent maps during page release
    3/3 btrfs: do not set the full sync flag on the inode during page release
    
    Performance tests were run against a branch (misc-next) containing the
    whole patchset. The test exercises a workload where there are multiple
    processes writing to files and fsyncing them (each writing and fsyncing
    its own file), and in total the amount of data dirtied ranges from 2x
    to 4x the system's RAM (16GiB), so that the page release callback is
    invoked frequently.
    
    The following script, using fio, was used to perform the tests:
    
      $ cat test-fsync.sh
      #!/bin/bash
    
      DEV=/dev/sdk
      MNT=/mnt/sdk
      MOUNT_OPTIONS="-o ssd"
      MKFS_OPTIONS="-d single -m single"
    
      if [ $# -ne 3 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
          exit 1
      fi
    
      NUM_JOBS=$1
      FILE_SIZE=$2
      FSYNC_FREQ=$3
    
      cat <<EOF > /tmp/fio-job.ini
      [writers]
      rw=write
      fsync=$FSYNC_FREQ
      fallocate=none
      group_reporting=1
      direct=0
      bs=64k
      ioengine=sync
      size=$FILE_SIZE
      directory=$MNT
      numjobs=$NUM_JOBS
      thread
      EOF
    
      echo "Using config:"
      echo
      cat /tmp/fio-job.ini
      echo
    
      mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
      mount $MOUNT_OPTIONS $DEV $MNT
      fio /tmp/fio-job.ini
      umount $MNT
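
    As an illustration, each configuration in the results below maps to an
    invocation of the script with the number of jobs, the per-file size and
    the fsync frequency as arguments, for example "./test-fsync.sh 16 2G 1"
    for the 16 jobs / 2GiB files / fsync frequency 1 case.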
    
    The tests were performed for different numbers of jobs, file sizes and
    fsync frequency. A qemu VM using kvm was used, with 8 cores (the host
    has 12 cores, with the cpu frequency governor set to performance mode
    on all cores), 16GiB of RAM (the host has 64GiB) and an NVMe device
    used directly (without an intermediary filesystem in the host). While
    running the tests, the host was not used for anything else, to avoid
    disturbing the tests.
    
    The results obtained were the following; for each run the last line
    printed by fio is pasted (it includes the aggregated throughput and the
    test run time).
    
        *****************************************************
        ****     1 job, 32GiB file, fsync frequency 1     ****
        *****************************************************
    
    Before patchset:
    
    WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec
    
    After patchset:
    
    WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
    (+0.7% throughput, -0.8% run time)
    
        *****************************************************
        ****     2 jobs, 16GiB files, fsync frequency 1   ****
        *****************************************************
    
    Before patchset:
    
    WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec
    
    After patchset:
    
    WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
    (+19.1% throughput, -16.1% runtime)
    
        *****************************************************
        ****     4 jobs, 8GiB files, fsync frequency 1    ****
        *****************************************************
    
    Before patchset:
    
    WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec
    
    After patchset:
    
    WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
    (+37.8% throughput, -27.5% runtime)
    
        *****************************************************
        ****     8 jobs, 4GiB files, fsync frequency 1    ****
        *****************************************************
    
    Before patchset:
    
    WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec
    
    After patchset:
    
    WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
    (+74.7% throughput, -43.0% run time)
    
        *****************************************************
        ****    16 jobs, 2GiB files, fsync frequency 1    ****
        *****************************************************
    
    Before patchset:
    
    WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec
    
    After patchset:
    
    WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
    (+146.3% throughput, -59.5% run time)
    
        *****************************************************
        ****    32 jobs, 2GiB files, fsync frequency 1    ****
        *****************************************************
    
    Before patchset:
    
    WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec
    
    After patchset:
    
    WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
    (+212.7% throughput, -68.0% run time)
    
        *****************************************************
        ****    64 jobs, 1GiB files, fsync frequency 1    ****
        *****************************************************
    
    Before patchset:
    
    WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec
    
    After patchset:
    
    WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
    (+197.6% throughput, -66.4% run time)
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>