    btrfs: make concurrent fsyncs wait less when waiting for a transaction commit · d0c2f4fa
    Filipe Manana authored
    Often an fsync needs to fall back to a transaction commit for several
    reasons: to ensure consistency after a power failure, because a new
    block group was allocated, or because a transient error such as ENOMEM
    or ENOSPC happened.
    
    In that case the log is marked as needing a full commit, and any
    concurrent tasks attempting to log inodes or commit the log also fall
    back to the transaction commit. When this happens they all wait for the
    task that first started the transaction commit to finish it - however
    they wait until the full transaction commit completes, which is not
    needed: they only need to wait for the superblocks to be persisted, not
    for the unpinning of all the extents pinned during the transaction's
    lifetime. Even short lived transactions can pin a few thousand extents,
    and unpinning them can take a significant amount of time - for dbench
    workloads I have observed up to 4-5 milliseconds spent unpinning
    extents in the worst cases, with between 2000 and 3000 pinned extents.
    
    So allow fsync tasks to skip waiting for the unpinning of extents when
    they call btrfs_commit_transaction() and they were not the task that
    started the transaction commit. That task still has to wait for the
    unpinning - the alternative would be to offload the transaction commit
    to another task, so that it could avoid waiting for the extent
    unpinning, or to offload the extent unpinning itself to another task.
    
    This patch is part of a patchset comprised of the following patches:
    
      btrfs: remove unnecessary directory inode item update when deleting dir entry
      btrfs: stop setting nbytes when filling inode item for logging
      btrfs: avoid logging new ancestor inodes when logging new inode
      btrfs: skip logging directories already logged when logging all parents
      btrfs: skip logging inodes already logged when logging new entries
      btrfs: remove unnecessary check_parent_dirs_for_sync()
      btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
    
    After applying the entire patchset, dbench shows improvements in
    throughput and latency. The script used to measure them is the
    following:
    
      $ cat dbench-test.sh
      #!/bin/bash
    
      DEV=/dev/sdk
      MNT=/mnt/sdk
      MOUNT_OPTIONS="-o ssd"
      MKFS_OPTIONS="-m single -d single"
    
      echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    
      umount $DEV &> /dev/null
      mkfs.btrfs -f $MKFS_OPTIONS $DEV
      mount $MOUNT_OPTIONS $DEV $MNT
    
      dbench -D $MNT -t 300 64
    
      umount $MNT
    
    The test was run on a physical machine with 12 cores (Intel Core i7),
    64G of RAM, using an NVMe device and a non-debug kernel configuration
    (Debian's default configuration).
    
    Before applying patchset, 32 clients:
    
     Operation      Count    AvgLat    MaxLat
     ----------------------------------------
     NTCreateX    9627107     0.153    61.938
     Close        7072076     0.001     3.175
     Rename        407633     1.222    44.439
     Unlink       1943895     0.658    44.440
     Deltree          256    17.339   110.891
     Mkdir            128     0.003     0.009
     Qpathinfo    8725406     0.064    17.850
     Qfileinfo    1529516     0.001     2.188
     Qfsinfo      1599884     0.002     1.457
     Sfileinfo     784200     0.005     3.562
     Find         3373513     0.411    30.312
     WriteX       4802132     0.053    29.054
     ReadX       15089959     0.002     5.801
     LockX          31344     0.002     0.425
     UnlockX        31344     0.001     0.173
     Flush         674724     5.952   341.830
    
    Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms
    
    After applying patchset, 32 clients:
    
     Operation      Count    AvgLat    MaxLat
     ----------------------------------------
     NTCreateX    9931568     0.111    25.597
     Close        7295730     0.001     2.171
     Rename        420549     0.982    49.714
     Unlink       2005366     0.497    39.015
     Deltree          256    11.149    89.242
     Mkdir            128     0.002     0.014
     Qpathinfo    9001863     0.049    20.761
     Qfileinfo    1577730     0.001     2.546
     Qfsinfo      1650508     0.002     3.531
     Sfileinfo     809031     0.005     5.846
     Find         3480259     0.309    23.977
     WriteX       4952505     0.043    41.283
     ReadX       15568127     0.002     5.476
     LockX          32338     0.002     0.978
     UnlockX        32338     0.001     2.032
     Flush         696017     7.485   228.835
    
    Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms
    
 --> +4.1% throughput, -33.1% max latency
    
    Before applying patchset, 64 clients:
    
     Operation      Count    AvgLat    MaxLat
     ----------------------------------------
     NTCreateX    8956748     0.342   108.312
     Close        6579660     0.001     3.823
     Rename        379209     2.396    81.897
     Unlink       1808625     1.108   131.148
     Deltree          256    25.632   172.176
     Mkdir            128     0.003     0.018
     Qpathinfo    8117615     0.131    55.916
     Qfileinfo    1423495     0.001     2.635
     Qfsinfo      1488496     0.002     5.412
     Sfileinfo     729472     0.007     8.643
     Find         3138598     0.855    78.321
     WriteX       4470783     0.102    79.442
     ReadX       14038139     0.002     7.578
     LockX          29158     0.002     0.844
     UnlockX        29158     0.001     0.567
     Flush         627746    14.168   506.151
    
    Throughput 924.738 MB/sec  64 clients  64 procs  max_latency=506.154 ms
    
    After applying patchset, 64 clients:
    
     Operation      Count    AvgLat    MaxLat
     ----------------------------------------
     NTCreateX    9069003     0.303    43.193
     Close        6662328     0.001     3.888
     Rename        383976     2.194    46.418
     Unlink       1831080     1.022    43.873
     Deltree          256    24.037   155.763
     Mkdir            128     0.002     0.005
     Qpathinfo    8219173     0.137    30.233
     Qfileinfo    1441203     0.001     3.204
     Qfsinfo      1507092     0.002     4.055
     Sfileinfo     738775     0.006     5.431
     Find         3177874     0.936    38.170
     WriteX       4526152     0.084    39.518
     ReadX       14213562     0.002    24.760
     LockX          29522     0.002     1.221
     UnlockX        29522     0.001     0.694
     Flush         635652    14.358   422.039
    
    Throughput 990.13 MB/sec  64 clients  64 procs  max_latency=422.043 ms
    
 --> +7.1% throughput, -16.6% max latency
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>