• Boris Burkov's avatar
    btrfs: introduce size class to block group allocator · 52bb7a21
    Boris Burkov authored
    The aim of this patch is to reduce the fragmentation of block groups
    under certain unhappy workloads. It is particularly effective when the
    size of extents correlates with their lifetime, which is something we
    have observed causing fragmentation in the fleet at Meta.
    
    This patch categorizes extents into size classes:
    
    - x < 128KiB: "small"
    - 128KiB < x < 8MiB: "medium"
    - x > 8MiB: "large"
    
    and as much as possible reduces allocations of extents into block groups
    that don't match the size class. This takes advantage of any (possible)
    correlation between size and lifetime and also leaves behind predictable
    re-usable gaps when extents are freed; small writes don't gum up bigger
    holes.
    
    Size classes are implemented in the following way:
    
    - Mark each new block group with a size class of the first allocation
      that goes into it.
    
    - Add two new passes to ffe: "unset size class" and "wrong size class".
      First, try only matching block groups, then try unset ones, then allow
      allocation of new ones, and finally allow mismatched block groups.
    
    - Filtering is done just by skipping inappropriate ones, there is no
      special size class indexing.
    
    Other solutions I considered were:
    
    - A best fit allocator with an rb-tree. This worked well, as small
      writes didn't leak big holes from large freed extents, but led to
      regressions in ffe and write performance due to lock contention on
      the rb-tree with every allocation possibly updating it in parallel.
      Perhaps something clever could be done to do the updates in the
      background while being "right enough".
    
    - A fixed size "working set". This prevents freeing an extent
      drastically changing where writes currently land, and seems like a
      good option too. Doesn't take advantage of size in any way.
    
    - The same size class idea, but implemented with xarray marks. This
      turned out to be slower than looping the linked list and skipping
      wrong block groups, and is also less flexible since we must have only
      3 size classes (max #marks). With the current approach we can have as
      many as we like.
    
    Performance testing was done via: https://github.com/josefbacik/fsperf
    Of particular relevance are the new fragmentation specific tests.
    
    A brief summary of the testing results:
    
    - Neutral results on existing tests. There are some minor regressions
      and improvements here and there, but nothing that truly stands out as
      notable.
    - Improvement on new tests where size class and extent lifetime are
      correlated. Fragmentation in these cases is completely eliminated
      and write performance is generally a little better. There is also
      significant improvement where extent sizes are just a bit larger than
      the size class boundaries.
    - Regression on one new tests: where the allocations are sized
      intentionally a hair under the borders of the size classes. Results
      are neutral on the test that intentionally attacks this new scheme by
      mixing extent size and lifetime.
    
    The full dump of the performance results can be found here:
    https://bur.io/fsperf/size-class-2022-11-15.txt
    (there are ANSI escape codes, so best to curl and view in terminal)
    
    Here is a snippet from the full results for a new test which mixes
    buffered writes appending to a long lived set of files and large short
    lived fallocates:
    
    bufferedappendvsfallocate results
             metric             baseline       current        stdev            diff
    ======================================================================================
    avg_commit_ms                    31.13         29.20          2.67     -6.22%
    bg_count                            14         15.60             0     11.43%
    commits                          11.10         12.20          0.32      9.91%
    elapsed                          27.30         26.40          2.98     -3.30%
    end_state_mount_ns         11122551.90   10635118.90     851143.04     -4.38%
    end_state_umount_ns           1.36e+09      1.35e+09   12248056.65     -1.07%
    find_free_extent_calls       116244.30     114354.30        964.56     -1.63%
    find_free_extent_ns_max      599507.20    1047168.20     103337.08     74.67%
    find_free_extent_ns_mean       3607.19       3672.11        101.20      1.80%
    find_free_extent_ns_min            500           512          6.67      2.40%
    find_free_extent_ns_p50           2848          2876         37.65      0.98%
    find_free_extent_ns_p95           4916          5000         75.45      1.71%
    find_free_extent_ns_p99       20734.49      20920.48       1670.93      0.90%
    frag_pct_max                     61.67             0          8.05   -100.00%
    frag_pct_mean                    43.59             0          6.10   -100.00%
    frag_pct_min                     25.91             0         16.60   -100.00%
    frag_pct_p50                     42.53             0          7.25   -100.00%
    frag_pct_p95                     61.67             0          8.05   -100.00%
    frag_pct_p99                     61.67             0          8.05   -100.00%
    fragmented_bg_count               6.10             0          1.45   -100.00%
    max_commit_ms                    49.80            46          5.37     -7.63%
    sys_cpu                           2.59          2.62          0.29      1.39%
    write_bw_bytes                1.62e+08      1.68e+08   17975843.50      3.23%
    write_clat_ns_mean            57426.39      54475.95       2292.72     -5.14%
    write_clat_ns_p50             46950.40      42905.60       2101.35     -8.62%
    write_clat_ns_p99            148070.40     143769.60       2115.17     -2.90%
    write_io_kbytes                4194304       4194304             0      0.00%
    write_iops                     2476.15       2556.10        274.29      3.23%
    write_lat_ns_max            2101667.60    2251129.50     370556.59      7.11%
    write_lat_ns_mean             59374.91      55682.00       2523.09     -6.22%
    write_lat_ns_min              17353.10         16250       1646.08     -6.36%
    
    There are some mixed improvements/regressions in most metrics along with
    an elimination of fragmentation in this workload.
    
    On the balance, the drastic 1->0 improvement in the happy cases seems
    worth the mix of regressions and improvements we do observe.
    
    Some considerations for future work:
    
    - Experimenting with more size classes
    - More hinting/search ordering work to approximate a best-fit allocator
    Signed-off-by: default avatarBoris Burkov <boris@bur.io>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    52bb7a21
btrfs.h 67.5 KB