    btrfs: fix stripe length calculation for non-zoned data chunk allocation

    Commit f6fca391 ("btrfs: store chunk size in space-info struct")
    broke data chunk allocations on non-zoned multi-device filesystems when
    using the default chunk_size.  Commit 5da431b7 ("btrfs: fix the max
    chunk size and stripe length calculation") partially fixed that, and
    this patch completes the fix for that case.
    
    After commits f6fca391 and 5da431b7, the sequence of events for
    a data chunk allocation on a non-zoned filesystem is as follows
    (condensed into a pseudo-C sketch after the list):
    
            1.  btrfs_create_chunk calls init_alloc_chunk_ctl, which copies
            space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len
            unmodified.  Before f6fca391, the ctl->max_stripe_len value was
            fixed at 1 GiB for non-zoned data chunks and not configurable.
    
            2.  btrfs_create_chunk calls gather_device_info, which consumes
            some fields of chunk_ctl and fills in others.
    
            3.  gather_device_info multiplies ctl->max_stripe_len by
            ctl->dev_stripes (which is 1 in all cases except dup)
            and calls find_free_dev_extent with that number as num_bytes.
    
            4.  find_free_dev_extent locates the first dev_extent hole on
            a device which is at least as large as num_bytes.  With the
            default chunk_size from f6fca391, num_bytes is 10 GiB, so it
            finds the first hole that is at least 10 GiB long, or the
            largest hole if no hole is that long.  This is different from
            the pre-f6fca391 behavior, where num_bytes is 1 GiB, and
            find_free_dev_extent may choose a different hole.
    
            5.  gather_device_info repeats step 4 for every device to find
            the first or largest dev_extent hole that can be allocated on
            each device.
    
            6.  gather_device_info sorts the device list by the hole size
            on each device, using total unallocated space on each device to
            break ties, then returns to btrfs_create_chunk with the list.
    
            7.  btrfs_create_chunk calls decide_stripe_size_regular.
    
            8.  decide_stripe_size_regular finds the largest stripe_len
            that fits across the dev_extent holes that gather_device_info
            found on the first nr_devs devices (and satisfies other
            constraints on stripe_len that are not relevant here).
    
            9.  decide_stripe_size_regular caps the length of the stripe it
            computed at 1 GiB.  This cap appeared in 5da431b7 to correct
            one of the other regressions introduced in f6fca391.
    
            10.  btrfs_create_chunk creates a new chunk with the above
            computed size and number of devices.
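
    Condensed into pseudo-C, the flow looks roughly like this.  This is a
    sketch for orientation only: the helpers exist under these names in
    fs/btrfs/volumes.c, but the bodies are heavily abridged and the field
    names follow the prose above rather than the exact struct members:

            /* Sketch of btrfs_create_chunk() for the non-zoned case. */

            /* Step 1: init_alloc_chunk_ctl, the 10 GiB default flows in
             * unmodified: */
            ctl->max_stripe_len = space_info->chunk_size;       /* 10 GiB */

            /* Steps 2-6: gather_device_info, for each device: */
            num_bytes = ctl->max_stripe_len * ctl->dev_stripes; /* x1 */
            find_free_dev_extent(device, num_bytes, &dev_offset, &max_avail);
            /* ...then sort devices by hole size, ties broken by total
             * free space. */

            /* Steps 7-9: decide_stripe_size_regular, including the 1 GiB
             * cap added by 5da431b7: */
            ctl->stripe_size = min_t(u64, ctl->stripe_size, SZ_1G);

            /* Step 10: create the chunk from the (mismatched) results. */
            create_chunk(trans, ctl, devices_info);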
    
    At step 4, gather_device_info() has found a location where a stripe of
    up to 10 GiB in length could be allocated on several devices, and has
    selected which devices should have a dev_extent allocated on them, but
    at step 9, only 1 GiB of the space that was found on each device can
    be used.  This mismatch causes new suboptimal chunk allocation cases
    that did not occur in pre-f6fca391 kernels.
    
    Consider a filesystem using the raid1 profile with 3 devices.  After
    some balances, device 1 has 10x 1 GiB unallocated space, while devices
    2 and 3 have 1x 10 GiB unallocated space, i.e. the same total amount
    of space, but distributed across different numbers of dev_extent
    holes.  For visualization, let's ignore all the chunks that were
    allocated before this point, and focus on the remaining holes:
    
            Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated)
            Device 2:  [__________] (10 GiB contig unallocated)
            Device 3:  [__________] (10 GiB contig unallocated)
    
    Before f6fca391, the allocator would fill these optimally by
    allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3
    ([13]), or 2 and 3 ([23]):
    
            [after 0 chunk allocations]
            Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
            Device 2:  [__________] (10 GiB)
            Device 3:  [__________] (10 GiB)
    
            [after 1 chunk allocation]
            Device 1:  [12] [_] [_] [_] [_] [_] [_] [_] [_] [_]
            Device 2:  [12] [_________] (9 GiB)
            Device 3:  [__________] (10 GiB)
    
            [after 2 chunk allocations]
            Device 1:  [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
            Device 2:  [12] [_________] (9 GiB)
            Device 3:  [13] [_________] (9 GiB)
    
            [after 3 chunk allocations]
            Device 1:  [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB)
            Device 2:  [12] [12] [________] (8 GiB)
            Device 3:  [13] [_________] (9 GiB)
    
            [...]
    
            [after 12 chunk allocations]
            Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB)
            Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB)
            Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
    
            [after 13 chunk allocations]
            Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB)
            Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
            Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
    
            [after 14 chunk allocations]
            Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
            Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
            Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB)
    
            [after 15 chunk allocations]
            Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
            Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full)
            Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full)
    
    This allocates all of the space with no waste.  The sorting function used
    by gather_device_info considers free space holes above 1 GiB in length
    to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently
    long hole on each device, all the holes appear equal in the sort, and the
    comparison falls back to sorting devices by total free space.  This keeps
    usable space on each device equal so they can all be filled completely.
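
    As a simplified model of that comparison (not the verbatim kernel
    code; the clamp on hole length and the two sort keys are the point):

            /* Model of the per-device data and sort in gather_device_info(). */
            struct dev_info {
                    u64 max_avail;   /* hole found, clamped to the requested
                                        size when a long-enough hole exists */
                    u64 total_avail; /* total unallocated bytes on the device */
            };

            static int cmp_dev(const struct dev_info *a, const struct dev_info *b)
            {
                    /* Primary key: hole size, descending.  Pre-f6fca391 the
                     * ask was 1 GiB, so every hole >= 1 GiB tied here. */
                    if (a->max_avail != b->max_avail)
                            return a->max_avail > b->max_avail ? -1 : 1;
                    /* Tie-breaker: total free space, descending.  This is
                     * what kept the three devices evenly filled. */
                    if (a->total_avail != b->total_avail)
                            return a->total_avail > b->total_avail ? -1 : 1;
                    return 0;
            }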
    
    After f6fca391, the allocator prefers the devices with larger holes
    over the devices with more free space, so it makes bad allocation choices:
    
            [after 1 chunk allocation]
            Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
            Device 2:  [23] [_________] (9 GiB)
            Device 3:  [23] [_________] (9 GiB)
    
            [after 2 chunk allocations]
            Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
            Device 2:  [23] [23] [________] (8 GiB)
            Device 3:  [23] [23] [________] (8 GiB)
    
            [after 3 chunk allocations]
            Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
            Device 2:  [23] [23] [23] [_______] (7 GiB)
            Device 3:  [23] [23] [23] [_______] (7 GiB)
    
            [...]
    
            [after 9 chunk allocations]
            Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
            Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
            Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
    
            [after 10 chunk allocations]
            Device 1:  [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB)
            Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
            Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)

            [after 11 chunk allocations]
            Device 1:  [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
            Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
            Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [13] (full)
    
    No further allocations are possible, with 8 GiB wasted (4 GiB of data
    space; see the tally below).  The sort in gather_device_info now
    considers free space in holes longer than 1 GiB to be distinct, so it
    will prefer devices 2 and 3 over device 1 until all but 1 GiB is
    allocated on devices 2 and 3.  At that point, with only 1 GiB
    unallocated on every device, the largest hole length on each device is
    equal at 1 GiB, so the sort finally falls back to ordering the devices
    by total free space, but by this time it is too late to make use of
    the free space on device 1.
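
    Tallying the end state of the diagram above makes the arithmetic
    explicit:

            raw space:  3 devices x 10 GiB            = 30 GiB
            allocated:  11 chunks x 1 GiB x 2 copies  = 22 GiB
            stranded:   30 GiB - 22 GiB               =  8 GiB (all on device 1)
            data lost:  8 GiB / 2 copies (raid1)      =  4 GiB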
    
    Note that it's possible to contrive a case where the pre-f6fca391
    allocator fails the same way, but these cases generally have extensive
    dev_extent fragmentation as a precondition (e.g. many holes of 768 MiB
    in length on one device, and a few holes of 1 GiB in length on the
    others).  With the regression in f6fca391, bad chunk allocation can
    occur even under optimal conditions, when all dev_extent holes are
    exact multiples of stripe_len in length, as in the example above.
    
    Also note that post-f6fca391 kernels do treat dev_extent holes
    larger than 10 GiB as equal, so the bad behavior won't show up on a
    freshly formatted filesystem; however, as the filesystem ages and fills
    up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem
    can show up as a failure to balance after adding or removing devices,
    or an unexpected shortfall in available space due to unequal allocation.
    
    To fix the regression and make data chunk allocation work
    again, set ctl->max_stripe_len back to the original SZ_1G, or
    space_info->chunk_size if that's smaller (the latter can happen if the
    user set space_info->chunk_size to less than 1 GiB via sysfs, or it's
    a 32 MiB system chunk with a hardcoded chunk_size and stripe_len).
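
    In code, the fix is a one-line clamp where the chunk_ctl is
    initialized (a sketch in the shape of the patch, using this message's
    field name; see fs/btrfs/volumes.c for the exact spelling):

            /* init_alloc_chunk_ctl, non-zoned (regular) policy: */
            ctl->max_chunk_size = READ_ONCE(space_info->chunk_size);
            /* was: ctl->max_stripe_len = ctl->max_chunk_size;  up to 10 GiB */
            ctl->max_stripe_len = min_t(u64, ctl->max_chunk_size, SZ_1G);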
    
    While researching the background of the earlier commits, I found that an
    identical fix was already proposed at:
    
      https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/
    
    The previous review missed one detail:  ctl->max_stripe_len is
    consumed before decide_stripe_size_regular() is called, so by the time
    that function runs it is too late for changes there to have any
    effect.  ctl->max_stripe_len is not used directly by
    decide_stripe_size_regular(), but the parameter heavily influences the
    per-device free space data presented to that function.
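
    The call order in btrfs_create_chunk() makes this concrete (abridged;
    only the order of the calls matters here):

            init_alloc_chunk_ctl(fs_devices, &ctl);  /* max_stripe_len set   */
            gather_device_info(fs_devices, &ctl, devices_info);
                                                     /* ...and consumed here */
            decide_stripe_size(fs_devices, &ctl, devices_info);
                                                     /* too late to matter   */
            create_chunk(trans, &ctl, devices_info);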
    
    Fixes: f6fca391 ("btrfs: store chunk size in space-info struct")
    CC: stable@vger.kernel.org # 6.1+
    Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/
    
    Reviewed-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
    Signed-off-by: David Sterba <dsterba@suse.com>