  1. 25 May, 2020 4 commits
    • btrfs: only check priority tickets for priority flushing · 666daa9f
      Josef Bacik authored
      In debugging a generic/320 failure on ppc64, Nikolay noticed that
      sometimes we'd ENOSPC out with plenty of space to reclaim if we had
      committed the transaction.  He further discovered that this was because
      there was a priority ticket that was small enough to fit in the free
      space currently in the space_info.
      
      Consider the following scenario.  There is no more space to reclaim in
      the fs without committing the transaction.  Assume there's 1MiB of space
      free in the space info, but there are pending normal tickets with 2MiB
      reservations.
      
      Now a priority ticket comes in with a .5MiB reservation.  Because we
      have normal tickets pending we add ourselves to the priority list,
      despite the fact that we could satisfy this reservation.
      
      The flushing machinery now gets to the point where it wants to commit
      the transaction, but because there's a .5MiB ticket on the priority list
      and we have 1MiB of free space we assume the ticket will be granted
      soon, so we bail without committing the transaction.
      
      Meanwhile the priority flushing does not commit the transaction, and
      eventually fails with an ENOSPC.  Then all other tickets are failed with
      ENOSPC because we were never able to actually commit the transaction.
      
      The fix for this is to simply grant the priority flusher its
      reservation, because there is space to satisfy it.  Priority flushers
      by definition take priority, so they are allowed to make their
      reservations ahead of any previously queued normal tickets.  By not
      adding this priority ticket to the list the normal flushing mechanisms
      will then commit the transaction and everything will continue normally.
      
      We still need to serialize ourselves with other priority tickets, so if
      there are any tickets on the priority list then we need to add ourselves
      to that list in order to maintain the serialization between priority
      tickets.
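      
      A minimal sketch of that decision follows; the types and the helper are
      simplified stand-ins for illustration, not the actual kernel code.  A
      priority ticket is granted immediately when no other priority tickets
      are queued and the space_info has enough free space; otherwise it is
      appended to the priority list so ordering among priority tickets is
      preserved.
      
        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>
        
        /* Simplified stand-ins for the kernel structures (illustrative only). */
        struct ticket {
                uint64_t bytes;
                struct ticket *next;
        };
        
        struct space_info {
                uint64_t free_bytes;
                struct ticket *priority_tickets;   /* pending priority tickets */
                struct ticket *normal_tickets;     /* pending normal tickets */
        };
        
        /*
         * Grant a priority ticket right away when no other priority tickets
         * are queued and there is room; pending normal tickets alone no longer
         * force it onto the list.  Otherwise append it so priority tickets
         * stay serialized among themselves.
         */
        static bool grant_priority_ticket(struct space_info *si, struct ticket *t)
        {
                if (si->priority_tickets == NULL && t->bytes <= si->free_bytes) {
                        si->free_bytes -= t->bytes;     /* satisfied immediately */
                        return true;
                }
        
                struct ticket **pos = &si->priority_tickets;
        
                while (*pos)
                        pos = &(*pos)->next;
                t->next = NULL;
                *pos = t;
                return false;
        }
      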
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: account for trans_block_rsv in may_commit_transaction · bb4f58a7
      Josef Bacik authored
      On ppc64le with a 64k page size (and thus a 64k block size) generic/320
      was failing, and debug output showed we were getting a premature ENOSPC
      with a bunch of space in btrfs_fs_info::trans_block_rsv.
      
      This meant there were still open transaction handles holding space, yet
      the flusher didn't commit the transaction because it deemed the freed
      space wouldn't be enough to satisfy the current reserve ticket.  Fix this
      by accounting for the space in trans_block_rsv when deciding whether the
      current transaction should be committed or not.
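      
      The idea can be sketched as follows, with simplified types and an
      illustrative helper rather than the kernel's may_commit_transaction()
      itself: the space still held by open transaction handles in
      trans_block_rsv is counted toward what a commit could free, so the
      commit is no longer skipped prematurely.
      
        #include <stdbool.h>
        #include <stdint.h>
        
        /* Illustrative model only: simplified reserves and ticket. */
        struct block_rsv      { uint64_t reserved; };
        struct reserve_ticket { uint64_t bytes; };
        
        /*
         * Would committing the transaction free enough space for this ticket?
         * The fix is to also count trans_rsv->reserved, the space still held
         * by open transaction handles, toward what a commit could release.
         */
        static bool commit_would_satisfy(const struct block_rsv *delayed_refs_rsv,
                                         const struct block_rsv *trans_rsv,
                                         uint64_t pinned_bytes,
                                         const struct reserve_ticket *ticket)
        {
                uint64_t reclaimable = pinned_bytes +
                                       delayed_refs_rsv->reserved +
                                       trans_rsv->reserved;   /* previously ignored */
        
                return reclaimable >= ticket->bytes;
        }
      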
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow to use up to 90% of the global block rsv for unlink · e6549c2a
      Josef Bacik authored
      We previously had a limit of stealing 50% of the global reserve for
      unlink.  This was from a time when the global reserve was used for the
      delayed refs as well.  However, now that those reservations are kept
      separate, the global reserve can be depleted much further to allow us to
      make progress for space-restoring operations like unlink.  Change the
      minimum amount of space required to be left in the global reserve to 10%.
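      
      Expressed as a sketch (purely illustrative arithmetic and names, not the
      kernel code), the check keeps 10% of the reserve's size back instead of
      50%:
      
        #include <stdbool.h>
        #include <stdint.h>
        
        /*
         * Illustrative check only: allow an unlink reservation to be stolen
         * from the global reserve as long as at least 10% of the reserve's
         * size would remain (the old limit required 50% to remain).
         */
        static bool can_steal_from_global_rsv(uint64_t rsv_size,
                                              uint64_t rsv_reserved,
                                              uint64_t bytes_needed)
        {
                uint64_t min_remaining = rsv_size / 10;     /* keep 10% back */
        
                return rsv_reserved >= bytes_needed &&
                       rsv_reserved - bytes_needed >= min_remaining;
        }
      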
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: improve global reserve stealing logic · 7f9fe614
      Josef Bacik authored
      For unlink transactions and block group removal
      btrfs_start_transaction_fallback_global_rsv will first try to start an
      ordinary transaction and if it fails it will fall back to reserving the
      required amount by stealing from the global reserve. This is problematic
      for all the same reasons we had with previous iterations of the ENOSPC
      handling: the thundering herd.  We get a bunch of failures all at once,
      everybody tries to allocate from the global reserve, some win and some
      lose, and we get an ENOSPC.
      
      Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL, which is
      used to mark unlink reservations, and by integrating this stealing
      logic into the normal ENOSPC infrastructure.  We still go through all of
      the normal flushing work, and at the moment we begin to fail all the
      tickets we try to satisfy any tickets that are allowed to steal by
      stealing from the global reserve.  If this works we start the flushing
      system over again just like we would with a normal ticket satisfaction.
      This serializes our global reserve stealing, so we don't have the
      thundering herd problem.
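      
      A simplified sketch of that flow follows; the names and structure are
      illustrative, not the kernel implementation.  When tickets are about to
      be failed, a ticket flagged as allowed to steal gets one attempt against
      the global reserve, and a successful steal satisfies the ticket so the
      flushing loop can start over.
      
        #include <stdbool.h>
        #include <stdint.h>
        
        struct reserve_ticket {
                uint64_t bytes;
                bool     steal;      /* reservation made with the steal flag set */
        };
        
        struct block_rsv { uint64_t reserved; };
        
        /*
         * Last-resort path once normal flushing is exhausted: a ticket that is
         * allowed to steal gets one attempt against the global reserve,
         * leaving a minimum amount behind.  Success satisfies the ticket, so
         * the flushing loop can start over instead of failing every ticket.
         */
        static bool steal_from_global_rsv(struct block_rsv *global_rsv,
                                          struct reserve_ticket *ticket,
                                          uint64_t min_remaining)
        {
                if (!ticket->steal)
                        return false;
                if (global_rsv->reserved < ticket->bytes + min_remaining)
                        return false;
        
                global_rsv->reserved -= ticket->bytes;
                ticket->bytes = 0;            /* ticket satisfied */
                return true;
        }
      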
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 08 Apr, 2020 1 commit
    • btrfs: fix reclaim counter leak of space_info objects · d611add4
      Filipe Manana authored
      Whenever we add a ticket to a space_info object we increment the object's
      reclaim_size counter with the ticket's bytes, and we decrement it by the
      corresponding amount only when we are able to grant the requested space
      to the ticket.  When we are not able to grant the space to a ticket, or
      when the ticket is removed due to a signal (e.g. an application has
      received SIGTERM from the terminal), we never decrement the counter by
      the corresponding bytes from the ticket.  This leak can result in the
      space reclaim code later doing much more work than necessary.  So fix it
      by decrementing the counter when those two cases happen as well.
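      
      A small sketch of the accounting rule (simplified types and hypothetical
      helper names, not the actual functions): reclaim_size is incremented when
      a ticket is queued, so every path that removes a ticket, including
      failure and removal after a signal, must decrement it.
      
        #include <errno.h>
        #include <stdint.h>
        
        struct space_info     { uint64_t reclaim_size; };
        struct reserve_ticket { uint64_t bytes; int error; };
        
        /* Undo the reclaim_size accounting whenever a ticket goes away. */
        static void remove_ticket(struct space_info *si, struct reserve_ticket *t)
        {
                si->reclaim_size -= t->bytes;
        }
        
        /* The ticket could not be satisfied: fail it and fix the counter too. */
        static void fail_ticket(struct space_info *si, struct reserve_ticket *t)
        {
                t->error = -ENOSPC;
                remove_ticket(si, t);         /* previously missing */
        }
        
        /* The waiting task caught a signal: drop the ticket, fix the counter. */
        static void abort_ticket(struct space_info *si, struct reserve_ticket *t)
        {
                t->error = -EINTR;
                remove_ticket(si, t);         /* previously missing */
        }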
      
      Fixes: db161806 ("btrfs: account ticket size at add/delete time")
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 23 Mar, 2020 3 commits
    • btrfs: account ticket size at add/delete time · db161806
      Nikolay Borisov authored
      Instead of iterating all pending tickets on the normal/priority list to
      sum their total size the cost can be amortized across ticket addition/
      removal. This turns O(n) + O(m) (where n is the size of the normal list
      and m of the priority list) into O(1). This will mostly have effect in
      workloads that experience heavy flushing.
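      
      Roughly, the change trades a walk over both ticket lists for a counter
      maintained at add/remove time; a sketch with simplified stand-in types,
      not the kernel code:
      
        #include <stdint.h>
        
        struct reserve_ticket {
                uint64_t bytes;
                struct reserve_ticket *next;
        };
        
        struct space_info {
                uint64_t reclaim_size;                    /* maintained incrementally */
                struct reserve_ticket *tickets;           /* normal list */
                struct reserve_ticket *priority_tickets;  /* priority list */
        };
        
        /* Old approach: an O(n) + O(m) walk over both lists for every query. */
        static uint64_t sum_ticket_bytes(const struct space_info *si)
        {
                const struct reserve_ticket *t;
                uint64_t total = 0;
        
                for (t = si->tickets; t; t = t->next)
                        total += t->bytes;
                for (t = si->priority_tickets; t; t = t->next)
                        total += t->bytes;
                return total;
        }
        
        /* New approach: amortize the cost across add/remove; a query is O(1). */
        static void ticket_added(struct space_info *si, struct reserve_ticket *t)
        {
                si->reclaim_size += t->bytes;
        }
        
        static void ticket_removed(struct space_info *si, struct reserve_ticket *t)
        {
                si->reclaim_size -= t->bytes;
        }
      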
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix btrfs_calc_reclaim_metadata_size calculation · fa121a26
      Josef Bacik authored
      I noticed while running my snapshot torture test that we were getting a
      lot of metadata chunks allocated with very little actually used.
      Digging into this we would commit the transaction, still not have enough
      space, and then force a chunk allocation.
      
      I noticed that we were barely flushing any delalloc at all, despite the
      fact that we had around 13GiB of outstanding delalloc reservations.  It
      turns out this is because of our btrfs_calc_reclaim_metadata_size()
      calculation.  It _only_ takes into account the outstanding ticket sizes,
      which isn't the whole story.  In this particular workload we're slowly
      filling up the disk, which means our overcommit space will suddenly
      become a lot less, and our outstanding reservations will be well more
      than what we can handle.  However we are only flushing based on our
      ticket size, which is much less than we need to actually reclaim.
      
      So fix btrfs_calc_reclaim_metadata_size() to take into account the
      overage in the case that we've gotten less available space suddenly.
      This makes it so we attempt to reclaim a lot more delalloc space, which
      allows us to make our reservations and we no longer are allocating a
      bunch of needless metadata chunks.
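      
      The adjustment can be sketched like this (an illustrative model with
      made-up parameter names, not the actual function): the reclaim target
      starts from the pending ticket bytes, and any overage of outstanding
      reservations beyond what the space_info can currently cover is added on
      top.
      
        #include <stdint.h>
        
        /*
         * Illustrative calculation: start from the pending ticket bytes, and
         * if outstanding reservations exceed what the space_info can currently
         * cover (now that overcommit room has shrunk), add that overage on top.
         */
        static uint64_t calc_reclaim_size(uint64_t ticket_bytes,
                                          uint64_t outstanding_reservations,
                                          uint64_t coverable_space)
        {
                uint64_t to_reclaim = ticket_bytes;
        
                if (outstanding_reservations > coverable_space)
                        to_reclaim += outstanding_reservations - coverable_space;
        
                return to_reclaim;
        }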
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: describe the space reservation system in general · 4b8b0528
      Josef Bacik authored
      Add another comment to cover how the space reservation system works
      generally.  This covers the actual reservation flow, as well as how
      flushing is handled.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 31 Jan, 2020 1 commit
    • btrfs: take overcommit into account in inc_block_group_ro · a30a3d20
      Josef Bacik authored
      inc_block_group_ro does a calculation to see whether we would have enough
      room left over if we marked this block group as read only, in order to
      decide whether it is ok to do so.
      
      The problem is this calculation _only_ works for data, where our used is
      always less than our total.  For metadata we will overcommit, so this
      will almost always fail for metadata.
      
      Fix this by exporting btrfs_can_overcommit, and then checking whether we
      have enough space to absorb the remaining free space in the block group
      we are trying to mark read only.  If we do, then we can mark this block
      group as read only.
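      
      A sketch of that check follows, with hypothetical signatures and a
      trivial model of the overcommit calculation rather than the exported
      kernel function: instead of comparing used against total, ask whether
      the space_info could absorb losing the block group's remaining free
      space.
      
        #include <stdbool.h>
        #include <stdint.h>
        
        struct space_info  { uint64_t total, used; };
        struct block_group { uint64_t length, used; };
        
        /*
         * Hypothetical stand-in for the overcommit-aware check: allow going
         * past the allocated total by a caller-supplied allowance.  The real
         * btrfs_can_overcommit() derives that allowance from unallocated
         * device space and the flush level.
         */
        static bool can_overcommit(const struct space_info *si, uint64_t bytes,
                                   uint64_t overcommit_allowance)
        {
                return si->used + bytes <= si->total + overcommit_allowance;
        }
        
        /*
         * Can this block group be marked read only?  Ask whether the
         * space_info could absorb losing the group's remaining free space,
         * rather than comparing used against total (which almost always fails
         * for overcommitted metadata).
         */
        static bool can_mark_read_only(const struct space_info *si,
                                       const struct block_group *bg,
                                       uint64_t overcommit_allowance)
        {
                uint64_t free_in_group = bg->length - bg->used;
        
                return can_overcommit(si, free_in_group, overcommit_allowance);
        }
      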
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 20 Jan, 2020 1 commit
  6. 18 Nov, 2019 5 commits
  7. 25 Oct, 2019 1 commit
    • Btrfs: fix race leading to metadata space leak after task received signal · 0cab7acc
      Filipe Manana authored
      When a task that is allocating metadata needs to wait for the async
      reclaim job to process its ticket and gets a signal (because it was killed
      for example) before doing the wait, the task ends up erroring out but
      with space reserved for its ticket, which never gets released, resulting
      in a metadata space leak (more specifically a leak in the bytes_may_use
      counter of the metadata space_info object).
      
      Here's the sequence of steps leading to the space leak:
      
      1) A task tries to create a file for example, so it ends up trying to
         start a transaction at btrfs_create();
      
      2) The filesystem is currently in a state where there is not enough
         metadata free space to satisfy the transaction's needs. So at
         space-info.c:__reserve_metadata_bytes() we create a ticket and
         add it to the list of tickets of the space info object. Also,
         because the metadata async reclaim job is not running, we queue
         a job to run metadata reclaim;
      
      3) In the meanwhile the task receives a signal (like SIGTERM from
         a kill command for example);
      
      4) After queueing the async reclaim job, at __reserve_metadata_bytes(),
         we unlock the metadata space info and call handle_reserve_ticket();
      
      5) That last function calls wait_reserve_ticket(), which acquires the
         lock from the metadata space info. Then in the first iteration of
         its while loop, it calls prepare_to_wait_event(), which returns
         -ERESTARTSYS because the task has a pending signal. As a result,
         we set the error field of the ticket to -EINTR and exit the while
         loop without deleting the ticket from the list of tickets (in the
         space info object). After exiting the loop we unlock the space info;
      
      6) The async reclaim job is able to release enough metadata, acquires
         the metadata space info's lock and then reserves space for the ticket,
         since the ticket is still in the list of (non-priority) tickets. The
         space reservation happens at btrfs_try_granting_tickets(), called from
         maybe_fail_all_tickets(). This increments the bytes_may_use counter
         from the metadata space info object, sets the ticket's bytes field to
         zero (meaning success, that space was reserved) and removes it from
         the list of tickets;
      
      7) wait_reserve_ticket() returns, with the error field of the ticket
         set to -EINTR. Then handle_reserve_ticket() just propagates that error
         to the caller. Because an error was returned, the caller does not
         release the reserved space, since the expectation is that any error
         means no space was reserved.
      
      Fix this by removing the ticket from the list, while holding the space
      info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
      an error.
      
      Also add some comments and an assertion to guarantee we never end up with
      a ticket that has an error set and a bytes counter field set to zero, to
      more easily detect regressions in the future.
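      
      The shape of the fix can be sketched like this (a simplified userspace
      model, not the kernel wait loop): if the wait is interrupted by a signal,
      the ticket is unlinked from the space_info's list before bailing out, so
      the async reclaim job can no longer grant space to a ticket whose owner
      has already given up.
      
        #include <errno.h>
        #include <stdint.h>
        
        #define ERESTARTSYS 512   /* kernel-internal value, not in userspace errno.h */
        
        struct reserve_ticket {
                uint64_t bytes;
                int      error;
                struct reserve_ticket *next;
        };
        
        struct space_info {
                /* The real code holds space_info->lock around all of this. */
                struct reserve_ticket *tickets;
        };
        
        static void list_del_ticket(struct space_info *si, struct reserve_ticket *t)
        {
                struct reserve_ticket **pos = &si->tickets;
        
                while (*pos && *pos != t)
                        pos = &(*pos)->next;
                if (*pos)
                        *pos = t->next;
        }
        
        /*
         * Illustrative wait loop: if the wait is interrupted by a signal,
         * unlink the ticket before giving up, so async reclaim can no longer
         * grant (and thereby leak) space to a ticket nobody will consume.
         */
        static void wait_reserve_ticket(struct space_info *si,
                                        struct reserve_ticket *t,
                                        int (*wait_interruptible)(void))
        {
                while (t->bytes > 0 && t->error == 0) {
                        /* stand-in for prepare_to_wait_event() + schedule() */
                        if (wait_interruptible() == -ERESTARTSYS) {
                                t->error = -EINTR;
                                list_del_ticket(si, t);    /* the actual fix */
                                break;
                        }
                }
        }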
      
      This issue could be triggered sporadically by some test cases from fstests
      such as generic/269 for example, which tries to fill a filesystem and then
      kills fsstress processes running in the background.
      
      When this issue happens, we get a warning in syslog/dmesg when unmounting
      the filesystem, like the following:
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
        (...)
        CPU: 0 PID: 13240 Comm: umount Tainted: G        W    L    5.3.0-rc8-btrfs-next-48+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
        (...)
        RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
        RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
        RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
        R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
        FS:  00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         close_ctree+0x1ad/0x390 [btrfs]
         generic_shutdown_super+0x6c/0x110
         kill_anon_super+0xe/0x30
         btrfs_kill_super+0x12/0xa0 [btrfs]
         deactivate_locked_super+0x3a/0x70
         cleanup_mnt+0xb4/0x160
         task_work_run+0x7e/0xc0
         exit_to_usermode_loop+0xfa/0x100
         do_syscall_64+0x1cb/0x220
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f4274d2cb37
        (...)
        RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
        RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
        R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
        R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
        softirqs last  enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace bcf4b235461b26f6 ]---
        BTRFS info (device sdb): space_info 4 has 19116032 free, is full
        BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
        BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
      
      Fixes: 374bf9c5 ("btrfs: unify error handling for ticket flushing")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 09 Sep, 2019 18 commits
  9. 02 Jul, 2019 6 commits