• Filipe Manana's avatar
    Btrfs: fix race leading to metadata space leak after task received signal · 0cab7acc
    Filipe Manana authored
    When a task that is allocating metadata needs to wait for the async
    reclaim job to process its ticket and gets a signal (because it was killed
    for example) before doing the wait, the task ends up erroring out but
    with space reserved for its ticket, which never gets released, resulting
    in a metadata space leak (more specifically a leak in the bytes_may_use
    counter of the metadata space_info object).
    
    Here's the sequence of steps leading to the space leak:
    
    1) A task tries to create a file for example, so it ends up trying to
       start a transaction at btrfs_create();
    
    2) The filesystem is currently in a state where there is not enough
       metadata free space to satisfy the transaction's needs. So at
       space-info.c:__reserve_metadata_bytes() we create a ticket and
       add it to the list of tickets of the space info object. Also,
       because the metadata async reclaim job is not running, we queue
       a job ro run metadata reclaim;
    
    3) In the meanwhile the task receives a signal (like SIGTERM from
       a kill command for example);
    
    4) After queing the async reclaim job, at __reserve_metadata_bytes(),
       we unlock the metadata space info and call handle_reserve_ticket();
    
    5) That last function calls wait_reserve_ticket(), which acquires the
       lock from the metadata space info. Then in the first iteration of
       its while loop, it calls prepare_to_wait_event(), which returns
       -ERESTARTSYS because the task has a pending signal. As a result,
       we set the error field of the ticket to -EINTR and exit the while
       loop without deleting the ticket from the list of tickets (in the
       space info object). After exiting the loop we unlock the space info;
    
    6) The async reclaim job is able to release enough metadata, acquires
       the metadata space info's lock and then reserves space for the ticket,
       since the ticket is still in the list of (non-priority) tickets. The
       space reservation happens at btrfs_try_granting_tickets(), called from
       maybe_fail_all_tickets(). This increments the bytes_may_use counter
       from the metadata space info object, sets the ticket's bytes field to
       zero (meaning success, that space was reserved) and removes it from
       the list of tickets;
    
    7) wait_reserve_ticket() returns, with the error field of the ticket
       set to -EINTR. Then handle_reserve_ticket() just propagates that error
       to the caller. Because an error was returned, the caller does not
       release the reserved space, since the expectation is that any error
       means no space was reserved.
    
    Fix this by removing the ticket from the list, while holding the space
    info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
    an error.
    
    Also add some comments and an assertion to guarantee we never end up with
    a ticket that has an error set and a bytes counter field set to zero, to
    more easily detect regressions in the future.
    
    This issue could be triggered sporadically by some test cases from fstests
    such as generic/269 for example, which tries to fill a filesystem and then
    kills fsstress processes running in the background.
    
    When this issue happens, we get a warning in syslog/dmesg when unmounting
    the filesystem, like the following:
    
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
      (...)
      CPU: 0 PID: 13240 Comm: umount Tainted: G        W    L    5.3.0-rc8-btrfs-next-48+ #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
      RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
      (...)
      RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
      RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
      RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
      R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
      R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
      FS:  00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       close_ctree+0x1ad/0x390 [btrfs]
       generic_shutdown_super+0x6c/0x110
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0xa0 [btrfs]
       deactivate_locked_super+0x3a/0x70
       cleanup_mnt+0xb4/0x160
       task_work_run+0x7e/0xc0
       exit_to_usermode_loop+0xfa/0x100
       do_syscall_64+0x1cb/0x220
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x7f4274d2cb37
      (...)
      RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
      RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
      RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
      R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
      R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
      irq event stamp: 0
      hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
      softirqs last  enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
      softirqs last disabled at (0): [<0000000000000000>] 0x0
      ---[ end trace bcf4b235461b26f6 ]---
      BTRFS info (device sdb): space_info 4 has 19116032 free, is full
      BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
      BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
      BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
      BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
      BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
      BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
    
    Fixes: 374bf9c5 ("btrfs: unify error handling for ticket flushing")
    Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    0cab7acc
space-info.c 31.8 KB