1. 25 Jul, 2022 9 commits
    • David Sterba's avatar
      btrfs: send: remove old TODO regarding ERESTARTSYS · cec3dad9
      David Sterba authored
      
      The whole send operation is restartable and handling properly a buffer
      write may not be easy. We can't know what caused that and if a short
      delay and retry will fix it or how many retries should be performed in
      case it's a temporary condition.
      
      The error value is returned to the ioctl caller so in case it's
      transient problem, the user would be notified about the reason. Remove
      the TODO note as there's no plan to handle ERESTARTSYS.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cec3dad9
    • David Sterba's avatar
      btrfs: send: simplify includes · 8234d3f6
      David Sterba authored
      
      We don't need the whole ctree.h in send.h, none of the data types
      defined there are used.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8234d3f6
    • Omar Sandoval's avatar
      btrfs: send: enable support for stream v2 and compressed writes · d6815592
      Omar Sandoval authored
      
      Now that the new support is implemented, allow the ioctl to accept v2
      and the compressed flag, and update the version in sysfs.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d6815592
    • Omar Sandoval's avatar
      btrfs: send: send compressed extents with encoded writes · 3ea4dc5b
      Omar Sandoval authored
      
      Now that all of the pieces are in place, we can use the ENCODED_WRITE
      command to send compressed extents when appropriate.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3ea4dc5b
    • Omar Sandoval's avatar
      btrfs: send: get send buffer pages for protocol v2 · a4b333f2
      Omar Sandoval authored
      
      For encoded writes in send v2, we will get the encoded data with
      btrfs_encoded_read_regular_fill_pages(), which expects a list of raw
      pages. To avoid extra buffers and copies, we should read directly into
      the send buffer. Therefore, we need the raw pages for the send buffer.
      
      We currently allocate the send buffer with kvmalloc(), which may return
      a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the
      pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page().
      However, the buffer size we use (144K) is not a power of two, which in
      theory is not guaranteed to return a page-aligned buffer, and in
      practice would waste a lot of memory due to rounding up to the next
      power of two. 144K is large enough that it usually gets allocated with
      vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc()
      and save the pages in an array.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a4b333f2
    • Omar Sandoval's avatar
      btrfs: send: write larger chunks when using stream v2 · 356bbbb6
      Omar Sandoval authored
      
      The length field of the send stream TLV header is 16 bits. This means
      that the maximum amount of data that can be sent for one write is 64K
      minus one. However, encoded writes must be able to send the maximum
      compressed extent (128K) in one command, or more. To support this, send
      stream version 2 encodes the DATA attribute differently: it has no
      length field, and the length is implicitly up to the end of containing
      command (which has a 32bit length field). Although this is necessary
      for encoded writes, normal writes can benefit from it, too.
      
      Also add a check to enforce that the DATA attribute is last. It is only
      strictly necessary for v2, but we might as well make v1 consistent with
      it.
      
      For v2, let's bump up the send buffer to the maximum compressed extent
      size plus 16K for the other metadata (144K total). Since this will most
      likely be vmalloc'd (and always will be after the next commit), we round
      it up to the next page since we might as well use the rest of the page
      on systems with >16K pages.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      356bbbb6
    • Omar Sandoval's avatar
      btrfs: send: add stream v2 definitions · b7c14f23
      Omar Sandoval authored
      
      This adds the definitions of the new commands for send stream version 2
      and their respective attributes: fallocate, FS_IOC_SETFLAGS (a.k.a.
      chattr), and encoded writes. It also documents two changes to the send
      stream format in v2: the receiver shouldn't assume a maximum command
      size, and the DATA attribute is encoded differently to allow for writes
      larger than 64k. These will be implemented in subsequent changes, and
      then the ioctl will accept the new version and flag.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b7c14f23
    • Omar Sandoval's avatar
      btrfs: send: explicitly number commands and attributes · 54cab6af
      Omar Sandoval authored
      Commit e77fbf99
      
       ("btrfs: send: prepare for v2 protocol") added
      _BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the
      version plus 1, but as written this creates gaps in the number space.
      
      The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is
      accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2
      has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that
      23 and 24 are valid commands.
      
      Instead, let's explicitly number all of the commands, attributes, and
      sentinel MAX constants.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54cab6af
    • Omar Sandoval's avatar
      btrfs: send: remove unused send_ctx::{total,cmd}_send_size · ca182acc
      Omar Sandoval authored
      
      We collect these statistics but have never exposed them in any way. I
      also didn't find any patches that ever attempted to make use of them.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca182acc
  2. 15 Jul, 2022 1 commit
  3. 17 May, 2022 1 commit
    • Filipe Manana's avatar
      btrfs: send: avoid trashing the page cache · 152555b3
      Filipe Manana authored
      
      A send operation reads extent data using the buffered IO path for getting
      extent data to send in write commands and this is both because it's simple
      and to make use of the generic readahead infrastructure, which results in
      a massive speedup.
      
      However this fills the page cache with data that, most of the time, is
      really only used by the send operation - once the write commands are sent,
      it's not useful to have the data in the page cache anymore. For large
      snapshots, bringing all data into the page cache eventually leads to the
      need to evict other data from the page cache that may be more useful for
      applications (and kernel subsystems).
      
      Even if extents are shared with the subvolume on which a snapshot is based
      on and the data is currently on the page cache due to being read through
      the subvolume, attempting to read the data through the snapshot will
      always result in bringing a new copy of the data into another location in
      the page cache (there's currently no shared memory for shared extents).
      
      So make send evict the data it has read before if when it first opened
      the inode, its mapping had no pages currently loaded: when
      inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
      based on the return value of filemap_range_has_page() before reading an
      extent because the generic readahead mechanism may read pages beyond the
      range we request (and it very often does it), which means a call to
      filemap_range_has_page() will return true due to the readahead that was
      triggered when processing a previous extent - we don't have a simple way
      to distinguish this case from the case where the data was brought into
      the page cache through someone else. So checking for the mapping number
      of pages being 0 when we first open the inode is simple, cheap and it
      generally accomplishes the goal of not trashing the page cache - the
      only exception is if part of data was previously loaded into the page
      cache through the snapshot by some other process, in that case we end
      up not evicting any data send brings into the page cache, just like
      before this change - but that however is not the common case.
      
      Example scenario, on a box with 32G of RAM:
      
        $ btrfs subvolume create /mnt/sv1
        $ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
      
        $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
      
        $ free -m
                       total        used        free      shared  buff/cache   available
        Mem:           31937         186       26866           0        4883       31297
        Swap:           8188           0        8188
      
        # After this we get less 4G of free memory.
        $ btrfs send /mnt/snap1 >/dev/null
      
        $ free -m
                       total        used        free      shared  buff/cache   available
        Mem:           31937         186       22814           0        8935       31297
        Swap:           8188           0        8188
      
      The same, obviously, applies to an incremental send.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      152555b3
  4. 16 May, 2022 9 commits
  5. 09 May, 2022 1 commit
  6. 08 May, 2022 1 commit
  7. 14 Mar, 2022 2 commits
  8. 09 Feb, 2022 1 commit
  9. 07 Jan, 2022 1 commit
    • Filipe Manana's avatar
      btrfs: make send work with concurrent block group relocation · d96b3424
      Filipe Manana authored
      We don't allow send and balance/relocation to run in parallel in order
      to prevent send failing or silently producing some bad stream. This is
      because while send is using an extent (specially metadata) or about to
      read a metadata extent and expecting it belongs to a specific parent
      node, relocation can run, the transaction used for the relocation is
      committed and the extent gets reallocated while send is still using the
      extent, so it ends up with a different content than expected. This can
      result in just failing to read a metadata extent due to failure of the
      validation checks (parent transid, level, etc), failure to find a
      backreference for a data extent, and other unexpected failures. Besides
      reallocation, there's also a similar problem of an extent getting
      discarded when it's unpinned after the transaction used for block group
      relocation is committed.
      
      The restriction between balance and send was added in commit 9e967495
      ("Btrfs: prevent send failures and crashes due to concurrent relocation"),
      kernel 5.3, while the more general restriction between send and relocation
      was added in commit 1cea5cf0
      
       ("btrfs: ensure relocation never runs
      while we have send operations running"), kernel 5.14.
      
      Both send and relocation can be very long running operations. Relocation
      because it has to do a lot of IO and expensive backreference lookups in
      case there are many snapshots, and send due to read IO when operating on
      very large trees. This makes it inconvenient for users and tools to deal
      with scheduling both operations.
      
      For zoned filesystem we also have automatic block group relocation, so
      send can fail with -EAGAIN when users least expect it or send can end up
      delaying the block group relocation for too long. In the future we might
      also get the automatic block group relocation for non zoned filesystems.
      
      This change makes it possible for send and relocation to run in parallel.
      This is achieved the following way:
      
      1) For all tree searches, send acquires a read lock on the commit root
         semaphore;
      
      2) After each tree search, and before releasing the commit root semaphore,
         the leaf is cloned and placed in the search path (struct btrfs_path);
      
      3) After releasing the commit root semaphore, the changed_cb() callback
         is invoked, which operates on the leaf and writes commands to the pipe
         (or file in case send/receive is not used with a pipe). It's important
         here to not hold a lock on the commit root semaphore, because if we did
         we could deadlock when sending and receiving to the same filesystem
         using a pipe - the send task blocks on the pipe because it's full, the
         receive task, which is the only consumer of the pipe, triggers a
         transaction commit when attempting to create a subvolume or reserve
         space for a write operation for example, but the transaction commit
         blocks trying to write lock the commit root semaphore, resulting in a
         deadlock;
      
      4) Before moving to the next key, or advancing to the next change in case
         of an incremental send, check if a transaction used for relocation was
         committed (or is about to finish its commit). If so, release the search
         path(s) and restart the search, to where we were before, so that we
         don't operate on stale extent buffers. The search restarts are always
         possible because both the send and parent roots are RO, and no one can
         add, remove of update keys (change their offset) in RO trees - the
         only exception is deduplication, but that is still not allowed to run
         in parallel with send;
      
      5) Periodically check if there is contention on the commit root semaphore,
         which means there is a transaction commit trying to write lock it, and
         release the semaphore and reschedule if there is contention, so as to
         avoid causing any significant delays to transaction commits.
      
      This leaves some room for optimizations for send to have less path
      releases and re searching the trees when there's relocation running, but
      for now it's kept simple as it performs quite well (on very large trees
      with resulting send streams in the order of a few hundred gigabytes).
      
      Test case btrfs/187, from fstests, stresses relocation, send and
      deduplication attempting to run in parallel, but without verifying if send
      succeeds and if it produces correct streams. A new test case will be added
      that exercises relocation happening in parallel with send and then checks
      that send succeeds and the resulting streams are correct.
      
      A final note is that for now this still leaves the mutual exclusion
      between send operations and deduplication on files belonging to a root
      used by send operations. A solution for that will be slightly more complex
      but it will eventually be built on top of this change.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d96b3424
  10. 03 Jan, 2022 4 commits
  11. 29 Oct, 2021 1 commit
    • David Sterba's avatar
      btrfs: send: prepare for v2 protocol · e77fbf99
      David Sterba authored
      
      This is preparatory work for send protocol update to version 2 and
      higher.
      
      We have many pending protocol update requests but still don't have the
      basic protocol rev in place, the first thing that must happen is to do
      the actual versioning support.
      
      The protocol version is u32 and is a new member in the send ioctl
      struct. Validity of the version field is backed by a new flag bit. Old
      kernels would fail when a higher version is requested. Version protocol
      0 will pick the highest supported version, BTRFS_SEND_STREAM_VERSION,
        that's also exported in sysfs.
      
      The version is still unchanged and will be increased once we have new
      incompatible commands or stream updates.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e77fbf99
  12. 25 Oct, 2021 1 commit
  13. 23 Aug, 2021 2 commits
  14. 22 Jun, 2021 5 commits
    • Filipe Manana's avatar
      btrfs: send: fix crash when memory allocations trigger reclaim · 35b22c19
      Filipe Manana authored
      When doing a send we don't expect the task to ever start a transaction
      after the initial check that verifies if commit roots match the regular
      roots. This is because after that we set current->journal_info with a
      stub (special value) that signals we are in send context, so that we take
      a read lock on an extent buffer when reading it from disk and verifying
      it is valid (its generation matches the generation stored in the parent).
      This stub was introduced in 2014 by commit a26e8c9f ("Btrfs: don't
      clear uptodate if the eb is under IO") in order to fix a concurrency issue
      between send and balance.
      
      However there is one particular exception where we end up needing to start
      a transaction and when this happens it results in a crash with a stack
      trace like the following:
      
      [60015.902283] kernel: WARNING: CPU: 3 PID: 58159 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x21/0x80
      [60015.902292] kernel: Modules linked in: uinput rfcomm snd_seq_dummy (...)
      [60015.902384] kernel: CPU: 3 PID: 58159 Comm: btrfs Not tainted 5.12.9-300.fc34.x86_64 #1
      [60015.902387] kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XN-WIFI, BIOS F6 12/24/2015
      [60015.902389] kernel: RIP: 0010:kfence_protect_page+0x21/0x80
      [60015.902393] kernel: Code: ff 0f 1f 84 00 00 00 00 00 55 48 89 fd (...)
      [60015.902396] kernel: RSP: 0018:ffff9fb583453220 EFLAGS: 00010246
      [60015.902399] kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fb583453224
      [60015.902401] kernel: RDX: ffff9fb583453224 RSI: 0000000000000000 RDI: 0000000000000000
      [60015.902402] kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      [60015.902404] kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
      [60015.902406] kernel: R13: ffff9fb583453348 R14: 0000000000000000 R15: 0000000000000001
      [60015.902408] kernel: FS:  00007f158e62d8c0(0000) GS:ffff93bd37580000(0000) knlGS:0000000000000000
      [60015.902410] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [60015.902412] kernel: CR2: 0000000000000039 CR3: 00000001256d2000 CR4: 00000000000506e0
      [60015.902414] kernel: Call Trace:
      [60015.902419] kernel:  kfence_unprotect+0x13/0x30
      [60015.902423] kernel:  page_fault_oops+0x89/0x270
      [60015.902427] kernel:  ? search_module_extables+0xf/0x40
      [60015.902431] kernel:  ? search_bpf_extables+0x57/0x70
      [60015.902435] kernel:  kernelmode_fixup_or_oops+0xd6/0xf0
      [60015.902437] kernel:  __bad_area_nosemaphore+0x142/0x180
      [60015.902440] kernel:  exc_page_fault+0x67/0x150
      [60015.902445] kernel:  asm_exc_page_fault+0x1e/0x30
      [60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580
      [60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 (...)
      [60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246
      [60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
      [60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039
      [60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000
      [60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001
      [60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000
      [60015.902468] kernel:  btrfs_commit_inode_delayed_inode+0x5d/0x120
      [60015.902473] kernel:  btrfs_evict_inode+0x2c5/0x3f0
      [60015.902476] kernel:  evict+0xd1/0x180
      [60015.902480] kernel:  inode_lru_isolate+0xe7/0x180
      [60015.902483] kernel:  __list_lru_walk_one+0x77/0x150
      [60015.902487] kernel:  ? iput+0x1a0/0x1a0
      [60015.902489] kernel:  ? iput+0x1a0/0x1a0
      [60015.902491] kernel:  list_lru_walk_one+0x47/0x70
      [60015.902495] kernel:  prune_icache_sb+0x39/0x50
      [60015.902497] kernel:  super_cache_scan+0x161/0x1f0
      [60015.902501] kernel:  do_shrink_slab+0x142/0x240
      [60015.902505] kernel:  shrink_slab+0x164/0x280
      [60015.902509] kernel:  shrink_node+0x2c8/0x6e0
      [60015.902512] kernel:  do_try_to_free_pages+0xcb/0x4b0
      [60015.902514] kernel:  try_to_free_pages+0xda/0x190
      [60015.902516] kernel:  __alloc_pages_slowpath.constprop.0+0x373/0xcc0
      [60015.902521] kernel:  ? __memcg_kmem_charge_page+0xc2/0x1e0
      [60015.902525] kernel:  __alloc_pages_nodemask+0x30a/0x340
      [60015.902528] kernel:  pipe_write+0x30b/0x5c0
      [60015.902531] kernel:  ? set_next_entity+0xad/0x1e0
      [60015.902534] kernel:  ? switch_mm_irqs_off+0x58/0x440
      [60015.902538] kernel:  __kernel_write+0x13a/0x2b0
      [60015.902541] kernel:  kernel_write+0x73/0x150
      [60015.902543] kernel:  send_cmd+0x7b/0xd0
      [60015.902545] kernel:  send_extent_data+0x5a3/0x6b0
      [60015.902549] kernel:  process_extent+0x19b/0xed0
      [60015.902551] kernel:  btrfs_ioctl_send+0x1434/0x17e0
      [60015.902554] kernel:  ? _btrfs_ioctl_send+0xe1/0x100
      [60015.902557] kernel:  _btrfs_ioctl_send+0xbf/0x100
      [60015.902559] kernel:  ? enqueue_entity+0x18c/0x7b0
      [60015.902562] kernel:  btrfs_ioctl+0x185f/0x2f80
      [60015.902564] kernel:  ? psi_task_change+0x84/0xc0
      [60015.902569] kernel:  ? _flat_send_IPI_mask+0x21/0x40
      [60015.902572] kernel:  ? check_preempt_curr+0x2f/0x70
      [60015.902576] kernel:  ? selinux_file_ioctl+0x137/0x1e0
      [60015.902579] kernel:  ? expand_files+0x1cb/0x1d0
      [60015.902582] kernel:  ? __x64_sys_ioctl+0x82/0xb0
      [60015.902585] kernel:  __x64_sys_ioctl+0x82/0xb0
      [60015.902588] kernel:  do_syscall_64+0x33/0x40
      [60015.902591] kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [60015.902595] kernel: RIP: 0033:0x7f158e38f0ab
      [60015.902599] kernel: Code: ff ff ff 85 c0 79 9b (...)
      [60015.902602] kernel: RSP: 002b:00007ffcb2519bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [60015.902605] kernel: RAX: ffffffffffffffda RBX: 00007ffcb251ae00 RCX: 00007f158e38f0ab
      [60015.902607] kernel: RDX: 00007ffcb2519cf0 RSI: 0000000040489426 RDI: 0000000000000004
      [60015.902608] kernel: RBP: 0000000000000004 R08: 00007f158e297640 R09: 00007f158e297640
      [60015.902610] kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
      [60015.902612] kernel: R13: 0000000000000002 R14: 00007ffcb251aee0 R15: 0000558c1a83e2a0
      [60015.902615] kernel: ---[ end trace 7bbc33e23bb887ae ]---
      
      This happens because when writing to the pipe, by calling kernel_write(),
      we end up doing page allocations using GFP_HIGHUSER | __GFP_ACCOUNT as the
      gfp flags, which allow reclaim to happen if there is memory pressure. This
      allocation happens at fs/pipe.c:pipe_write().
      
      If the reclaim is triggered, inode eviction can be triggered and that in
      turn can result in starting a transaction if the inode has a link count
      of 0. The transaction start happens early on during eviction, when we call
      btrfs_commit_inode_delayed_inode() at btrfs_evict_inode(). This happens if
      there is currently an open file descriptor for an inode with a link count
      of 0 and the reclaim task gets a reference on the inode before that
      descriptor is closed, in which case the reclaim task ends up doing the
      final iput that triggers the inode eviction.
      
      When we have assertions enabled (CONFIG_BTRFS_ASSERT=y), this triggers
      the following assertion at transaction.c:start_transaction():
      
          /* Send isn't supposed to start transactions. */
          ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);
      
      And when assertions are not enabled, it triggers a crash since after that
      assertion we cast current->journal_info into a transaction handle pointer
      and then dereference it:
      
         if (current->journal_info) {
             WARN_ON(type & TRANS_EXTWRITERS);
             h = current->journal_info;
             refcount_inc(&h->use_count);
             (...)
      
      Which obviously results in a crash due to an invalid memory access.
      
      The same type of issue can happen during other memory allocations we
      do directly in the send code with kmalloc (and friends) as they use
      GFP_KERNEL and therefore may trigger reclaim too, which started to
      happen since 2016 after commit e780b0d1 ("btrfs: send: use
      GFP_KERNEL everywhere").
      
      The issue could be solved by setting up a NOFS context for the entire
      send operation so that reclaim could not be triggered when allocating
      memory or pages through kernel_write(). However that is not very friendly
      and we can in fact get rid of the send stub because:
      
      1) The stub was introduced way back in 2014 by commit a26e8c9f
         ("Btrfs: don't clear uptodate if the eb is under IO") to solve an
         issue exclusive to when send and balance are running in parallel,
         however there were other problems between balance and send and we do
         not allow anymore to have balance and send run concurrently since
         commit 9e967495 ("Btrfs: prevent send failures and crashes due
         to concurrent relocation"). More generically the issues are between
         send and relocation, and that last commit eliminated only the
         possibility of having send and balance run concurrently, but shrinking
         a device also can trigger relocation, and on zoned filesystems we have
         relocation of partially used block groups triggered automatically as
         well. The previous patch that has a subject of:
      
         "btrfs: ensure relocation never runs while we have send operations running"
      
         Addresses all the remaining cases that can trigger relocation.
      
      2) We can actually allow starting and even committing transactions while
         in a send context if needed because send is not holding any locks that
         would block the start or the commit of a transaction.
      
      So get rid of all the logic added by commit a26e8c9f
      
       ("Btrfs: don't
      clear uptodate if the eb is under IO"). We can now always call
      clear_extent_buffer_uptodate() at verify_parent_transid() since send is
      the only case that uses commit roots without having a transaction open or
      without holding the commit_root_sem.
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtRQ57=qXo3kygwpwEBOU_CA_eKvdmjP52sU=eFvuVOEGw@mail.gmail.com/
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      35b22c19
    • Filipe Manana's avatar
      btrfs: ensure relocation never runs while we have send operations running · 1cea5cf0
      Filipe Manana authored
      Relocation and send do not play well together because while send is
      running a block group can be relocated, a transaction committed and
      the respective disk extents get re-allocated and written to or discarded
      while send is about to do something with the extents.
      
      This was explained in commit 9e967495 ("Btrfs: prevent send failures
      and crashes due to concurrent relocation"), which prevented balance and
      send from running in parallel but it did not address one remaining case
      where chunk relocation can happen: shrinking a device (and device deletion
      which shrinks a device's size to 0 before deleting the device).
      
      We also have now one more case where relocation is triggered: on zoned
      filesystems partially used block groups get relocated by a background
      thread, introduced in commit 18bb8bbf
      
       ("btrfs: zoned: automatically
      reclaim zones").
      
      So make sure that instead of preventing balance from running when there
      are ongoing send operations, we prevent relocation from happening.
      This uses the infrastructure recently added by a patch that has the
      subject: "btrfs: add cancellable chunk relocation support".
      
      Also it adds a spinlock used exclusively for the exclusivity between
      send and relocation, as before fs_info->balance_mutex was used, which
      would make an attempt to run send to block waiting for balance to
      finish, which can take a lot of time on large filesystems.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1cea5cf0
    • David Sterba's avatar
      btrfs: fix typos in comments · 1a9fd417
      David Sterba authored
      
      Fix typos that have snuck in since the last round. Found by codespell.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1a9fd417
    • Baokun Li's avatar
      btrfs: send: use list_move_tail instead of list_del/list_add_tail · bb930007
      Baokun Li authored
      
      Use list_move_tail() instead of list_del() + list_add_tail() as it's
      doing the same thing and allows further cleanups.  Open code
      name_cache_used() as there is only one user.
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb930007
    • Filipe Manana's avatar
      btrfs: send: fix invalid path for unlink operations after parent orphanization · d8ac76cd
      Filipe Manana authored
      
      During an incremental send operation, when processing the new references
      for the current inode, we might send an unlink operation for another inode
      that has a conflicting path and has more than one hard link. However this
      path was computed and cached before we processed previous new references
      for the current inode. We may have orphanized a directory of that path
      while processing a previous new reference, in which case the path will
      be invalid and cause the receiver process to fail.
      
      The following reproducer triggers the problem and explains how/why it
      happens in its comments:
      
        $ cat test-send-unlink.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        # Create our test files and directory. Inode 259 (file3) has two hard
        # links.
        touch $MNT/file1
        touch $MNT/file2
        touch $MNT/file3
      
        mkdir $MNT/A
        ln $MNT/file3 $MNT/A/hard_link
      
        # Filesystem looks like:
        #
        # .                                     (ino 256)
        # |----- file1                          (ino 257)
        # |----- file2                          (ino 258)
        # |----- file3                          (ino 259)
        # |----- A/                             (ino 260)
        #        |---- hard_link                (ino 259)
        #
      
        # Now create the base snapshot, which is going to be the parent snapshot
        # for a later incremental send.
        btrfs subvolume snapshot -r $MNT $MNT/snap1
        btrfs send -f /tmp/snap1.send $MNT/snap1
      
        # Move inode 257 into directory inode 260. This results in computing the
        # path for inode 260 as "/A" and caching it.
        mv $MNT/file1 $MNT/A/file1
      
        # Move inode 258 (file2) into directory inode 260, with a name of
        # "hard_link", moving first inode 259 away since it currently has that
        # location and name.
        mv $MNT/A/hard_link $MNT/tmp
        mv $MNT/file2 $MNT/A/hard_link
      
        # Now rename inode 260 to something else (B for example) and then create
        # a hard link for inode 258 that has the old name and location of inode
        # 260 ("/A").
        mv $MNT/A $MNT/B
        ln $MNT/B/hard_link $MNT/A
      
        # Filesystem now looks like:
        #
        # .                                     (ino 256)
        # |----- tmp                            (ino 259)
        # |----- file3                          (ino 259)
        # |----- B/                             (ino 260)
        # |      |---- file1                    (ino 257)
        # |      |---- hard_link                (ino 258)
        # |
        # |----- A                              (ino 258)
      
        # Create another snapshot of our subvolume and use it for an incremental
        # send.
        btrfs subvolume snapshot -r $MNT $MNT/snap2
        btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
      
        # Now unmount the filesystem, create a new one, mount it and try to
        # apply both send streams to recreate both snapshots.
        umount $DEV
      
        mkfs.btrfs -f $DEV >/dev/null
      
        mount $DEV $MNT
      
        # First add the first snapshot to the new filesystem by applying the
        # first send stream.
        btrfs receive -f /tmp/snap1.send $MNT
      
        # The incremental receive operation below used to fail with the
        # following error:
        #
        #    ERROR: unlink A/hard_link failed: No such file or directory
        #
        # This is because when send is processing inode 257, it generates the
        # path for inode 260 as "/A", since that inode is its parent in the send
        # snapshot, and caches that path.
        #
        # Later when processing inode 258, it first processes its new reference
        # that has the path of "/A", which results in orphanizing inode 260
        # because there is a a path collision. This results in issuing a rename
        # operation from "/A" to "/o260-6-0".
        #
        # Finally when processing the new reference "B/hard_link" for inode 258,
        # it notices that it collides with inode 259 (not yet processed, because
        # it has a higher inode number), since that inode has the name
        # "hard_link" under the directory inode 260. It also checks that inode
        # 259 has two hardlinks, so it decides to issue a unlink operation for
        # the name "hard_link" for inode 259. However the path passed to the
        # unlink operation is "/A/hard_link", which is incorrect since currently
        # "/A" does not exists, due to the orphanization of inode 260 mentioned
        # before. The path is incorrect because it was computed and cached
        # before the orphanization. This results in the receiver to fail with
        # the above error.
        btrfs receive -f /tmp/snap2.send $MNT
      
        umount $MNT
      
      When running the test, it fails like this:
      
        $ ./test-send-unlink.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: unlink A/hard_link failed: No such file or directory
      
      Fix this by recomputing a path before issuing an unlink operation when
      processing the new references for the current inode if we previously
      have orphanized a directory.
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d8ac76cd
  15. 28 Apr, 2021 1 commit
    • Filipe Manana's avatar
      btrfs: fix deadlock when cloning inline extents and using qgroups · f9baa501
      Filipe Manana authored
      There are a few exceptional cases where cloning an inline extent needs to
      copy the inline extent data into a page of the destination inode.
      
      When this happens, we end up starting a transaction while having a dirty
      page for the destination inode and while having the range locked in the
      destination's inode iotree too. Because when reserving metadata space
      for a transaction we may need to flush existing delalloc in case there is
      not enough free space, we have a mechanism in place to prevent a deadlock,
      which was introduced in commit 3d45f221 ("btrfs: fix deadlock when
      cloning inline extent and low on free metadata space").
      
      However when using qgroups, a transaction also reserves metadata qgroup
      space, which can also result in flushing delalloc in case there is not
      enough available space at the moment. When this happens we deadlock, since
      flushing delalloc requires locking the file range in the inode's iotree
      and the range was already locked at the very beginning of the clone
      operation, before attempting to start the transaction.
      
      When this issue happens, stack traces like the following are reported:
      
        [72747.556262] task:kworker/u81:9   state:D stack:    0 pid:  225 ppid:     2 flags:0x00004000
        [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
        [72747.556271] Call Trace:
        [72747.556273]  __schedule+0x296/0x760
        [72747.556277]  schedule+0x3c/0xa0
        [72747.556279]  io_schedule+0x12/0x40
        [72747.556284]  __lock_page+0x13c/0x280
        [72747.556287]  ? generic_file_readonly_mmap+0x70/0x70
        [72747.556325]  extent_write_cache_pages+0x22a/0x440 [btrfs]
        [72747.556331]  ? __set_page_dirty_nobuffers+0xe7/0x160
        [72747.556358]  ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
        [72747.556362]  ? update_group_capacity+0x25/0x210
        [72747.556366]  ? cpumask_next_and+0x1a/0x20
        [72747.556391]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.556394]  do_writepages+0x41/0xd0
        [72747.556398]  __writeback_single_inode+0x39/0x2a0
        [72747.556403]  writeback_sb_inodes+0x1ea/0x440
        [72747.556407]  __writeback_inodes_wb+0x5f/0xc0
        [72747.556410]  wb_writeback+0x235/0x2b0
        [72747.556414]  ? get_nr_inodes+0x35/0x50
        [72747.556417]  wb_workfn+0x354/0x490
        [72747.556420]  ? newidle_balance+0x2c5/0x3e0
        [72747.556424]  process_one_work+0x1aa/0x340
        [72747.556426]  worker_thread+0x30/0x390
        [72747.556429]  ? create_worker+0x1a0/0x1a0
        [72747.556432]  kthread+0x116/0x130
        [72747.556435]  ? kthread_park+0x80/0x80
        [72747.556438]  ret_from_fork+0x1f/0x30
      
        [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [72747.566961] Call Trace:
        [72747.566964]  __schedule+0x296/0x760
        [72747.566968]  ? finish_wait+0x80/0x80
        [72747.566970]  schedule+0x3c/0xa0
        [72747.566995]  wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
        [72747.566999]  ? finish_wait+0x80/0x80
        [72747.567024]  lock_extent_bits+0x37/0x90 [btrfs]
        [72747.567047]  btrfs_invalidatepage+0x299/0x2c0 [btrfs]
        [72747.567051]  ? find_get_pages_range_tag+0x2cd/0x380
        [72747.567076]  __extent_writepage+0x203/0x320 [btrfs]
        [72747.567102]  extent_write_cache_pages+0x2bb/0x440 [btrfs]
        [72747.567106]  ? update_load_avg+0x7e/0x5f0
        [72747.567109]  ? enqueue_entity+0xf4/0x6f0
        [72747.567134]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.567137]  ? enqueue_task_fair+0x93/0x6f0
        [72747.567140]  do_writepages+0x41/0xd0
        [72747.567144]  __filemap_fdatawrite_range+0xc7/0x100
        [72747.567167]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [72747.567195]  btrfs_work_helper+0xc2/0x300 [btrfs]
        [72747.567200]  process_one_work+0x1aa/0x340
        [72747.567202]  worker_thread+0x30/0x390
        [72747.567205]  ? create_worker+0x1a0/0x1a0
        [72747.567208]  kthread+0x116/0x130
        [72747.567211]  ? kthread_park+0x80/0x80
        [72747.567214]  ret_from_fork+0x1f/0x30
      
        [72747.569686] task:fsstress        state:D stack:    0 pid:841421 ppid:841417 flags:0x00000000
        [72747.569689] Call Trace:
        [72747.569691]  __schedule+0x296/0x760
        [72747.569694]  schedule+0x3c/0xa0
        [72747.569721]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569725]  ? finish_wait+0x80/0x80
        [72747.569753]  btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
        [72747.569781]  btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
        [72747.569804]  btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
        [72747.569810]  ? path_lookupat.isra.48+0x97/0x140
        [72747.569833]  btrfs_file_write_iter+0x81/0x410 [btrfs]
        [72747.569836]  ? __kmalloc+0x16a/0x2c0
        [72747.569839]  do_iter_readv_writev+0x160/0x1c0
        [72747.569843]  do_iter_write+0x80/0x1b0
        [72747.569847]  vfs_writev+0x84/0x140
        [72747.569869]  ? btrfs_file_llseek+0x38/0x270 [btrfs]
        [72747.569873]  do_writev+0x65/0x100
        [72747.569876]  do_syscall_64+0x33/0x40
        [72747.569879]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [72747.569899] task:fsstress        state:D stack:    0 pid:841424 ppid:841417 flags:0x00004000
        [72747.569903] Call Trace:
        [72747.569906]  __schedule+0x296/0x760
        [72747.569909]  schedule+0x3c/0xa0
        [72747.569936]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569940]  ? finish_wait+0x80/0x80
        [72747.569967]  __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
        [72747.569989]  start_transaction+0x279/0x580 [btrfs]
        [72747.570014]  clone_copy_inline_extent+0x332/0x490 [btrfs]
        [72747.570041]  btrfs_clone+0x5b7/0x7a0 [btrfs]
        [72747.570068]  ? lock_extent_bits+0x64/0x90 [btrfs]
        [72747.570095]  btrfs_clone_files+0xfc/0x150 [btrfs]
        [72747.570122]  btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
        [72747.570126]  do_clone_file_range+0xed/0x200
        [72747.570131]  vfs_clone_file_range+0x37/0x110
        [72747.570134]  ioctl_file_clone+0x7d/0xb0
        [72747.570137]  do_vfs_ioctl+0x138/0x630
        [72747.570140]  __x64_sys_ioctl+0x62/0xc0
        [72747.570143]  do_syscall_64+0x33/0x40
        [72747.570146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      So fix this by skipping the flush of delalloc for an inode that is
      flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under
      such a special case of cloning an inline extent, when flushing delalloc
      during qgroup metadata reservation.
      
      The special cases for cloning inline extents were added in kernel 5.7 by
      by commit 05a5a762 ("Btrfs: implement full reflink support for
      inline extents"), while having qgroup metadata space reservation flushing
      delalloc when low on space was added in kernel 5.9 by commit
      c53e9653
      
       ("btrfs: qgroup: try to flush qgroup space when we get
      -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable
      kernel backports.
      Reported-by: default avatarWang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
      Fixes: c53e9653
      
       ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      CC: stable@vger.kernel.org # 5.9+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f9baa501