- 25 Jul, 2022 9 commits
-
-
David Sterba authored
The whole send operation is restartable and handling properly a buffer write may not be easy. We can't know what caused that and if a short delay and retry will fix it or how many retries should be performed in case it's a temporary condition. The error value is returned to the ioctl caller so in case it's transient problem, the user would be notified about the reason. Remove the TODO note as there's no plan to handle ERESTARTSYS. Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
We don't need the whole ctree.h in send.h, none of the data types defined there are used. Signed-off-by:
David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
Now that the new support is implemented, allow the ioctl to accept v2 and the compressed flag, and update the version in sysfs. Signed-off-by:
Omar Sandoval <osandov@fb.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
Now that all of the pieces are in place, we can use the ENCODED_WRITE command to send compressed extents when appropriate. Signed-off-by:
Omar Sandoval <osandov@fb.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
For encoded writes in send v2, we will get the encoded data with btrfs_encoded_read_regular_fill_pages(), which expects a list of raw pages. To avoid extra buffers and copies, we should read directly into the send buffer. Therefore, we need the raw pages for the send buffer. We currently allocate the send buffer with kvmalloc(), which may return a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page(). However, the buffer size we use (144K) is not a power of two, which in theory is not guaranteed to return a page-aligned buffer, and in practice would waste a lot of memory due to rounding up to the next power of two. 144K is large enough that it usually gets allocated with vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc() and save the pages in an array. Signed-off-by:
Omar Sandoval <osandov@fb.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
The length field of the send stream TLV header is 16 bits. This means that the maximum amount of data that can be sent for one write is 64K minus one. However, encoded writes must be able to send the maximum compressed extent (128K) in one command, or more. To support this, send stream version 2 encodes the DATA attribute differently: it has no length field, and the length is implicitly up to the end of containing command (which has a 32bit length field). Although this is necessary for encoded writes, normal writes can benefit from it, too. Also add a check to enforce that the DATA attribute is last. It is only strictly necessary for v2, but we might as well make v1 consistent with it. For v2, let's bump up the send buffer to the maximum compressed extent size plus 16K for the other metadata (144K total). Since this will most likely be vmalloc'd (and always will be after the next commit), we round it up to the next page since we might as well use the rest of the page on systems with >16K pages. Reviewed-by:
Nikolay Borisov <nborisov@suse.com> Signed-off-by:
Omar Sandoval <osandov@fb.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
This adds the definitions of the new commands for send stream version 2 and their respective attributes: fallocate, FS_IOC_SETFLAGS (a.k.a. chattr), and encoded writes. It also documents two changes to the send stream format in v2: the receiver shouldn't assume a maximum command size, and the DATA attribute is encoded differently to allow for writes larger than 64k. These will be implemented in subsequent changes, and then the ioctl will accept the new version and flag. Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Signed-off-by:
Omar Sandoval <osandov@fb.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
Commit e77fbf99 ("btrfs: send: prepare for v2 protocol") added _BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the version plus 1, but as written this creates gaps in the number space. The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2 has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that 23 and 24 are valid commands. Instead, let's explicitly number all of the commands, attributes, and sentinel MAX constants. Signed-off-by:
Omar Sandoval <osandov@fb.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
We collect these statistics but have never exposed them in any way. I also didn't find any patches that ever attempted to make use of them. Signed-off-by:
Omar Sandoval <osandov@fb.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 15 Jul, 2022 1 commit
-
-
David Sterba authored
This reverts commit 40769420. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by:
Filipe Manana <fdmanana@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 17 May, 2022 1 commit
-
-
Filipe Manana authored
A send operation reads extent data using the buffered IO path for getting extent data to send in write commands and this is both because it's simple and to make use of the generic readahead infrastructure, which results in a massive speedup. However this fills the page cache with data that, most of the time, is really only used by the send operation - once the write commands are sent, it's not useful to have the data in the page cache anymore. For large snapshots, bringing all data into the page cache eventually leads to the need to evict other data from the page cache that may be more useful for applications (and kernel subsystems). Even if extents are shared with the subvolume on which a snapshot is based on and the data is currently on the page cache due to being read through the subvolume, attempting to read the data through the snapshot will always result in bringing a new copy of the data into another location in the page cache (there's currently no shared memory for shared extents). So make send evict the data it has read before if when it first opened the inode, its mapping had no pages currently loaded: when inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding based on the return value of filemap_range_has_page() before reading an extent because the generic readahead mechanism may read pages beyond the range we request (and it very often does it), which means a call to filemap_range_has_page() will return true due to the readahead that was triggered when processing a previous extent - we don't have a simple way to distinguish this case from the case where the data was brought into the page cache through someone else. So checking for the mapping number of pages being 0 when we first open the inode is simple, cheap and it generally accomplishes the goal of not trashing the page cache - the only exception is if part of data was previously loaded into the page cache through the snapshot by some other process, in that case we end up not evicting any data send brings into the page cache, just like before this change - but that however is not the common case. Example scenario, on a box with 32G of RAM: $ btrfs subvolume create /mnt/sv1 $ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1 $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1 $ free -m total used free shared buff/cache available Mem: 31937 186 26866 0 4883 31297 Swap: 8188 0 8188 # After this we get less 4G of free memory. $ btrfs send /mnt/snap1 >/dev/null $ free -m total used free shared buff/cache available Mem: 31937 186 22814 0 8935 31297 Swap: 8188 0 8188 The same, obviously, applies to an incremental send. Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 16 May, 2022 9 commits
-
-
Filipe Manana authored
Every time we send a write command, we open the inode, read some data to a buffer and then close the inode. The amount of data we read for each write command is at most 48K, returned by max_send_read_size(), and that corresponds to: BTRFS_SEND_BUF_SIZE - 16K = 48K. In practice this does not add any significant overhead, because the time elapsed between every close (iput()) and open (btrfs_iget()) is very short, so the inode is kept in the VFS's cache after the iput() and it's still there by the time we do the next btrfs_iget(). As between processing extents of the current inode we don't do anything else, it makes sense to keep the inode open after we process its first extent that needs to be sent and keep it open until we start processing the next inode. This serves to facilitate the next change, which aims to avoid having send operations trash the page cache with data extents. Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Gabriel Niebler authored
… and adjust all usages of this object to use the XArray API for the sake of consistency. XArray API provides array semantics, so it is notionally easier to use and understand, and it also takes care of locking for us. None of this makes a real difference in this particular patch, but it does in other places where similar replacements are or have been made and we want to be consistent in our usage of data structures in btrfs. Signed-off-by:
Gabriel Niebler <gniebler@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Gabriel Niebler authored
This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by:
Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by:
Gabriel Niebler <gniebler@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Gabriel Niebler authored
This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by:
Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by:
Gabriel Niebler <gniebler@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Gabriel Niebler authored
This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by:
Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by:
Gabriel Niebler <gniebler@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Gabriel Niebler authored
This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by:
Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by:
Gabriel Niebler <gniebler@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Gabriel Niebler authored
This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by:
Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by:
Gabriel Niebler <gniebler@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Gabriel Niebler authored
This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by:
Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by:
Gabriel Niebler <gniebler@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Gabriel Niebler authored
This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by:
Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by:
Gabriel Niebler <gniebler@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 09 May, 2022 1 commit
-
-
Matthew Wilcox (Oracle) authored
This is a "weak" conversion which converts straight back to using pages. A full conversion should be performed at some point, hopefully by someone familiar with the filesystem. Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org>
-
- 08 May, 2022 1 commit
-
-
Matthew Wilcox (Oracle) authored
Removes a couple of calls to compound_head and saves a few bytes. Also convert verity's read_file_data_page() to be folio-based. Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Christoph Hellwig <hch@lst.de>
-
- 14 Mar, 2022 2 commits
-
-
Minghao Chi authored
Return value from fs_path_add_path() directly instead of taking this in another redundant variable. Reported-by:
Zeal Robot <zealci@zte.com.cn> Reviewed-by:
Nikolay Borisov <nborisov@suse.com> Signed-off-by:
Minghao Chi <chi.minghao@zte.com.cn> Signed-off-by:
CGEL ZTE <cgel.zte@gmail.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Sahil Kang authored
btrfs_ioctl extracts inode from file so we can pass that into the callbacks. Signed-off-by:
Sahil Kang <sahil.kang@asilaycomputing.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 09 Feb, 2022 1 commit
-
-
Dāvis Mosāns authored
Currently if we get IO error while doing send then we abort without logging information about which file caused issue. So log it to help with debugging. CC: stable@vger.kernel.org # 4.9+ Signed-off-by:
Dāvis Mosāns <davispuh@gmail.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 07 Jan, 2022 1 commit
-
-
Filipe Manana authored
We don't allow send and balance/relocation to run in parallel in order to prevent send failing or silently producing some bad stream. This is because while send is using an extent (specially metadata) or about to read a metadata extent and expecting it belongs to a specific parent node, relocation can run, the transaction used for the relocation is committed and the extent gets reallocated while send is still using the extent, so it ends up with a different content than expected. This can result in just failing to read a metadata extent due to failure of the validation checks (parent transid, level, etc), failure to find a backreference for a data extent, and other unexpected failures. Besides reallocation, there's also a similar problem of an extent getting discarded when it's unpinned after the transaction used for block group relocation is committed. The restriction between balance and send was added in commit 9e967495 ("Btrfs: prevent send failures and crashes due to concurrent relocation"), kernel 5.3, while the more general restriction between send and relocation was added in commit 1cea5cf0 ("btrfs: ensure relocation never runs while we have send operations running"), kernel 5.14. Both send and relocation can be very long running operations. Relocation because it has to do a lot of IO and expensive backreference lookups in case there are many snapshots, and send due to read IO when operating on very large trees. This makes it inconvenient for users and tools to deal with scheduling both operations. For zoned filesystem we also have automatic block group relocation, so send can fail with -EAGAIN when users least expect it or send can end up delaying the block group relocation for too long. In the future we might also get the automatic block group relocation for non zoned filesystems. This change makes it possible for send and relocation to run in parallel. This is achieved the following way: 1) For all tree searches, send acquires a read lock on the commit root semaphore; 2) After each tree search, and before releasing the commit root semaphore, the leaf is cloned and placed in the search path (struct btrfs_path); 3) After releasing the commit root semaphore, the changed_cb() callback is invoked, which operates on the leaf and writes commands to the pipe (or file in case send/receive is not used with a pipe). It's important here to not hold a lock on the commit root semaphore, because if we did we could deadlock when sending and receiving to the same filesystem using a pipe - the send task blocks on the pipe because it's full, the receive task, which is the only consumer of the pipe, triggers a transaction commit when attempting to create a subvolume or reserve space for a write operation for example, but the transaction commit blocks trying to write lock the commit root semaphore, resulting in a deadlock; 4) Before moving to the next key, or advancing to the next change in case of an incremental send, check if a transaction used for relocation was committed (or is about to finish its commit). If so, release the search path(s) and restart the search, to where we were before, so that we don't operate on stale extent buffers. The search restarts are always possible because both the send and parent roots are RO, and no one can add, remove of update keys (change their offset) in RO trees - the only exception is deduplication, but that is still not allowed to run in parallel with send; 5) Periodically check if there is contention on the commit root semaphore, which means there is a transaction commit trying to write lock it, and release the semaphore and reschedule if there is contention, so as to avoid causing any significant delays to transaction commits. This leaves some room for optimizations for send to have less path releases and re searching the trees when there's relocation running, but for now it's kept simple as it performs quite well (on very large trees with resulting send streams in the order of a few hundred gigabytes). Test case btrfs/187, from fstests, stresses relocation, send and deduplication attempting to run in parallel, but without verifying if send succeeds and if it produces correct streams. A new test case will be added that exercises relocation happening in parallel with send and then checks that send succeeds and the resulting streams are correct. A final note is that for now this still leaves the mutual exclusion between send operations and deduplication on files belonging to a root used by send operations. A solution for that will be slightly more complex but it will eventually be built on top of this change. Signed-off-by:
Filipe Manana <fdmanana@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 03 Jan, 2022 4 commits
-
-
Omar Sandoval authored
Again, I don't think this was ever used since iterate_dir_item() is only used for xattrs. No functional change. Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Reviewed-by:
Nikolay Borisov <nborisov@suse.com> Signed-off-by:
Omar Sandoval <osandov@fb.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
As far as I can tell, this was never used. No functional change. Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Reviewed-by:
Nikolay Borisov <nborisov@suse.com> Signed-off-by:
Omar Sandoval <osandov@fb.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Josef Bacik authored
Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_*(), and then rework the _nr() helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers. Signed-off-by:
Josef Bacik <josef@toxicpanda.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We have this pattern in a lot of places item = btrfs_item_nr(slot); btrfs_item_size(leaf, item); when we could simply use btrfs_item_size(leaf, slot); Fix all callers of btrfs_item_size() and btrfs_item_offset() to use the _nr variation of the helpers. Reviewed-by:
Qu Wenruo <wqu@suse.com> Signed-off-by:
Josef Bacik <josef@toxicpanda.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 29 Oct, 2021 1 commit
-
-
David Sterba authored
This is preparatory work for send protocol update to version 2 and higher. We have many pending protocol update requests but still don't have the basic protocol rev in place, the first thing that must happen is to do the actual versioning support. The protocol version is u32 and is a new member in the send ioctl struct. Validity of the version field is backed by a new flag bit. Old kernels would fail when a higher version is requested. Version protocol 0 will pick the highest supported version, BTRFS_SEND_STREAM_VERSION, that's also exported in sysfs. The version is still unchanged and will be increased once we have new incompatible commands or stream updates. Reviewed-by:
Nikolay Borisov <nborisov@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 25 Oct, 2021 1 commit
-
-
Marcos Paulo de Souza authored
The out label is being overused, we can simply return if the condition permits. No functional changes. Reviewed-by:
Su Yue <l@damenly.su> Reviewed-by:
Nikolay Borisov <nborisov@suse.com> Signed-off-by:
Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 23 Aug, 2021 2 commits
-
-
Goldwyn Rodrigues authored
Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx on stack. The size is reasonably small. sizeof(backref_ctx) = 48 Reviewed-by:
Anand Jain <anand.jain@oracle.com> Signed-off-by:
Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
Comparators just read the data and thus get const parameters. This should be also preserved by the local variables, update all comparators passed to sort or bsearch. Cleanups: - unnecessary casts are dropped - btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern and 'inline' is dropped as the function address is taken Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 22 Jun, 2021 5 commits
-
-
Filipe Manana authored
When doing a send we don't expect the task to ever start a transaction after the initial check that verifies if commit roots match the regular roots. This is because after that we set current->journal_info with a stub (special value) that signals we are in send context, so that we take a read lock on an extent buffer when reading it from disk and verifying it is valid (its generation matches the generation stored in the parent). This stub was introduced in 2014 by commit a26e8c9f ("Btrfs: don't clear uptodate if the eb is under IO") in order to fix a concurrency issue between send and balance. However there is one particular exception where we end up needing to start a transaction and when this happens it results in a crash with a stack trace like the following: [60015.902283] kernel: WARNING: CPU: 3 PID: 58159 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x21/0x80 [60015.902292] kernel: Modules linked in: uinput rfcomm snd_seq_dummy (...) [60015.902384] kernel: CPU: 3 PID: 58159 Comm: btrfs Not tainted 5.12.9-300.fc34.x86_64 #1 [60015.902387] kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XN-WIFI, BIOS F6 12/24/2015 [60015.902389] kernel: RIP: 0010:kfence_protect_page+0x21/0x80 [60015.902393] kernel: Code: ff 0f 1f 84 00 00 00 00 00 55 48 89 fd (...) [60015.902396] kernel: RSP: 0018:ffff9fb583453220 EFLAGS: 00010246 [60015.902399] kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fb583453224 [60015.902401] kernel: RDX: ffff9fb583453224 RSI: 0000000000000000 RDI: 0000000000000000 [60015.902402] kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [60015.902404] kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002 [60015.902406] kernel: R13: ffff9fb583453348 R14: 0000000000000000 R15: 0000000000000001 [60015.902408] kernel: FS: 00007f158e62d8c0(0000) GS:ffff93bd37580000(0000) knlGS:0000000000000000 [60015.902410] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [60015.902412] kernel: CR2: 0000000000000039 CR3: 00000001256d2000 CR4: 00000000000506e0 [60015.902414] kernel: Call Trace: [60015.902419] kernel: kfence_unprotect+0x13/0x30 [60015.902423] kernel: page_fault_oops+0x89/0x270 [60015.902427] kernel: ? search_module_extables+0xf/0x40 [60015.902431] kernel: ? search_bpf_extables+0x57/0x70 [60015.902435] kernel: kernelmode_fixup_or_oops+0xd6/0xf0 [60015.902437] kernel: __bad_area_nosemaphore+0x142/0x180 [60015.902440] kernel: exc_page_fault+0x67/0x150 [60015.902445] kernel: asm_exc_page_fault+0x1e/0x30 [60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580 [60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 (...) [60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246 [60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000 [60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039 [60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000 [60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001 [60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000 [60015.902468] kernel: btrfs_commit_inode_delayed_inode+0x5d/0x120 [60015.902473] kernel: btrfs_evict_inode+0x2c5/0x3f0 [60015.902476] kernel: evict+0xd1/0x180 [60015.902480] kernel: inode_lru_isolate+0xe7/0x180 [60015.902483] kernel: __list_lru_walk_one+0x77/0x150 [60015.902487] kernel: ? iput+0x1a0/0x1a0 [60015.902489] kernel: ? iput+0x1a0/0x1a0 [60015.902491] kernel: list_lru_walk_one+0x47/0x70 [60015.902495] kernel: prune_icache_sb+0x39/0x50 [60015.902497] kernel: super_cache_scan+0x161/0x1f0 [60015.902501] kernel: do_shrink_slab+0x142/0x240 [60015.902505] kernel: shrink_slab+0x164/0x280 [60015.902509] kernel: shrink_node+0x2c8/0x6e0 [60015.902512] kernel: do_try_to_free_pages+0xcb/0x4b0 [60015.902514] kernel: try_to_free_pages+0xda/0x190 [60015.902516] kernel: __alloc_pages_slowpath.constprop.0+0x373/0xcc0 [60015.902521] kernel: ? __memcg_kmem_charge_page+0xc2/0x1e0 [60015.902525] kernel: __alloc_pages_nodemask+0x30a/0x340 [60015.902528] kernel: pipe_write+0x30b/0x5c0 [60015.902531] kernel: ? set_next_entity+0xad/0x1e0 [60015.902534] kernel: ? switch_mm_irqs_off+0x58/0x440 [60015.902538] kernel: __kernel_write+0x13a/0x2b0 [60015.902541] kernel: kernel_write+0x73/0x150 [60015.902543] kernel: send_cmd+0x7b/0xd0 [60015.902545] kernel: send_extent_data+0x5a3/0x6b0 [60015.902549] kernel: process_extent+0x19b/0xed0 [60015.902551] kernel: btrfs_ioctl_send+0x1434/0x17e0 [60015.902554] kernel: ? _btrfs_ioctl_send+0xe1/0x100 [60015.902557] kernel: _btrfs_ioctl_send+0xbf/0x100 [60015.902559] kernel: ? enqueue_entity+0x18c/0x7b0 [60015.902562] kernel: btrfs_ioctl+0x185f/0x2f80 [60015.902564] kernel: ? psi_task_change+0x84/0xc0 [60015.902569] kernel: ? _flat_send_IPI_mask+0x21/0x40 [60015.902572] kernel: ? check_preempt_curr+0x2f/0x70 [60015.902576] kernel: ? selinux_file_ioctl+0x137/0x1e0 [60015.902579] kernel: ? expand_files+0x1cb/0x1d0 [60015.902582] kernel: ? __x64_sys_ioctl+0x82/0xb0 [60015.902585] kernel: __x64_sys_ioctl+0x82/0xb0 [60015.902588] kernel: do_syscall_64+0x33/0x40 [60015.902591] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae [60015.902595] kernel: RIP: 0033:0x7f158e38f0ab [60015.902599] kernel: Code: ff ff ff 85 c0 79 9b (...) [60015.902602] kernel: RSP: 002b:00007ffcb2519bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [60015.902605] kernel: RAX: ffffffffffffffda RBX: 00007ffcb251ae00 RCX: 00007f158e38f0ab [60015.902607] kernel: RDX: 00007ffcb2519cf0 RSI: 0000000040489426 RDI: 0000000000000004 [60015.902608] kernel: RBP: 0000000000000004 R08: 00007f158e297640 R09: 00007f158e297640 [60015.902610] kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000 [60015.902612] kernel: R13: 0000000000000002 R14: 00007ffcb251aee0 R15: 0000558c1a83e2a0 [60015.902615] kernel: ---[ end trace 7bbc33e23bb887ae ]--- This happens because when writing to the pipe, by calling kernel_write(), we end up doing page allocations using GFP_HIGHUSER | __GFP_ACCOUNT as the gfp flags, which allow reclaim to happen if there is memory pressure. This allocation happens at fs/pipe.c:pipe_write(). If the reclaim is triggered, inode eviction can be triggered and that in turn can result in starting a transaction if the inode has a link count of 0. The transaction start happens early on during eviction, when we call btrfs_commit_inode_delayed_inode() at btrfs_evict_inode(). This happens if there is currently an open file descriptor for an inode with a link count of 0 and the reclaim task gets a reference on the inode before that descriptor is closed, in which case the reclaim task ends up doing the final iput that triggers the inode eviction. When we have assertions enabled (CONFIG_BTRFS_ASSERT=y), this triggers the following assertion at transaction.c:start_transaction(): /* Send isn't supposed to start transactions. */ ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB); And when assertions are not enabled, it triggers a crash since after that assertion we cast current->journal_info into a transaction handle pointer and then dereference it: if (current->journal_info) { WARN_ON(type & TRANS_EXTWRITERS); h = current->journal_info; refcount_inc(&h->use_count); (...) Which obviously results in a crash due to an invalid memory access. The same type of issue can happen during other memory allocations we do directly in the send code with kmalloc (and friends) as they use GFP_KERNEL and therefore may trigger reclaim too, which started to happen since 2016 after commit e780b0d1 ("btrfs: send: use GFP_KERNEL everywhere"). The issue could be solved by setting up a NOFS context for the entire send operation so that reclaim could not be triggered when allocating memory or pages through kernel_write(). However that is not very friendly and we can in fact get rid of the send stub because: 1) The stub was introduced way back in 2014 by commit a26e8c9f ("Btrfs: don't clear uptodate if the eb is under IO") to solve an issue exclusive to when send and balance are running in parallel, however there were other problems between balance and send and we do not allow anymore to have balance and send run concurrently since commit 9e967495 ("Btrfs: prevent send failures and crashes due to concurrent relocation"). More generically the issues are between send and relocation, and that last commit eliminated only the possibility of having send and balance run concurrently, but shrinking a device also can trigger relocation, and on zoned filesystems we have relocation of partially used block groups triggered automatically as well. The previous patch that has a subject of: "btrfs: ensure relocation never runs while we have send operations running" Addresses all the remaining cases that can trigger relocation. 2) We can actually allow starting and even committing transactions while in a send context if needed because send is not holding any locks that would block the start or the commit of a transaction. So get rid of all the logic added by commit a26e8c9f ("Btrfs: don't clear uptodate if the eb is under IO"). We can now always call clear_extent_buffer_uptodate() at verify_parent_transid() since send is the only case that uses commit roots without having a transaction open or without holding the commit_root_sem. Reported-by:
Chris Murphy <lists@colorremedies.com> Link: https://lore.kernel.org/linux-btrfs/CAJCQCtRQ57=qXo3kygwpwEBOU_CA_eKvdmjP52sU=eFvuVOEGw@mail.gmail.com/ Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Relocation and send do not play well together because while send is running a block group can be relocated, a transaction committed and the respective disk extents get re-allocated and written to or discarded while send is about to do something with the extents. This was explained in commit 9e967495 ("Btrfs: prevent send failures and crashes due to concurrent relocation"), which prevented balance and send from running in parallel but it did not address one remaining case where chunk relocation can happen: shrinking a device (and device deletion which shrinks a device's size to 0 before deleting the device). We also have now one more case where relocation is triggered: on zoned filesystems partially used block groups get relocated by a background thread, introduced in commit 18bb8bbf ("btrfs: zoned: automatically reclaim zones"). So make sure that instead of preventing balance from running when there are ongoing send operations, we prevent relocation from happening. This uses the infrastructure recently added by a patch that has the subject: "btrfs: add cancellable chunk relocation support". Also it adds a spinlock used exclusively for the exclusivity between send and relocation, as before fs_info->balance_mutex was used, which would make an attempt to run send to block waiting for balance to finish, which can take a lot of time on large filesystems. Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
Fix typos that have snuck in since the last round. Found by codespell. Signed-off-by:
David Sterba <dsterba@suse.com>
-
Baokun Li authored
Use list_move_tail() instead of list_del() + list_add_tail() as it's doing the same thing and allows further cleanups. Open code name_cache_used() as there is only one user. Reported-by:
Hulk Robot <hulkci@huawei.com> Signed-off-by:
Baokun Li <libaokun1@huawei.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Filipe Manana authored
During an incremental send operation, when processing the new references for the current inode, we might send an unlink operation for another inode that has a conflicting path and has more than one hard link. However this path was computed and cached before we processed previous new references for the current inode. We may have orphanized a directory of that path while processing a previous new reference, in which case the path will be invalid and cause the receiver process to fail. The following reproducer triggers the problem and explains how/why it happens in its comments: $ cat test-send-unlink.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV >/dev/null mount $DEV $MNT # Create our test files and directory. Inode 259 (file3) has two hard # links. touch $MNT/file1 touch $MNT/file2 touch $MNT/file3 mkdir $MNT/A ln $MNT/file3 $MNT/A/hard_link # Filesystem looks like: # # . (ino 256) # |----- file1 (ino 257) # |----- file2 (ino 258) # |----- file3 (ino 259) # |----- A/ (ino 260) # |---- hard_link (ino 259) # # Now create the base snapshot, which is going to be the parent snapshot # for a later incremental send. btrfs subvolume snapshot -r $MNT $MNT/snap1 btrfs send -f /tmp/snap1.send $MNT/snap1 # Move inode 257 into directory inode 260. This results in computing the # path for inode 260 as "/A" and caching it. mv $MNT/file1 $MNT/A/file1 # Move inode 258 (file2) into directory inode 260, with a name of # "hard_link", moving first inode 259 away since it currently has that # location and name. mv $MNT/A/hard_link $MNT/tmp mv $MNT/file2 $MNT/A/hard_link # Now rename inode 260 to something else (B for example) and then create # a hard link for inode 258 that has the old name and location of inode # 260 ("/A"). mv $MNT/A $MNT/B ln $MNT/B/hard_link $MNT/A # Filesystem now looks like: # # . (ino 256) # |----- tmp (ino 259) # |----- file3 (ino 259) # |----- B/ (ino 260) # | |---- file1 (ino 257) # | |---- hard_link (ino 258) # | # |----- A (ino 258) # Create another snapshot of our subvolume and use it for an incremental # send. btrfs subvolume snapshot -r $MNT $MNT/snap2 btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2 # Now unmount the filesystem, create a new one, mount it and try to # apply both send streams to recreate both snapshots. umount $DEV mkfs.btrfs -f $DEV >/dev/null mount $DEV $MNT # First add the first snapshot to the new filesystem by applying the # first send stream. btrfs receive -f /tmp/snap1.send $MNT # The incremental receive operation below used to fail with the # following error: # # ERROR: unlink A/hard_link failed: No such file or directory # # This is because when send is processing inode 257, it generates the # path for inode 260 as "/A", since that inode is its parent in the send # snapshot, and caches that path. # # Later when processing inode 258, it first processes its new reference # that has the path of "/A", which results in orphanizing inode 260 # because there is a a path collision. This results in issuing a rename # operation from "/A" to "/o260-6-0". # # Finally when processing the new reference "B/hard_link" for inode 258, # it notices that it collides with inode 259 (not yet processed, because # it has a higher inode number), since that inode has the name # "hard_link" under the directory inode 260. It also checks that inode # 259 has two hardlinks, so it decides to issue a unlink operation for # the name "hard_link" for inode 259. However the path passed to the # unlink operation is "/A/hard_link", which is incorrect since currently # "/A" does not exists, due to the orphanization of inode 260 mentioned # before. The path is incorrect because it was computed and cached # before the orphanization. This results in the receiver to fail with # the above error. btrfs receive -f /tmp/snap2.send $MNT umount $MNT When running the test, it fails like this: $ ./test-send-unlink.sh Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1' At subvol /mnt/sdi/snap1 Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2' At subvol /mnt/sdi/snap2 At subvol snap1 At snapshot snap2 ERROR: unlink A/hard_link failed: No such file or directory Fix this by recomputing a path before issuing an unlink operation when processing the new references for the current inode if we previously have orphanized a directory. A test case for fstests will follow soon. CC: stable@vger.kernel.org # 4.4+ Signed-off-by:
Filipe Manana <fdmanana@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 28 Apr, 2021 1 commit
-
-
Filipe Manana authored
There are a few exceptional cases where cloning an inline extent needs to copy the inline extent data into a page of the destination inode. When this happens, we end up starting a transaction while having a dirty page for the destination inode and while having the range locked in the destination's inode iotree too. Because when reserving metadata space for a transaction we may need to flush existing delalloc in case there is not enough free space, we have a mechanism in place to prevent a deadlock, which was introduced in commit 3d45f221 ("btrfs: fix deadlock when cloning inline extent and low on free metadata space"). However when using qgroups, a transaction also reserves metadata qgroup space, which can also result in flushing delalloc in case there is not enough available space at the moment. When this happens we deadlock, since flushing delalloc requires locking the file range in the inode's iotree and the range was already locked at the very beginning of the clone operation, before attempting to start the transaction. When this issue happens, stack traces like the following are reported: [72747.556262] task:kworker/u81:9 state:D stack: 0 pid: 225 ppid: 2 flags:0x00004000 [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142) [72747.556271] Call Trace: [72747.556273] __schedule+0x296/0x760 [72747.556277] schedule+0x3c/0xa0 [72747.556279] io_schedule+0x12/0x40 [72747.556284] __lock_page+0x13c/0x280 [72747.556287] ? generic_file_readonly_mmap+0x70/0x70 [72747.556325] extent_write_cache_pages+0x22a/0x440 [btrfs] [72747.556331] ? __set_page_dirty_nobuffers+0xe7/0x160 [72747.556358] ? set_extent_buffer_dirty+0x5e/0x80 [btrfs] [72747.556362] ? update_group_capacity+0x25/0x210 [72747.556366] ? cpumask_next_and+0x1a/0x20 [72747.556391] extent_writepages+0x44/0xa0 [btrfs] [72747.556394] do_writepages+0x41/0xd0 [72747.556398] __writeback_single_inode+0x39/0x2a0 [72747.556403] writeback_sb_inodes+0x1ea/0x440 [72747.556407] __writeback_inodes_wb+0x5f/0xc0 [72747.556410] wb_writeback+0x235/0x2b0 [72747.556414] ? get_nr_inodes+0x35/0x50 [72747.556417] wb_workfn+0x354/0x490 [72747.556420] ? newidle_balance+0x2c5/0x3e0 [72747.556424] process_one_work+0x1aa/0x340 [72747.556426] worker_thread+0x30/0x390 [72747.556429] ? create_worker+0x1a0/0x1a0 [72747.556432] kthread+0x116/0x130 [72747.556435] ? kthread_park+0x80/0x80 [72747.556438] ret_from_fork+0x1f/0x30 [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [72747.566961] Call Trace: [72747.566964] __schedule+0x296/0x760 [72747.566968] ? finish_wait+0x80/0x80 [72747.566970] schedule+0x3c/0xa0 [72747.566995] wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs] [72747.566999] ? finish_wait+0x80/0x80 [72747.567024] lock_extent_bits+0x37/0x90 [btrfs] [72747.567047] btrfs_invalidatepage+0x299/0x2c0 [btrfs] [72747.567051] ? find_get_pages_range_tag+0x2cd/0x380 [72747.567076] __extent_writepage+0x203/0x320 [btrfs] [72747.567102] extent_write_cache_pages+0x2bb/0x440 [btrfs] [72747.567106] ? update_load_avg+0x7e/0x5f0 [72747.567109] ? enqueue_entity+0xf4/0x6f0 [72747.567134] extent_writepages+0x44/0xa0 [btrfs] [72747.567137] ? enqueue_task_fair+0x93/0x6f0 [72747.567140] do_writepages+0x41/0xd0 [72747.567144] __filemap_fdatawrite_range+0xc7/0x100 [72747.567167] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [72747.567195] btrfs_work_helper+0xc2/0x300 [btrfs] [72747.567200] process_one_work+0x1aa/0x340 [72747.567202] worker_thread+0x30/0x390 [72747.567205] ? create_worker+0x1a0/0x1a0 [72747.567208] kthread+0x116/0x130 [72747.567211] ? kthread_park+0x80/0x80 [72747.567214] ret_from_fork+0x1f/0x30 [72747.569686] task:fsstress state:D stack: 0 pid:841421 ppid:841417 flags:0x00000000 [72747.569689] Call Trace: [72747.569691] __schedule+0x296/0x760 [72747.569694] schedule+0x3c/0xa0 [72747.569721] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569725] ? finish_wait+0x80/0x80 [72747.569753] btrfs_qgroup_reserve_data+0x34/0x50 [btrfs] [72747.569781] btrfs_check_data_free_space+0x5f/0xa0 [btrfs] [72747.569804] btrfs_buffered_write+0x1f7/0x7f0 [btrfs] [72747.569810] ? path_lookupat.isra.48+0x97/0x140 [72747.569833] btrfs_file_write_iter+0x81/0x410 [btrfs] [72747.569836] ? __kmalloc+0x16a/0x2c0 [72747.569839] do_iter_readv_writev+0x160/0x1c0 [72747.569843] do_iter_write+0x80/0x1b0 [72747.569847] vfs_writev+0x84/0x140 [72747.569869] ? btrfs_file_llseek+0x38/0x270 [btrfs] [72747.569873] do_writev+0x65/0x100 [72747.569876] do_syscall_64+0x33/0x40 [72747.569879] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [72747.569899] task:fsstress state:D stack: 0 pid:841424 ppid:841417 flags:0x00004000 [72747.569903] Call Trace: [72747.569906] __schedule+0x296/0x760 [72747.569909] schedule+0x3c/0xa0 [72747.569936] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569940] ? finish_wait+0x80/0x80 [72747.569967] __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs] [72747.569989] start_transaction+0x279/0x580 [btrfs] [72747.570014] clone_copy_inline_extent+0x332/0x490 [btrfs] [72747.570041] btrfs_clone+0x5b7/0x7a0 [btrfs] [72747.570068] ? lock_extent_bits+0x64/0x90 [btrfs] [72747.570095] btrfs_clone_files+0xfc/0x150 [btrfs] [72747.570122] btrfs_remap_file_range+0x3d8/0x4a0 [btrfs] [72747.570126] do_clone_file_range+0xed/0x200 [72747.570131] vfs_clone_file_range+0x37/0x110 [72747.570134] ioctl_file_clone+0x7d/0xb0 [72747.570137] do_vfs_ioctl+0x138/0x630 [72747.570140] __x64_sys_ioctl+0x62/0xc0 [72747.570143] do_syscall_64+0x33/0x40 [72747.570146] entry_SYSCALL_64_after_hwframe+0x44/0xa9 So fix this by skipping the flush of delalloc for an inode that is flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under such a special case of cloning an inline extent, when flushing delalloc during qgroup metadata reservation. The special cases for cloning inline extents were added in kernel 5.7 by by commit 05a5a762 ("Btrfs: implement full reflink support for inline extents"), while having qgroup metadata space reservation flushing delalloc when low on space was added in kernel 5.9 by commit c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable kernel backports. Reported-by:
Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/ Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT") CC: stable@vger.kernel.org # 5.9+ Reviewed-by:
Qu Wenruo <wqu@suse.com> Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-