- 21 Jun, 2021 40 commits
-
-
Qu Wenruo authored
[BUG] With current btrfs subpage rw support, the following script can lead to fs hang: $ mkfs.btrfs -f -s 4k $dev $ mount $dev -o nospace_cache $mnt $ fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt The fs will hang at btrfs_start_ordered_extent(). [CAUSE] In above test case, btrfs_invalidate() will be called with the following parameters: offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000 Since @offset is 0, btrfs_invalidate() will try to invalidate the full page, and finally call clear_page_extent_mapped() which will detach subpage structure from the page. And since the page no longer has subpage structure, the subpage dirty bitmap will be cleared, preventing the dirty range from being written back, thus no way to wake up the ordered extent. [FIX] Just follow other filesystems, only to invalidate the page if the range covers the full page. There are cases like truncate_setsize() which can call btrfs_invalidatepage() with offset == 0 and length != 0 for the last page of an inode. Although the old code will still try to invalidate the full page, we are still safe to just wait for ordered extent to finish. So it shouldn't cause extra problems. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] With current subpage RW support, the following script can hang the fs with 64K page size. # mkfs.btrfs -f -s 4k $dev # mount $dev -o nospace_cache $mnt # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt The kernel will do an infinite loop in btrfs_punch_hole_lock_range(). [CAUSE] In btrfs_punch_hole_lock_range() we: - Truncate page cache range - Lock extent io tree - Wait any ordered extents in the range. We exit the loop until we meet all the following conditions: - No ordered extent in the lock range - No page is in the lock range The latter condition has a pitfall, it only works for sector size == PAGE_SIZE case. While can't handle the following subpage case: 0 32K 64K 96K 128K | |///////||//////| || lockstart=32K lockend=96K - 1 In this case, although the range crosses 2 pages, truncate_pagecache_range() will invalidate no page at all, but only zero the [32K, 96K) range of the two pages. Thus filemap_range_has_page(32K, 96K-1) will always return true, thus we will never meet the loop exit condition. [FIX] Fix the problem by doing page alignment for the lock range. Function filemap_range_has_page() has already handled lend < lstart case, we only need to round up @lockstart, and round_down @lockend for truncate_pagecache_range(). This modification should not change any thing for sector size == PAGE_SIZE case, as in that case our range is already page aligned. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The modifications are: - Page copy destination For subpage case, one page can contain multiple sectors, thus we can no longer expect the memcpy_to_page()/btrfs_decompress() to copy data into page offset 0. The correct offset is offset_in_page(file_offset) now, which should handle both regular sectorsize and subpage cases well. - Page status update Now we need to use subpage helper to handle the page status update. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Only set_page_dirty() and SetPageUptodate() is not subpage compatible. Convert them to subpage helpers, so that __extent_writepage_io() can submit page content correctly. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
btrfs_truncate_block() itself is already mostly subpage compatible, the only missing part is the page dirtying code. Currently if we have a sector that needs to be truncated, we set the sector aligned range delalloc, then set the full page dirty. The problem is, current subpage code requires subpage dirty bit to be set, or __extent_writepage_io() won't submit bio, thus leads to ordered extent never to finish. So this patch will make btrfs_truncate_block() to call btrfs_page_set_dirty() helper to replace set_page_dirty() to fix the problem. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
__extent_writepage_io() function originally just iterates through all the extent maps of a page, and submits any regular extents. This is fine for sectorsize == PAGE_SIZE case, as if a page is dirty, we need to submit the only sector contained in the page. But for subpage case, one dirty page can contain several clean sectors with at least one dirty sector. If __extent_writepage_io() still submit all regular extent maps, it can submit data which is already written to disk. And since such already written data won't have corresponding ordered extents, it will trigger a BUG_ON() in btrfs_csum_one_bio(). Change the behavior of __extent_writepage_io() by finding the first dirty byte in the page, and only submit the dirty range other than the full extent. Since we're also here, also modify the following calls to be subpage compatible: - SetPageError() - end_page_writeback() Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Function btrfs_set_range_writeback() currently just sets the page writeback unconditionally. Change it to call the subpage helper so that we can handle both cases well. Since the subpage helpers needs btrfs_fs_info, also change the parameter to accept btrfs_inode. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
In cow_file_range(), after we have succeeded creating an inline extent, we unlock the page with extent_clear_unlock_delalloc() by passing locked_page == NULL. For sectorsize == PAGE_SIZE case, this is just making the page lock and unlock harder to grab. But for incoming subpage case, it can be a big problem. For incoming subpage case, page locking have two entry points: - __process_pages_contig() In that case, we know exactly the range we want to lock (which only requires sector alignment). To handle the subpage requirement, we introduce btrfs_subpage::writers to page::private, and will update it in __process_pages_contig(). - Other directly lock/unlock_page() call sites Those won't touch btrfs_subpage::writers at all. This means, page locked by __process_pages_contig() can only be unlocked by __process_pages_contig(). Thankfully we already have the existing infrastructure in the form of @locked_page in various call sites. Unfortunately, extent_clear_unlock_delalloc() in cow_file_range() after creating an inline extent is the exception. It intentionally call extent_clear_unlock_delalloc() with locked_page == NULL, to also unlock current page (and clear its dirty/writeback bits). To co-operate with incoming subpage modifications, and make the page lock/unlock pair easier to understand, this patch will still call extent_clear_unlock_delalloc() with locked_page, and only unlock the page in __extent_writepage(). Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
When __process_pages_contig() gets called for extent_clear_unlock_delalloc(), if we hit the locked page, only Private2 bit is updated, but dirty/writeback/error bits are all skipped. There are several call sites that call extent_clear_unlock_delalloc() with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK - cow_file_range() - run_delalloc_nocow() - cow_file_range_async() All for their error handling branches. For those call sites, since we skip the locked page for dirty/error/writeback bit update, the locked page will still have its subpage dirty bit remaining. Normally it's the call sites which locked the page to handle the locked page, but it won't hurt if we also do the update. Especially there are already other call sites doing the same thing by manually passing NULL as locked_page. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
This involves the following modification: - Ordered extent creation This is done in process_one_page(), now PAGE_SET_ORDERED will call subpage helper to do the work. - endio functions This is done in btrfs_mark_ordered_io_finished(). - btrfs_invalidatepage() - btrfs_cleanup_ordered_extents() Use the subpage page helper, and add an extra branch to exit if the locked page have covered the full range. Now the usage of page Ordered flag for ordered extent accounting is fully subpage compatible. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
This patch introduces the following functions to handle btrfs subpage ordered (Private2) status: - btrfs_subpage_set_ordered() - btrfs_subpage_clear_ordered() - btrfs_subpage_test_ordered() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_ordered() - btrfs_page_clear_ordered() - btrfs_page_test_ordered() These helpers can handle both regular sector size and subpage without problem. These functions are here to coordinate btrfs_invalidatepage() with btrfs_writepage_endio_finish_ordered(), to make sure only one of those functions can finish the ordered extent. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Introduce a new data inodes specific subpage member, writers, to record how many sectors are under page lock for delalloc writing. This member acts pretty much the same as readers, except it's only for delalloc writes. This is important for delalloc code to trace which page can really be freed, as we have cases like run_delalloc_nocow() where we may exit processing nocow range inside a page, but need to exit to do cow half way. In that case, we need a way to determine if we can really unlock a full page. With the new btrfs_subpage::writers, there is a new requirement: - Page locked by process_one_page() must be unlocked by process_one_page() There are still tons of call sites manually lock and unlock a page, without updating btrfs_subpage::writers. So if we lock a page through process_one_page() then it must be unlocked by process_one_page() to keep btrfs_subpage::writers consistent. This will be handled in next patch. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Now in end_bio_extent_writepage(), the only subpage incompatible code is the end_page_writeback(). Just call the subpage helpers. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
For __process_pages_contig() and process_one_page(), to handle subpage we only need to pass bytenr in and call subpage helpers to handle dirty/error/writeback status. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Since the extent io tree operations in btrfs_dirty_pages() are already subpage compatible, we only need to make the page status update to use subpage helpers. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Just like read page, for subpage support we only require sector size alignment. So change the error message condition to only require sector alignment. This should not affect existing code, as for regular sectorsize == PAGE_SIZE case, we are still requiring page alignment. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
In the coming subpage RW supports, there are a lot of page status update calls which need to be converted to subpage compatible version, which needs @start and @len. Some call sites already have such @start/@len and are already in page range, like various endio functions. But there are also call sites which need to clamp the range for subpage case, like btrfs_dirty_pagse() and __process_contig_pages(). Here we introduce new helpers, btrfs_page_clamp_*(), to do and only do the clamp for subpage version. Although in theory all existing btrfs_page_*() calls can be converted to use btrfs_page_clamp_*() directly, but that would make us to do unnecessary clamp operations. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
In __process_pages_contig() we update page status according to page_ops. That update process is a bunch of 'if' branches, which lie inside two loops, this makes it pretty hard to expand for later subpage operations. So this patch will extract these operations into its own function, process_one_pages(). Also since we're refactoring __process_pages_contig(), also move the new helper and __process_pages_contig() before the first caller of them, to remove the forward declaration. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
As a preparation for incoming subpage support, we need bytenr passed to __process_pages_contig() directly, not the current page index. So change the parameter and all callers to pass bytenr in. With the modification, here we need to replace the old @index_ret with @processed_end for __process_pages_contig(), but this brings a small problem. Normally we follow the inclusive return value, meaning @processed_end should be the last byte we processed. If parameter @start is 0, and we failed to lock any page, then we would return @processed_end as -1, causing more problems for __unlock_for_delalloc(). So here for @processed_end, we use two different return value patterns. If we have locked any page, @processed_end will be the last byte of locked page. Or it will be @start otherwise. This change will impact lock_delalloc_pages(), so it needs to check @processed_end to only unlock the range if we have locked any. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] When running subpage preparation patches on x86, btrfs/125 will hang forever with one ordered extent never finished. [CAUSE] The test case btrfs/125 itself will always fail as the fix is never merged. When the test fails at balance, btrfs needs to cleanup the ordered extent in btrfs_cleanup_ordered_extents() for data reloc inode. The problem is in the sequence how we cleanup the page Order bit. Currently it works like: btrfs_cleanup_ordered_extents() |- find_get_page(); |- btrfs_page_clear_ordered(page); | Now the page doesn't have Ordered bit anymore. | !!! This also includes the first (locked) page !!! | |- offset += PAGE_SIZE | This is to skip the first page |- __endio_write_update_ordered() |- btrfs_mark_ordered_io_finished(NULL) Except the first page, all ordered extents are finished. Then the locked page is cleaned up in __extent_writepage(): __extent_writepage() |- If (PageError(page)) |- end_extent_writepage() |- btrfs_mark_ordered_io_finished(page) |- if (btrfs_test_page_ordered(page)) |- !!! The page gets skipped !!! The ordered extent is not decreased as the page doesn't have ordered bit anymore. This leaves the ordered extent with bytes_left == sectorsize, thus never finish. [FIX] The fix is to ensure we never clear page Ordered bit without running the ordered extent accounting. Here we choose to skip the locked page in btrfs_cleanup_ordered_extents() so that later end_extent_writepage() can properly finish the ordered extent. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Inside btrfs we use Private2 page status to indicate we have an ordered extent with pending IO for the sector. But the page status name, Private2, tells us nothing about the bit itself, so this patch will rename it to Ordered. And with extra comment about the bit added, so reader who is still uncertain about the page Ordered status, will find the comment pretty easily. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
This patch will refactor btrfs_invalidatepage() for the incoming subpage support. The involved modifications are: - Use while() loop instead of "goto again;" - Use single variable to determine whether to delete extent states Each branch will also have comments why we can or cannot delete the extent states - Do qgroup free and extent states deletion per-loop Current code can only work for PAGE_SIZE == sectorsize case. This refactor also makes it clear what we do for different sectors: - Sectors without ordered extent We're completely safe to remove all extent states for the sector(s) - Sectors with ordered extent, but no Private2 bit This means the endio has already been executed, we can't remove all extent states for the sector(s). - Sectors with ordere extent, still has Private2 bit This means we need to decrease the ordered extent accounting. And then it comes to two different variants: * We have finished and removed the ordered extent Then it's the same as "sectors without ordered extent" * We didn't finished the ordered extent We can remove some extent states, but not all. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Although we already have btrfs_lookup_first_ordered_extent() and btrfs_lookup_ordered_extent(), they all have their own limitations: - btrfs_lookup_ordered_extent() can't do extra range check It's only designed to lookup any ordered extent before certain bytenr. - btrfs_lookup_first_ordered_extent() may not return the first ordered extent in the range It doesn't ensure the first ordered extent is returned. The existing callers are only interested in exhausting all ordered extents in a range, the order is not important. For incoming btrfs_invalidatepage() refactoring, we need a way to properly iterate all ordered extents in their bytenr order of a range. So this patch will introduce a new function, btrfs_lookup_first_ordered_range(), to do ordered extent with bytenr order awareness and extra range check. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The existing comments in btrfs_invalidatepage() don't really get to the point, especially for what Private2 is really representing and how the race avoidance is done. The truth is, there are only three entrances to do ordered extent accounting: - btrfs_writepage_endio_finish_ordered() - __endio_write_update_ordered() Those two entrance are just endio functions for dio and buffered write. - btrfs_invalidatepage() But there is a pitfall, in endio functions there is no check on whether the ordered extent is already accounted. They just blindly clear the Private2 bit and do the accounting. So it's all btrfs_invalidatepage()'s responsibility to make sure we won't do double account for the same sector. That's why in btrfs_invalidatepage() we have to wait for page writeback, this will ensure all submitted bios have finished, thus their endio functions have finished the accounting on the ordered extent. Then we also check page Private2 to ensure that, we only run ordered extent accounting on pages who has no bio submitted. This patch will rework related comments to make it more clear on the race and how we use wait_on_page_writeback() and Private2 to prevent double accounting on ordered extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Btrfs has two endio functions to mark certain io range finished for ordered extents: - __endio_write_update_ordered() This is for direct IO - btrfs_writepage_endio_finish_ordered() This for buffered IO. However they go different routines to handle ordered extent io: - Whether to iterate through all ordered extents __endio_write_update_ordered() will but btrfs_writepage_endio_finish_ordered() will not. In fact, iterating through all ordered extents will benefit later subpage support, while for current PAGE_SIZE == sectorsize requirement this behavior makes no difference. - Whether to update page Private2 flag __endio_write_update_ordered() will not update page Private2 flag as for iomap direct IO, the page can not be even mapped. While btrfs_writepage_endio_finish_ordered() will clear Private2 to prevent double accounting against btrfs_invalidatepage(). Those differences are pretty subtle, and the ordered extent iterations code in callers makes code much harder to read. So this patch will introduce a new function, btrfs_mark_ordered_io_finished(), to do the heavy lifting: - Iterate through all ordered extents in the range - Do the ordered extent accounting - Queue the work for finished ordered extent This function has two new feature: - Proper underflow detection and recovery The old underflow detection will only detect the problem, then continue. No proper info like root/inode/ordered extent info, nor noisy enough to be caught by fstests. Furthermore when underflow happens, the ordered extent will never finish. New error detection will reset the bytes_left to 0, do proper kernel warning, and output extra info including root, ino, ordered extent range, the underflow value. - Prevent double accounting based on Private2 flag Now if we find a range without Private2 flag, we will skip to next range. As that means someone else has already finished the accounting of ordered extent. This makes no difference for current code, but will be a critical part for incoming subpage support, as we can call btrfs_mark_ordered_io_finished() for multiple sectors if they are beyond inode size. Thus such double accounting prevention is a key feature for subpage. Now both endio functions only need to call that new function. And since the only caller of btrfs_dec_test_first_ordered_pending() is removed, also remove btrfs_dec_test_first_ordered_pending() completely. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Currently we use page Private2 bit to indicate that we have ordered extent for the page range. But the lifespan of it is not consistent, during regular writeback path, there are two locations to clear the same PagePrivate2: T ----- Page marked Dirty | + ----- Page marked Private2, through btrfs_run_dealloc_range() | + ----- Page cleared Private2, through btrfs_writepage_cow_fixup() | in __extent_writepage_io() | ^^^ Private2 cleared for the first time | + ----- Page marked Writeback, through btrfs_set_range_writeback() | in __extent_writepage_io(). | + ----- Page cleared Private2, through | btrfs_writepage_endio_finish_ordered() | ^^^ Private2 cleared for the second time. | + ----- Page cleared Writeback, through btrfs_writepage_endio_finish_ordered() Currently PagePrivate2 is mostly to prevent ordered extent accounting being executed for both endio and invalidatepage. Thus only the one who cleared page Private2 is responsible for ordered extent accounting. But the fact is, in btrfs_writepage_endio_finish_ordered(), page Private2 is cleared and ordered extent accounting is executed unconditionally. The race prevention only happens through btrfs_invalidatepage(), where we wait for the page writeback first, before checking the Private2 bit. This means, Private2 is also protected by Writeback bit, and there is no need for btrfs_writepage_cow_fixup() to clear Priavte2. This patch will change btrfs_writepage_cow_fixup() to just check PagePrivate2, not to clear it. The clearing will happen in either btrfs_invalidatepage() or btrfs_writepage_endio_finish_ordered(). This makes the Private2 bit easier to understand, just meaning the page has unfinished ordered extent attached to it. And this patch is a hard requirement for the incoming refactoring for how we finished ordered IO for endio context, as the coming patch will check Private2 to determine if we need to do the ordered extent accounting. Thus this patch is definitely needed or we will hang due to unfinished ordered extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in end_compressed_bio_write(). It passes compressed pages to btrfs_writepage_endio_finish_ordered(), which is only supposed to accept inode pages. Thankfully the important info here is the inode, so let's pass btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and make @page parameter optional. By this, end_compressed_bio_write() can happily pass page=NULL while still getting everything done properly. Also, to cooperate with such modification, replace @page parameter for trace_btrfs_writepage_end_io_hook() with btrfs_inode. Although this removes page_index info, the existing start/len should be enough for most usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
For subpage metadata, we're reusing two functions for subpage metadata write: - end_bio_extent_buffer_writepage() - write_one_eb() But the truth is, for subpage we just call end_bio_subpage_eb_writepage() without using any bit in end_bio_extent_buffer_writepage(). For write_one_eb(), it's pretty similar, but with a small part of code reused. There is really no need to pollute the existing code path if we're not really using most of them. So this patch will do the following change to separate the subpage metadata write path from regular write path by: - Use end_bio_subpage_eb_writepage() directly as endio in write_one_subpage_eb() - Directly call write_one_subpage_eb() in submit_eb_subpage() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
There is a lot of code inside extent_io.c needs both "struct bio **bio_ret" and "unsigned long prev_bio_flags", along with some parameters like "unsigned long bio_flags". Such strange parameters are here for bio assembly. For example, we have such inode page layout: 0 4K 8K 12K |<-- Extent A-->|<- EB->| Then what we do is: - Page [0, 4K) *bio_ret = NULL So we allocate a new bio to bio_ret, Add page [0, 4K) to *bio_ret. - Page [4K, 8K) *bio_ret != NULL We found this page is continuous to *bio_ret, and if we're not at stripe boundary, we add page [4K, 8K) to *bio_ret. - Page [8K, 12K) *bio_ret != NULL But we found this page is not continuous, so we submit *bio_ret, then allocate a new bio, and add page [8K, 12K) to the new bio. This means we need to record both the bio and its bio_flag, but we record them manually using those strange parameter list, other than encapsulating them into their own structure. So this patch will introduce a new structure, btrfs_bio_ctrl, to record both the bio, and its bio_flags. Also, in above case, for all pages added to the bio, we need to check if the new page crosses stripe boundary. This check itself can be time consuming, and we don't really need to do that for each page. This patch also integrates the stripe boundary check into btrfs_bio_ctrl. When a new bio is allocated, the stripe and ordered extent boundary is also calculated, so no matter how large the bio will be, we only calculate the boundaries once, to save some CPU time. The following functions/structures are affected: - struct extent_page_data Replace its bio pointer with structure btrfs_bio_ctrl (embedded structure, not pointer) - end_write_bio() - flush_write_bio() Just change how bio is fetched - btrfs_bio_add_page() Use pre-calculated boundaries instead of re-calculating them. And use @bio_ctrl to replace @bio and @prev_bio_flags. - calc_bio_boundaries() New function - submit_extent_page() callers - btrfs_do_readpage() callers - contiguous_readpages() callers To Use @bio_ctrl to replace @bio and @prev_bio_flags, and how to grab bio. - btrfs_bio_fits_in_ordered_extent() Removed, as now the ordered extent size limit is done at bio allocation time, no need to check for each page range. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Function btrfs_bio_fits_in_stripe() now requires a bio with at least one page added. Or btrfs_get_chunk_map() will fail with -ENOENT. But in fact this requirement is not needed at all, as we can just pass sectorsize for btrfs_get_chunk_map(). This tiny behavior change is important for later subpage refactoring on submit_extent_page(). As for 64K page size, we can have a page range with pgoff=0 and size=64K. If the logical bytenr is just 16K before the stripe boundary, we have to split the page range into two bios. This means, we must check page range against stripe boundary, even adding the range to an empty bio. This tiny refactoring is for the incoming changes, but on its own, regular sectorsize == PAGE_SIZE is not affected anyway. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The parameter @len is not really used in btrfs_bio_fits_in_stripe(), just remove it. It got removed in 42034313 ("btrfs: let callers of btrfs_get_io_geometry pass the em"), before that btrfs_get_chunk_map utilized it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Currently free space cache inode size is determined by two factors: - block group size - PAGE_SIZE This means, for the same sized block groups, with different PAGE_SIZE, it will result in different inode sizes. This will not be a good thing for subpage support, so change the requirement for PAGE_SIZE to sectorsize. Now for the same 4K sectorsize btrfs, it should result the same inode size no matter what the PAGE_SIZE is. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] For the following file layout, scrub will not be able to repair all these two repairable error, but in fact make one corruption even unrepairable: inode offset 0 4k 8K Mirror 1 |XXXXXX| | Mirror 2 | |XXXXXX| [CAUSE] The root cause is the hard coded PAGE_SIZE, which makes scrub repair to go crazy for subpage. For above case, when reading the first sector, we use PAGE_SIZE other than sectorsize to read, which makes us to read the full range [0, 64K). In fact, after 8K there may be no data at all, we can just get some garbage. Then when doing the repair, we also writeback a full page from mirror 2, this means, we will also writeback the corrupted data in mirror 2 back to mirror 1, leaving the range [4K, 8K) unrepairable. [FIX] This patch will modify the following PAGE_SIZE use with sectorsize: - scrub_print_warning_inode() Remove the min() and replace PAGE_SIZE with sectorsize. The min() makes no sense, as csum is done for the full sector with padding. This fixes a bug that subpage report extra length like: checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test, physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file) Where the error is only 1 sector. - scrub_handle_errored_block() Comments with PAGE|page involved, all changed to sector. - scrub_setup_recheck_block() - scrub_repair_page_from_good_copy() - scrub_add_page_to_wr_bio() - scrub_wr_submit() - scrub_add_page_to_rd_bio() - scrub_block_complete() Replace PAGE_SIZE with sectorsize. This solves several problems where we read/write extra range for subpage case. RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE usage, and is not really safe enough. Thus we will reject RAID56 for subpage in later commit. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Instead of calling list_entry with head->prev simply call list_last_entry which makes it obvious which member of the list is being referred. This allows to remove the extra 'prev' pointer. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Commit e5d74902 ("btrfs: derive maximum output size in the compression implementation") removed @max_out argument in btrfs_compress_pages() but its comment remained, remove it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Patch "btrfs: reduce compressed_bio member's types" reduced some member's size. Function arguments @len, @compressed_len and @nr_pages can be declared as unsigned int. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Patch "btrfs: reduce compressed_bio member's types" reduced some member's size. Declare the variables @compressed_len, @nr_pages and @pg_index size as an unsigned int in the function btrfs_submit_compressed_read. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Patch "btrfs: reduce compressed_bio member's types" reduced the @nr_pages size to unsigned int, its cascading effects are updated here. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When logging an inode we always log all its xattrs, so that we are able to figure out which ones should be deleted during log replay. However this is unnecessary when we are doing a fast fsync and no xattrs were added, changed or deleted since the last time we logged the inode in the current transaction. So skip the logging of xattrs when the inode was previously logged in the current transaction and no xattrs were added, changed or deleted. If any changes to xattrs happened, than the inode has BTRFS_INODE_COPY_EVERYTHING set in its runtime flags and the xattrs get logged. This saves time on scanning for xattrs, allocating memory, COWing log tree extent buffers and adding more lock contention on the extent buffers when there are multiple tasks logging in parallel. The use of xattrs is common when using ACLs, some applications, or when using security modules like SELinux where every inode gets a security xattr added to it. The following test script, using fio, was used on a box with 12 cores, 64G of RAM, a NVMe device and the default non-debug kernel config from Debian. It uses 8 concurrent jobs each writing in blocks of 64K to its own 4G file, each file with a single xattr of 50 bytes (about the same size for an ACL or SELinux xattr), doing random buffered writes with an fsync after each write. $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/test MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-d single -m single" NUM_JOBS=8 FILE_SIZE=4G cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=1 fallocate=none group_reporting=1 direct=0 bs=64K ioengine=sync size=$FILE_SIZE directory=$MNT numjobs=$NUM_JOBS EOF echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo "Creating files before fio runs, each with 1 xattr of 50 bytes" for ((i = 0; i < $NUM_JOBS; i++)); do path="$MNT/writers.$i.0" truncate -s $FILE_SIZE $path setfattr -n user.xa1 -v $(printf '%0.sX' $(seq 50)) $path done fio /tmp/fio-job.ini umount $MNT fio output before this change: WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=272145-272145msec fio output after this change: WRITE: bw=142MiB/s (149MB/s), 142MiB/s-142MiB/s (149MB/s-149MB/s), io=32.0GiB (34.4GB), run=230408-230408msec +16.8% throughput, -16.6% runtime Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Accept device name "cancel" as a request to cancel running device deletion operation. The string is literal, in case there's a real device named "cancel", pass it as full absolute path or as "./cancel" This works for v1 and v2 ioctls when the device is specified by name. Moving chunks from the device uses relocation, use the conditional exclusive operation start and cancellation helpers Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-