- 03 Oct, 2014 10 commits
-
-
Sage Weil authored
We check whether transid is already committed via last_trans_committed and then search through trans_list for pending transactions. If last_trans_committed is updated by btrfs_commit_transaction after we check it (there is no locking), we will fail to find the committed transaction and return EINVAL to the caller. This has been observed occasionally by ceph-osd (which uses this ioctl heavily). Fix by rechecking whether the provided transid <= last_trans_committed after the search fails, and if so return 0. Signed-off-by: Sage Weil <sage@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
While we have a transaction ongoing, the VM might decide at any time to call btree_inode->i_mapping->a_ops->writepages(), which will start writeback of dirty pages belonging to btree nodes/leafs. This call might return an error or the writeback might finish with an error before we attempt to commit the running transaction. If this happens, we might have no way of knowing that such error happened when we are committing the transaction - because the pages might no longer be marked dirty nor tagged for writeback (if a subsequent modification to the extent buffer didn't happen before the transaction commit) which makes filemap_fdata[write|wait]_range unable to find such pages (even if they're marked with SetPageError). So if this happens we must abort the transaction, otherwise we commit a super block with btree roots that point to btree nodes/leafs whose content on disk is invalid - either garbage or the content of some node/leaf from a past generation that got cowed or deleted and is no longer valid (for this later case we end up getting error messages like "parent transid verify failed on 10826481664 wanted 25748 found 29562" when reading btree nodes/leafs from disk). Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's i_mapping would not be enough because we need to distinguish between log tree extents (not fatal) vs non-log tree extents (fatal) and because the next call to filemap_fdatawait_range() will catch and clear such errors in the mapping - and that call might be from a log sync and not from a transaction commit, which means we would not know about the error at transaction commit time. Also, checking for the eb flag EXTENT_BUFFER_IOERR at transaction commit time isn't done and would not be completely reliable, as the eb might be removed from memory and read back when trying to get it, which clears that flag right before reading the eb's pages from disk, making us not know about the previous write error. Using the new 3 flags for the btree inode also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages() returns success, started writeback for all dirty pages and before filemap_fdatawait_range() is called, the writeback for all dirty pages had already finished with errors - because we were not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as it could not know that writeback errors happened (the pages were no longer tagged for writeback). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Fabian Frederick authored
Do like disk-io function declared under CONFIG_BTRFS_FS_RUN_SANITY_TESTS and keep prototype in qgroup.h only Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Chris Mason <clm@fb.com>
-
Fabian Frederick authored
cmp was declared twice in btrfs_compare_trees resulting in a shadow warning. This patch renames second internal variable. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Chris Mason <clm@fb.com>
-
Fabian Frederick authored
bi_sector and bi_size moved to bi_iter since commit 4f024f37 ("block: Abstract out bvec iterator") Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Chris Mason <clm@fb.com>
-
Liu Bo authored
This is actually inspired by Filipe's patch. When write_one_eb() fails on submit_extent_page(), it'll give up writing this eb and mark it with EXTENT_BUFFER_IOERR. So if it's not the last page that encounter the failure, there are some left pages which remain DIRTY, and if a later COW on this eb happens, ie. eb is COWed and freed, it'd run into BUG_ON in btrfs_release_extent_buffer_page() for the DIRTY page, ie. BUG_ON(PageDirty(page)); This adds the missing clear_page_dirty_for_io() for the rest pages of eb. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
If submit_extent_page() fails in write_one_eb(), we end up with the current page not marked dirty anymore, unlocked and marked for writeback. But we never end up calling end_page_writeback() against the page, which will make calls to filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting for the writeback bit to be cleared from the page. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
Previous commit: btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map is using wrong condition to judgement whether the range is a subset of a existing extent map. This may cause bug in btrfs no-holes mode. This patch will correct the judgment and fix the bug. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Josef Bacik authored
Marc Merlin sent me a broken fs image months ago where it would blow up in the upper->checked BUG_ON() in build_backref_tree. This is because we had a scenario like this block a -- level 4 (not shared) | block b -- level 3 (reloc block, shared) | block c -- level 2 (not shared) | block d -- level 1 (shared) | block e -- level 0 (shared) We go to build a backref tree for block e, we notice block d is shared and add it to the list of blocks to lookup it's backrefs for. Now when we loop around we will check edges for the block, so we will see we looked up block c last time. So we lookup block d and then see that the block that points to it is block c and we can just skip that edge since we've already been up this path. The problem is because we clear need_check when we see block d (as it is shared) we never add block b as needing to be checked. And because block c is in our path already we bail out before we walk up to block b and add it to the backref check list. To fix this we need to reset need_check if we trip over a block that doesn't need to be checked. This will make sure that any subsequent blocks in the path as we're walking up afterwards are added to the list to be processed. With this patch I can now mount Marc's fs image and it'll complete the balance without panicing. Thanks, Reported-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Josef Bacik authored
When balance panics it tends to panic in the BUG_ON(!upper->checked); test, because it means it couldn't build the backref tree properly. This is annoying to users and frankly a recoverable error, nothing in this function is actually fatal since it is just an in-memory building of the backrefs for a given bytenr. So go through and change all the BUG_ON()'s to ASSERT()'s, and fix the BUG_ON(!upper->checked) thing to just return an error. This patch also fixes the error handling so it tears down the work we've done properly. This code was horribly broken since we always just panic'ed instead of actually erroring out, so it needed to be completely re-worked. With this patch my broken image no longer panics when I mount it. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
- 23 Sep, 2014 3 commits
-
-
Josef Bacik authored
When doing log replay we may have to update inodes, which traditionally goes through our delayed inode stuff. This will try to move space over from the trans handle, but we don't reserve space in our trans handle on replay since we don't know how much we will need, so instead we try to flush. But because we have a trans handle open we won't flush anything, so if we are out of reserve space we will simply return ENOSPC. Since we know that if an operation made it into the log then we definitely had space before the box bought the farm then we don't need to worry about doing this space reservation. Use the fs_info->log_root_recovering flag to skip the delayed inode stuff and update the item directly. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Josef Bacik authored
Trying to reproduce a log enospc bug I hit a panic in the async reclaim code during log replay. This is because we use fs_info->fs_root as our root for shrinking and such. Technically we can use whatever root we want, but let's just not allow async reclaim while we're doing log replay. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Josef Bacik authored
One problem that has plagued us is that a user will use up all of his space with data, remove a bunch of that data, and then try to create a bunch of small files and run out of space. This happens because all the chunks were allocated for data since the metadata requirements were so low. But now there's a bunch of empty data block groups and not enough metadata space to do anything. This patch solves this problem by automatically deleting empty block groups. If we notice the used count go down to 0 when deleting or on mount notice that a block group has a used count of 0 then we will queue it to be deleted. When the cleaner thread runs we will double check to make sure the block group is still empty and then we will delete it. This patch has the side effect of no longer having a bunch of BUG_ON()'s in the chunk delete code, which will be helpful for both this and relocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
- 19 Sep, 2014 2 commits
-
-
Filipe Manana authored
When we do a fast fsync, we start all ordered operations and then while they're running in parallel we visit the list of modified extent maps and construct their matching file extent items and write them to the log btree. After that, in btrfs_sync_log() we wait for all the ordered operations to finish (via btrfs_wait_logged_extents). The problem with this is that we were completely ignoring errors that can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM or -EIO errors for example. When such error happens, it means we have parts of the on disk extent that weren't written to, and so we end up logging file extent items that point to these extents that contain garbage/random data - so after a crash/reboot plus log replay, we get our inode's metadata pointing to those extents. This worked in contrast with the full (non-fast) fsync path, where we start all ordered operations, wait for them to finish and then write to the log btree. In this path, after each ordered operation completes we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return -EIO if so (via btrfs_wait_ordered_range). So if an error happens with any ordered operation, just return a -EIO error to userspace, so that it knows that not all of its previous writes were durably persisted and the application can take proper action (like redo the writes for e.g.) - and definitely not leave any file extent items in the log refer to non fully written extents. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
When the fsync callback (btrfs_sync_file) starts, it first waits for the writeback of any dirty pages to start and finish without holding the inode's mutex (to reduce contention). After this it acquires the inode's mutex and repeats that process via btrfs_wait_ordered_range only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag is set on the inode). This is not safe for a non full sync - we need to start and wait for writeback to finish for any pages that might have been made dirty before acquiring the inode's mutex and after that first step mentioned before. Why this is needed is explained by the following comment added to btrfs_sync_file: "Right before acquiring the inode's mutex, we might have new writes dirtying pages, which won't immediately start the respective ordered operations - that is done through the fill_delalloc callbacks invoked from the writepage and writepages address space operations. So make sure we start all ordered operations before starting to log our inode. Not doing this means that while logging the inode, writeback could start and invoke writepage/writepages, which would call the fill_delalloc callbacks (cow_file_range, submit_compressed_extents). These callbacks add first an extent map to the modified list of extents and then create the respective ordered operation, which means in tree-log.c:btrfs_log_inode() we might capture all existing ordered operations (with btrfs_get_logged_extents()) before the fill_delalloc callback adds its ordered operation, and by the time we visit the modified list of extent maps (with btrfs_log_changed_extents()), we see and process the extent map they created. We then use the extent map to construct a file extent item for logging without waiting for the respective ordered operation to finish - this file extent item points to a disk location that might not have yet been written to, containing random data - so after a crash a log replay will make our inode have file extent items that point to disk locations containing invalid data, as we returned success to userspace without waiting for the respective ordered operation to finish, because it wasn't captured by btrfs_get_logged_extents()." Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
- 18 Sep, 2014 2 commits
-
-
Liu Bo authored
The tracepoint of extent map doesn't parse @flag correctly, we set @flag via set_bit(), so we need to parse it on a bit bias. Also add the missing flag, EXTENT_FLAG_FS_MAPPING. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
The following commit enhanced the merge_extent_mapping() to reduce fragment in extent map tree, but it can't handle case which existing lies before map_start: 51f39 btrfs: Use right extent length when inserting overlap extent map. [BUG] When existing extent map's start is before map_start, the em->len will be minus, which will corrupt the extent map and fail to insert the new extent map. This will happen when someone get a large extent map, but when it is going to insert it into extent map tree, some one has already commit some write and split the huge extent into small parts. [REPRODUCER] It is very easy to tiger using filebench with randomrw personality. It is about 100% to reproduce when using 8G preallocated file in 60s randonrw test. [FIX] This patch can now handle any existing extent position. Since it does not directly use existing->start, now it will find the previous and next extent around map_start. So the old existing->start < map_start bug will never happen again. [ENHANCE] This patch will insert the best fitted extent map into extent map tree, other than the oldest [map_start, map_start + sectorsize) or the relatively newer but not perfect [map_start, existing->start). The patch will first search existing extent that does not intersects with the desired map range [map_start, map_start + len). The existing extent will be either before or behind map_start, and based on the existing extent, we can find out the previous and next extent around map_start. So the best fitted extent would be [prev->end, next->start). For prev or next is not found, em->start would be prev->end and em->end wold be next->start. With this patch, the fragment in extent map tree should be reduced much more than the 51f39 commit and reduce an unneeded extent map tree search. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
- 17 Sep, 2014 23 commits
-
-
Liu Bo authored
An user reported this, it is because that lseek's SEEK_SET/SEEK_CUR/SEEK_END allow a negative value for @offset, but btrfs's SEEK_DATA/SEEK_HOLE don't prepare for that and convert the negative @offset into unsigned type, so we get (end < start) warning. [ 1269.835374] ------------[ cut here ]------------ [ 1269.836809] WARNING: CPU: 0 PID: 1241 at fs/btrfs/extent_io.c:430 insert_state+0x11d/0x140() [ 1269.838816] BTRFS: end < start 4094 18446744073709551615 [ 1269.840334] CPU: 0 PID: 1241 Comm: a.out Tainted: G W 3.16.0+ #306 [ 1269.858229] Call Trace: [ 1269.858612] [<ffffffff81801a69>] dump_stack+0x4e/0x68 [ 1269.858952] [<ffffffff8107894c>] warn_slowpath_common+0x8c/0xc0 [ 1269.859416] [<ffffffff81078a36>] warn_slowpath_fmt+0x46/0x50 [ 1269.859929] [<ffffffff813b0fbd>] insert_state+0x11d/0x140 [ 1269.860409] [<ffffffff813b1396>] __set_extent_bit+0x3b6/0x4e0 [ 1269.860805] [<ffffffff813b21c7>] lock_extent_bits+0x87/0x200 [ 1269.861697] [<ffffffff813a5b28>] btrfs_file_llseek+0x148/0x2a0 [ 1269.862168] [<ffffffff811f201e>] SyS_lseek+0xae/0xc0 [ 1269.862620] [<ffffffff8180b212>] system_call_fastpath+0x16/0x1b [ 1269.862970] ---[ end trace 4d33ea885832054b ]--- This assumes that btrfs starts finding DATA/HOLE from the beginning of file if the assigned @offset is negative. Also we add alignment for lock_extent_bits 's range. Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
After the data is written successfully, we should cleanup the read failure record in that range because - If we set data COW for the file, the range that the failure record pointed to is mapped to a new place, so it is invalid. - If we set no data COW for the file, and if there is no error during writting, the corrupted data is corrected, so the failure record can be removed. And if some errors happen on the mirrors, we also needn't worry about it because the failure record will be recreated if we read the same place again. Sometimes, we may fail to correct the data, so the failure records will be left in the tree, we need free them when we free the inode or the memory leak happens. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
This patch implement data repair function when direct read fails. The detail of the implementation is: - When we find the data is not right, we try to read the data from the other mirror. - When the io on the mirror ends, we will insert the endio work into the dedicated btrfs workqueue, not common read endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue, if we insert the endio work of the io on the mirror into that workqueue, deadlock would happen. - After we get right data, we write it back to the corrupted mirror. - And if the data on the new mirror is still corrupted, we will try next mirror until we read right data or all the mirrors are traversed. - After the above work, we set the uptodate flag according to the result. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
We need real mirror number for RAID0/5/6 when reading data, or if read error happens, we would pass 0 as the number of the mirror on which the io error happens. It is wrong and would cause the filesystem read the data from the corrupted mirror again. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
We could not use clean_io_failure in the direct IO path because it got the filesystem information from the page structure, but the page in the direct IO bio didn't have the filesystem information in its structure. So we need modify it and pass all the information it need by parameters. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
The original code of repair_io_failure was just used for buffered read, because it got some filesystem data from page structure, it is safe for the page in the page cache. But when we do a direct read, the pages in bio are not in the page cache, that is there is no filesystem data in the page structure. In order to implement direct read data repair, we need modify repair_io_failure and pass all filesystem data it need by function parameters. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
The data repair function of direct read will be implemented later, and some code in bio_readpage_error will be reused, so split bio_readpage_error into several functions which will be used in direct read repair later. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
We forgot to free failure record and bio after submitting re-read bio failed, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
Direct IO splits the original bio to several sub-bios because of the limit of raid stripe, and the filesystem will wait for all sub-bios and then run final end io process. But it was very hard to implement the data repair when dio read failure happens, because at the final end io function, we didn't know which mirror the data was read from. So in order to implement the data repair, we have to move the file data check in the final end io function to the sub-bio end io function, in which we can get the mirror number of the device we access. This patch did this work as the first step of the direct io data repair implementation. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
The current code would load checksum data for several times when we split a whole direct read io because of the limit of the raid stripe, it would make us search the csum tree for several times. In fact, it just wasted time, and made the contention of the csum tree root be more serious. This patch improves this problem by loading the data at once. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
rw_devices counter is often used to tune the profile when doing chunk allocation, so we should modify it under the chunk_mutex context to avoid getting wrong chunk profile. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
For a missing device, we don't know it belong to which fs before we read its fsid from the chunk tree. So we add them into the current fs device list at first. When we get its fsid, we should move them to their own fs device list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
When we open a seed filesystem, if the degraded mount option is set, we continue to mount the fs if we don't find some devices in the seed filesystem. But we should stop mounting if other errors happen. Fix it Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
The problem is: Task0(device scan task) Task1(device replace task) scan_one_device() mutex_lock(&uuid_mutex) device = find_device() mutex_lock(&device_list_mutex) lock_chunk() rm_and_free_source_device unlock_chunk() mutex_unlock(&device_list_mutex) check device Destroying the target device if device replace fails also has the same problem. We fix this problem by locking uuid_mutex during destroying source device or target device, just like the device remove operation. It is a temporary solution, we can fix this problem and make the code more clear by atomic counter in the future. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
We can build a new filesystem based a seed filesystem, and we need clone the fs devices when we open the new filesystem. But someone might clear the seed flag of the seed filesystem, then mount that filesystem and remove some device. If we mount the new filesystem, we might access a device list which was being changed when we clone the fs devices. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
There were several problems about chunk mutex usage: - Lock chunk mutex when updating metadata. It would cause the nested deadlock because updating metadata might need allocate new chunks that need acquire chunk mutex. We remove chunk mutex at this case, because b-tree lock and other lock mechanism can help us. - ABBA deadlock occured between device_list_mutex and chunk_mutex. When we update device status, we must acquire device_list_mutex at the beginning, and then we might get chunk_mutex during the device status update because we need allocate new chunks for metadata COW. But at most place, we acquire chunk_mutex at first and then acquire device list mutex. We need change the lock order. - Some place we needn't acquire chunk_mutex. For example we needn't get chunk_mutex when we free a empty seed fs_devices structure. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
When we get the fs information, we forgot to acquire the mutex of device list, it might cause the problem we might access a device that was removed. Fix it by acquiring the device list mutex. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
We didn't protect the system chunk array when we added a new system chunk into it, it would cause the array be corrupted if someone remove/add some system chunk into array at the same time. Fix it by chunk lock. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
->total_bytes,->disk_total_bytes,->bytes_used is protected by chunk lock when we change them, but sometimes we read them without any lock, and we might get unexpected value. We fix this problem like inode's i_size. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
We should update free_chunk_space in time when we allocate a new chunk, not when we deal with the pending device update and block group insertion, because we need the real free_chunk_space data to calculate the reserved space, if we don't update it in time, we would consider the disk space which has be allocated as free space, and would use it to do overcommit reservation. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-