- 07 Apr, 2016 1 commit
-
-
Filipe Manana authored
If we rename an inode A (be it a file or a directory), create a new inode B with the old name of inode A and under the same parent directory, fsync inode B and then power fail, at log tree replay time we end up removing inode A completely. If inode A is a directory then all its files are gone too. Example scenarios where this happens: This is reproducible with the following steps, taken from a couple of test cases written for fstests which are going to be submitted upstream soon: # Scenario 1 mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir -p /mnt/a/x echo "hello" > /mnt/a/x/foo echo "world" > /mnt/a/x/bar sync mv /mnt/a/x /mnt/a/y mkdir /mnt/a/x xfs_io -c fsync /mnt/a/x <power failure happens> The next time the fs is mounted, log tree replay happens and the directory "y" does not exist nor do the files "foo" and "bar" exist anywhere (neither in "y" nor in "x", nor the root nor anywhere). # Scenario 2 mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir /mnt/a echo "hello" > /mnt/a/foo sync mv /mnt/a/foo /mnt/a/bar echo "world" > /mnt/a/foo xfs_io -c fsync /mnt/a/foo <power failure happens> The next time the fs is mounted, log tree replay happens and the file "bar" does not exists anymore. A file with the name "foo" exists and it matches the second file we created. Another related problem that does not involve file/data loss is when a new inode is created with the name of a deleted snapshot and we fsync it: mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir /mnt/testdir btrfs subvolume snapshot /mnt /mnt/testdir/snap btrfs subvolume delete /mnt/testdir/snap rmdir /mnt/testdir mkdir /mnt/testdir xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir <power failure> The next time the fs is mounted the log replay procedure fails because it attempts to delete the snapshot entry (which has dir item key type of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry, resulting in the following error that causes mount to fail: [52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257 [52174.512570] ------------[ cut here ]------------ [52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]() [52174.514681] BTRFS: Transaction aborted (error -2) [52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc [52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G W 4.5.0-rc6-btrfs-next-27+ #1 [52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [52174.524053] 0000000000000000 ffff8801df2a7710 ffffffff81264e93 ffff8801df2a7758 [52174.524053] 0000000000000009 ffff8801df2a7748 ffffffff81051618 ffffffffa03591cd [52174.524053] 00000000fffffffe ffff88015e6e5000 ffff88016dbc3c88 ffff88016dbc3c88 [52174.524053] Call Trace: [52174.524053] [<ffffffff81264e93>] dump_stack+0x67/0x90 [52174.524053] [<ffffffff81051618>] warn_slowpath_common+0x99/0xb2 [52174.524053] [<ffffffffa03591cd>] ? __btrfs_unlink_inode+0x178/0x351 [btrfs] [52174.524053] [<ffffffff81051679>] warn_slowpath_fmt+0x48/0x50 [52174.524053] [<ffffffffa03591cd>] __btrfs_unlink_inode+0x178/0x351 [btrfs] [52174.524053] [<ffffffff8118f5e9>] ? iput+0xb0/0x284 [52174.524053] [<ffffffffa0359fe8>] btrfs_unlink_inode+0x1c/0x3d [btrfs] [52174.524053] [<ffffffffa038631e>] check_item_in_log+0x1fe/0x29b [btrfs] [52174.524053] [<ffffffffa0386522>] replay_dir_deletes+0x167/0x1cf [btrfs] [52174.524053] [<ffffffffa038739e>] fixup_inode_link_count+0x289/0x2aa [btrfs] [52174.524053] [<ffffffffa038748a>] fixup_inode_link_counts+0xcb/0x105 [btrfs] [52174.524053] [<ffffffffa038a5ec>] btrfs_recover_log_trees+0x258/0x32c [btrfs] [52174.524053] [<ffffffffa03885b2>] ? replay_one_extent+0x511/0x511 [btrfs] [52174.524053] [<ffffffffa034f288>] open_ctree+0x1dd4/0x21b9 [btrfs] [52174.524053] [<ffffffffa032b753>] btrfs_mount+0x97e/0xaed [btrfs] [52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf [52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131 [52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde [52174.524053] [<ffffffffa032af81>] btrfs_mount+0x1ac/0xaed [btrfs] [52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf [52174.524053] [<ffffffff8108c262>] ? lockdep_init_map+0xb9/0x1b3 [52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131 [52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde [52174.524053] [<ffffffff8119590f>] do_mount+0x8a6/0x9e8 [52174.524053] [<ffffffff811358dd>] ? strndup_user+0x3f/0x59 [52174.524053] [<ffffffff81195c65>] SyS_mount+0x77/0x9f [52174.524053] [<ffffffff814935d7>] entry_SYSCALL_64_fastpath+0x12/0x6b [52174.561288] ---[ end trace 6b53049efb1a3ea6 ]--- Fix this by forcing a transaction commit when such cases happen. This means we check in the commit root of the subvolume tree if there was any other inode with the same reference when the inode we are fsync'ing is a new inode (created in the current transaction). Test cases for fstests, covering all the scenarios given above, were submitted upstream for fstests: * fstests: generic test for fsync after renaming directory https://patchwork.kernel.org/patch/8694281/ * fstests: generic test for fsync after renaming file https://patchwork.kernel.org/patch/8694301/ * fstests: add btrfs test for fsync after snapshot deletion https://patchwork.kernel.org/patch/8670671/ Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
- 06 Apr, 2016 1 commit
-
-
Chris Mason authored
Merge branch 'misc-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6
-
- 04 Apr, 2016 8 commits
-
-
Yauhen Kharuzhy authored
If device replace entry was found on disk at mounting and its num_write_errors stats counter has non-NULL value, then replace operation will never be finished and -EIO error will be reported by btrfs_scrub_dev() because this counter is never reset. # mount -o degraded /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/ # btrfs replace status /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/ Started on 25.Mar 07:28:00, canceled on 25.Mar 07:28:01 at 0.0%, 40 write errs, 0 uncorr. read errs # btrfs replace start -B 4 /dev/sdg /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/ ERROR: ioctl(DEV_REPLACE_START) failed on "/media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/": Input/output error, no error Reset num_write_errors and num_uncorrectable_read_errors counters in the dev_replace structure before start of replacing. Signed-off-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Mark Fasheh authored
This patch adds tracepoints to the qgroup code on both the reporting side (insert_dirty_extents) and the accounting side. Taken together it allows us to see what qgroup operations have happened, and what their result was. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
The fd we pass in may not be on a btrfs file system, so don't try to do BTRFS_I() on it. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The allocation of node could fail if the memory is too fragmented for a given node size, practically observed with 64k. http://article.gmane.org/gmane.comp.file-systems.btrfs/54689Reported-and-tested-by: Jean-Denis Girard <jd.girard@sysnux.pf> Signed-off-by: David Sterba <dsterba@suse.com>
-
Mark Fasheh authored
create_pending_snapshot() will go readonly on _any_ error return from btrfs_qgroup_inherit(). If qgroups are enabled, a user can crash their fs by just making a snapshot and asking it to inherit from an invalid qgroup. For example: $ btrfs sub snap -i 1/10 /btrfs/ /btrfs/foo Will cause a transaction abort. Fix this by only throwing errors in btrfs_qgroup_inherit() when we know going readonly is acceptable. The following xfstests test case reproduces this bug: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" here=`pwd` tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # remove previous $seqres.full before test rm -f $seqres.full # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch rm -f $seqres.full _scratch_mkfs _scratch_mount _run_btrfs_util_prog quota enable $SCRATCH_MNT # The qgroup '1/10' does not exist and should be silently ignored _run_btrfs_util_prog subvolume snapshot -i 1/10 $SCRATCH_MNT $SCRATCH_MNT/snap1 _scratch_unmount echo "Silence is golden" status=0 exit Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
As one user in mail list report reproducible balance ENOSPC error, it's better to add more debug info for enospc_debug mount option. Reported-by: Marc Haber <mh+linux-btrfs@zugschlus.de> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Liu Bo authored
Dan Carpenter's static checker has found this error, it's introduced by commit 64c043de ("Btrfs: fix up read_tree_block to return proper error") It's really supposed to 'break' the loop on error like others. Cc: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Davide Italiano authored
- We call inode_size_ok() only if FL_KEEP_SIZE isn't specified. - As an optimisation we can skip the call if (off + len) isn't greater than the current size of the file. This operation is called under the lock so the less work we do, the better. - If we call inode_size_ok() pass to it the correct value rather than a more conservative estimation. Signed-off-by: Davide Italiano <dccitaliano@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 25 Mar, 2016 1 commit
-
-
Chris Mason authored
Merge branch 'misc-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6
-
- 22 Mar, 2016 4 commits
-
-
Jiri Kosina authored
transaction_kthread() is calling try_to_freeze(), but that's just an expeinsive no-op given the fact that the thread is not marked freezable. After removing this, disk-io.c is now independent on freezer API. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>
-
Jiri Kosina authored
cleaner_kthread() is not marked freezable, and therefore calling try_to_freeze() in its context is a pointless no-op. In addition to that, as has been clearly demonstrated by 80ad623e ("Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"), it's perfectly valid / legal for cleaner_kthread() to stay scheduled out in an arbitrary place during suspend (in that particular example that was waiting for reading of extent pages), so there is no need to leave any traces of freezer in this kthread. Fixes: 80ad623e ("Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()") Fixes: 69624913 ("btrfs: clear PF_NOFREEZE in cleaner_kthread()") Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>
-
Alex Lyakas authored
csum_dirty_buffer was issuing a warning in case the extent buffer did not look alright, but was still returning success. Let's return error in this case, and also add an additional sanity check on the extent buffer header. The caller up the chain may BUG_ON on this, for example flush_epd_write_bio will, but it is better than to have a silent metadata corruption on disk. Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Alex Lyakas authored
Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 21 Mar, 2016 1 commit
-
-
Chris Mason authored
Commit c40a3d38 (Btrfs: Compute and look up csums based on sectorsized blocks) changes around how we walk the bios while looking up crcs. There's an inner loop that is jumping to the next bvec based on sectors and before it derefs the next bvec, it needs to make sure we're still in the bio. In this case, the outer loop would have decided to stop moving forward too, and the bvec deref is never actually used for anything. But CONFIG_DEBUG_PAGEALLOC catches it because we're outside our bio. Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com>
-
- 14 Mar, 2016 2 commits
-
-
Adam Buchbinder authored
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Ashish Samant authored
Dont print warning for ENOSPC error unless ENOSPC_DEBUG is enabled. Use btrfs_debug if it is enabled. Signed-off-by: Ashish Samant <ashish.samant@oracle.com> [ preserve the WARN_ON ] Signed-off-by: David Sterba <dsterba@suse.com>
-
- 11 Mar, 2016 6 commits
-
-
Dan Carpenter authored
It's basically harmless if "ref_level" isn't initialized since it's only used for an error message, but it causes a static checker warning. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
So that its better organized. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
So that it indicates what it does. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Satoru Takeuchi authored
It's better to show a warning message for the exceptional case that one of objectid (in most case, inode number) reaches its highest value. For example, if inode cache is off and this event happens, we can't create any file even if there are not so many files. This message ease detecting such problem. Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The document in the kernel sources is yet another palce where the documentation would need to be updated, while it is not the primary source. We actively maintain the wiki pages. Signed-off-by: David Sterba <dsterba@suse.com>
-
Rasmus Villemoes authored
This is more readable. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 01 Mar, 2016 8 commits
-
-
Filipe Manana authored
When logging that an inode exists, for example as part of a directory fsync operation, we were collecting any ordered extents for the inode but we ended up doing nothing with them except tagging them as processed, by setting the flag BTRFS_ORDERED_LOGGED on them, which prevented a subsequent fsync of that inode (using the LOG_INODE_ALL mode) from collecting and processing them. This created a time window where a second fsync against the inode, using the fast path, ended up not logging the checksums for the new extents but it logged the extents since they were part of the list of modified extents. This happened because the ordered extents were not collected and checksums were not yet added to the csum tree - the ordered extents have not gone through btrfs_finish_ordered_io() yet (which is where we add them to the csum tree by calling inode.c:add_pending_csums()). So fix this by not collecting an inode's ordered extents if we are logging it with the LOG_INODE_EXISTS mode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
If we're about to do a fast fsync for an inode and btrfs_inode_in_log() returns false, it's possible that we had an ordered extent in progress (btrfs_finish_ordered_io() not run yet) when we noticed that the inode's last_trans field was not greater than the id of the last committed transaction, but shortly after, before we checked if there were any ongoing ordered extents, the ordered extent had just completed and removed itself from the inode's ordered tree, in which case we end up not logging the inode, losing some data if a power failure or crash happens after the fsync handler returns and before the transaction is committed. Fix this by checking first if there are any ongoing ordered extents before comparing the inode's last_trans with the id of the last committed transaction - when it completes, an ordered extent always updates the inode's last_trans before it removes itself from the inode's ordered tree (at btrfs_finish_ordered_io()). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
In the listxattrs handler, we were not listing all the xattrs that are packed in the same btree item, which happens when multiple xattrs have a name that when crc32c hashed produce the same checksum value. Fix this by processing them all. The following test case for xfstests reproduces the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/attr # real QA test starts here _supported_fs generic _supported_os Linux _require_scratch _require_attrs rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount # Create our test file with a few xattrs. The first 3 xattrs have a name # that when given as input to a crc32c function result in the same checksum. # This made btrfs list only one of the xattrs through listxattrs system call # (because it packs xattrs with the same name checksum into the same btree # item). touch $SCRATCH_MNT/testfile $SETFATTR_PROG -n user.foobar -v 123 $SCRATCH_MNT/testfile $SETFATTR_PROG -n user.WvG1c1Td -v qwerty $SCRATCH_MNT/testfile $SETFATTR_PROG -n user.J3__T_Km3dVsW_ -v hello $SCRATCH_MNT/testfile $SETFATTR_PROG -n user.something -v pizza $SCRATCH_MNT/testfile $SETFATTR_PROG -n user.ping -v pong $SCRATCH_MNT/testfile # Now call getfattr with --dump, which calls the listxattrs system call. # It should list all the xattrs we have set before. $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/testfile | _filter_scratch status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
While running a test with a mix of buffered IO and direct IO against the same files I hit a deadlock reported by the following trace: [11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds. [11642.142452] Not tainted 4.4.0-rc6-btrfs-next-21+ #1 [11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [11642.146332] kworker/u32:3 D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification. [11642.149771] 0 15282 2 0x00000000 [11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs] [11642.154074] ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0 [11642.156722] ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff [11642.159205] 0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541 [11642.161403] Call Trace: [11642.162129] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f [11642.163396] [<ffffffff8147b541>] schedule+0x82/0x9a [11642.164871] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109 [11642.167020] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f [11642.167931] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197 [11642.182320] [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf [11642.183762] [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33 [11642.185308] [<ffffffff810b0f61>] ? ktime_get+0x41/0x52 [11642.186782] [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102 [11642.188217] [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102 [11642.189626] [<ffffffff8147b814>] bit_wait_io+0x1b/0x39 [11642.190803] [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90 [11642.192158] [<ffffffff8111829f>] __lock_page+0x66/0x68 [11642.193379] [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a [11642.194831] [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs] [11642.197068] [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs] [11642.199188] [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs] [11642.200723] [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs] [11642.202465] [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs] [11642.203836] [<ffffffff811236bc>] do_writepages+0x23/0x2c [11642.205624] [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61 [11642.207057] [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15 [11642.208529] [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs] [11642.210375] [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs] [11642.212132] [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs] [11642.213837] [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs] [11642.215457] [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] [11642.217095] [<ffffffff8106483e>] process_one_work+0x256/0x48b [11642.218324] [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7 [11642.219466] [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289 [11642.220801] [<ffffffff8106a500>] kthread+0xd4/0xdc [11642.222032] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24 [11642.223190] [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70 [11642.224394] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24 [11642.226295] 2 locks held by kworker/u32:3/15282: [11642.227273] #0: ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b [11642.229412] #1: ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b [11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds. [11642.232872] Not tainted 4.4.0-rc6-btrfs-next-21+ #1 [11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [11642.235776] kworker/u32:8 D ffff88020de5f848 0 15289 2 0x00000000 [11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481) [11642.238670] ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0 [11642.240475] ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff [11642.242154] 0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541 [11642.243715] Call Trace: [11642.244390] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f [11642.245432] [<ffffffff8147b541>] schedule+0x82/0x9a [11642.246392] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109 [11642.247479] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f [11642.248551] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197 [11642.249968] [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf [11642.251043] [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33 [11642.252202] [<ffffffff810b0f61>] ? ktime_get+0x41/0x52 [11642.253210] [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102 [11642.254307] [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102 [11642.256118] [<ffffffff8147b814>] bit_wait_io+0x1b/0x39 [11642.257131] [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90 [11642.258200] [<ffffffff8111829f>] __lock_page+0x66/0x68 [11642.259168] [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a [11642.260516] [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs] [11642.261841] [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs] [11642.263531] [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs] [11642.264747] [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs] [11642.266148] [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs] [11642.267264] [<ffffffff811236bc>] do_writepages+0x23/0x2c [11642.268280] [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba [11642.269407] [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d [11642.270476] [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae [11642.271547] [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c [11642.272588] [<ffffffff81194821>] wb_workfn+0x201/0x341 [11642.273523] [<ffffffff81194821>] ? wb_workfn+0x201/0x341 [11642.274479] [<ffffffff8106483e>] process_one_work+0x256/0x48b [11642.275497] [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7 [11642.276518] [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289 [11642.277520] [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289 [11642.278517] [<ffffffff8106a500>] kthread+0xd4/0xdc [11642.279371] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24 [11642.280468] [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70 [11642.281607] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24 [11642.282604] 3 locks held by kworker/u32:8/15289: [11642.283423] #0: ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b [11642.285629] #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b [11642.287538] #2: (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b [11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds. [11642.290547] Not tainted 4.4.0-rc6-btrfs-next-21+ #1 [11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [11642.292864] fdm-stress D ffff88022c107c20 0 26848 26591 0x00000000 [11642.294118] ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0 [11642.295602] ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff [11642.297098] ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541 [11642.298433] Call Trace: [11642.298896] [<ffffffff8147b541>] schedule+0x82/0x9a [11642.299738] [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs] [11642.300833] [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44 [11642.301943] [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs] [11642.303270] [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs] [11642.304552] [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs] [11642.305782] [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs] [11642.306878] [<ffffffff8116e298>] __vfs_write+0x7c/0xa5 [11642.307729] [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8 [11642.308602] [<ffffffff8116efbb>] SyS_write+0x50/0x7e [11642.309410] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b [11642.310403] 3 locks held by fdm-stress/26848: [11642.311108] #0: (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40 [11642.312578] #1: (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0 [11642.314170] #2: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs] [11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds. [11642.317842] Not tainted 4.4.0-rc6-btrfs-next-21+ #1 [11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [11642.319959] fdm-stress D ffff8801964ffa68 0 26849 26591 0x00000000 [11642.321312] ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0 [11642.322555] ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002 [11642.323715] ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541 [11642.325096] Call Trace: [11642.325532] [<ffffffff8147b541>] schedule+0x82/0x9a [11642.326303] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109 [11642.327180] [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74 [11642.328114] [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a [11642.329051] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197 [11642.330053] [<ffffffff8147bceb>] __wait_for_common+0x109/0x147 [11642.330952] [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147 [11642.331869] [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a [11642.332925] [<ffffffff81074075>] ? wake_up_q+0x47/0x47 [11642.333736] [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26 [11642.334672] [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs] [11642.335858] [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs] [11642.336854] [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44 [11642.337820] [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs] [11642.339026] [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs] [11642.340214] [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs] [11642.341123] [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10 [11642.341934] [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4] [11642.342936] [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57 [11642.343772] [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d [11642.344673] [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc [11642.346024] [<ffffffff81186bbe>] ? __fget_light+0x62/0x71 [11642.346873] [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79 [11642.347720] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b [11642.350222] 4 locks held by fdm-stress/26849: [11642.350898] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0 [11642.352375] #1: (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs] [11642.354072] #2: (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs] [11642.355647] #3: (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs] [11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds. [11642.358508] Not tainted 4.4.0-rc6-btrfs-next-21+ #1 [11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [11642.368625] fdm-stress D ffff88021f167688 0 26850 26591 0x00000000 [11642.369716] ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0 [11642.370950] ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff [11642.372210] 0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541 [11642.373430] Call Trace: [11642.373853] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f [11642.374623] [<ffffffff8147b541>] schedule+0x82/0x9a [11642.375948] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109 [11642.376862] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f [11642.377637] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197 [11642.378610] [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf [11642.379457] [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33 [11642.380366] [<ffffffff810b0f61>] ? ktime_get+0x41/0x52 [11642.381353] [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102 [11642.382255] [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102 [11642.383162] [<ffffffff8147b814>] bit_wait_io+0x1b/0x39 [11642.383945] [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90 [11642.384875] [<ffffffff8111829f>] __lock_page+0x66/0x68 [11642.385749] [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a [11642.386721] [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs] [11642.387596] [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs] [11642.389030] [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs] [11642.389973] [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69 [11642.390939] [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs] [11642.392271] [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs] [11642.393305] [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs] [11642.394239] [<ffffffff811236bc>] do_writepages+0x23/0x2c [11642.395045] [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61 [11642.395991] [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15 [11642.397144] [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs] [11642.398392] [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs] [11642.399363] [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs] [11642.400445] [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54 [11642.401309] [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111 [11642.402213] [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24 [11642.403139] [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs] [11642.404360] [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs] [11642.406187] [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33 [11642.407070] [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33 [11642.407990] [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs] [11642.409192] [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs] [11642.410146] [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs] [11642.411291] [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1 [11642.412263] [<ffffffff8108ac05>] ? mark_lock+0x24/0x201 [11642.413057] [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d [11642.413897] [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2 [11642.414708] [<ffffffff8116ef3d>] SyS_read+0x50/0x7e [11642.415573] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b [11642.416572] 1 lock held by fdm-stress/26850: [11642.417345] #0: (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40 [11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds. [11642.419698] Not tainted 4.4.0-rc6-btrfs-next-21+ #1 [11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [11642.421807] fdm-stress D ffff880196483d28 0 26851 26591 0x00000000 [11642.422878] ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0 [11642.424149] ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740 [11642.425374] ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541 [11642.426591] Call Trace: [11642.427013] [<ffffffff8147b541>] schedule+0x82/0x9a [11642.427856] [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24 [11642.428852] [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4 [11642.429743] [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs] [11642.430911] [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs] [11642.432102] [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs] [11642.433259] [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs] [11642.434431] [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs] [11642.436079] [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs] [11642.437009] [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c [11642.437860] [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22 [11642.438723] [<ffffffff81171435>] iterate_supers+0x75/0xc2 [11642.439597] [<ffffffff81197d00>] sys_sync+0x52/0x80 [11642.440454] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b [11642.441533] 3 locks held by fdm-stress/26851: [11642.442370] #0: (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2 [11642.444043] #1: (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs] [11642.446010] #2: (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs] This happened because under specific timings the path for direct IO reads can deadlock with concurrent buffered writes. The diagram below shows how this happens for an example file that has the following layout: [ extent A ] [ extent B ] [ .... 0K 4K 8K CPU 1 CPU 2 CPU 3 DIO read against range [0K, 8K[ starts btrfs_direct_IO() --> calls btrfs_get_blocks_direct() which finds the extent map for the extent A and leaves the range [0K, 4K[ locked in the inode's io tree buffered write against range [4K, 8K[ starts __btrfs_buffered_write() --> dirties page at 4K a user space task calls sync for e.g or writepages() is invoked by mm writepages() run_delalloc_range() cow_file_range() --> ordered extent X for the buffered write is created and writeback starts --> calls btrfs_get_blocks_direct() again, without submitting first a bio for reading extent A, and finds the extent map for extent B --> calls lock_extent_direct() --> locks range [4K, 8K[ --> finds ordered extent X covering range [4K, 8K[ --> unlocks range [4K, 8K[ buffered write against range [0K, 8K[ starts __btrfs_buffered_write() prepare_pages() --> locks pages with offsets 0 and 4K lock_and_cleanup_extent_if_need() --> blocks attempting to lock range [0K, 8K[ in the inode's io tree, because the range [0, 4K[ is already locked by the direct IO task at CPU 1 --> calls btrfs_start_ordered_extent(oe X) btrfs_start_ordered_extent(oe X) --> At this point writeback for ordered extent X has not finished yet filemap_fdatawrite_range() btrfs_writepages() extent_writepages() extent_write_cache_pages() --> finds page with offset 0 with the writeback tag (and not dirty) --> tries to lock it --> deadlock, task at CPU 2 has the page locked and is blocked on the io range [0, 4K[ that was locked earlier by this task So fix this by falling back to a buffered read in the direct IO read path when an ordered extent for a buffered write is found. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
When using the same file as the source and destination for a dedup (extent_same ioctl) operation we were allowing it to dedup to a destination offset beyond the file's size, which doesn't make sense and it's not allowed for the case where the source and destination files are not the same file. This made de deduplication operation successful only when the source range corresponded to a hole, a prealloc extent or an extent with all bytes having a value of 0x00. This was also leaving a file hole (between i_size and destination offset) without the corresponding file extent items, which can be reproduced with the following steps for example: $ mkfs.btrfs -f /dev/sdi $ mount /dev/sdi /mnt/sdi $ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar wrote 404990/404990 bytes at offset 304457 395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec) $ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792 Deduping 2 total files (28672, 24576): /mnt/sdi/foobar (929792, 24576): /mnt/sdi/foobar 1 files asked to be deduped i: 0, status: 0, bytes_deduped: 24576 24576 total bytes deduped in this operation $ umount /mnt/sdi $ btrfsck /dev/sdi Checking filesystem on /dev/sdi UUID: 98c528aa-0833-427d-9403-b98032ffbf9d checking extents checking free space cache checking fs roots root 5 inode 257 errors 100, file extent discount Found file extent holes: start: 712704, len: 217088 found 540673 bytes used err is 1 total csum bytes: 400 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 123675 file data blocks allocated: 671744 referenced 671744 btrfs-progs v4.2.3 So fix this by not allowing the destination to go beyond the file's size, just as we do for the same where the source and destination files are not the same. A test for xfstests follows. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
We have two cases where we end up deleting a file at log replay time when we should not. For this to happen the file must have been renamed and a directory inode must have been fsynced/logged. Two examples that exercise these two cases are listed below. Case 1) $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir -p /mnt/a/b $ mkdir /mnt/c $ touch /mnt/a/b/foo $ sync $ mv /mnt/a/b/foo /mnt/c/ # Create file bar just to make sure the fsync on directory a/ does # something and it's not a no-op. $ touch /mnt/a/bar $ xfs_io -c "fsync" /mnt/a < power fail / crash > The next time the filesystem is mounted, the log replay procedure deletes file foo. Case 2) $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/a $ mkdir /mnt/b $ mkdir /mnt/c $ touch /mnt/a/foo $ ln /mnt/a/foo /mnt/b/foo_link $ touch /mnt/b/bar $ sync $ unlink /mnt/b/foo_link $ mv /mnt/b/bar /mnt/c/ $ xfs_io -c "fsync" /mnt/a/foo < power fail / crash > The next time the filesystem is mounted, the log replay procedure deletes file bar. The reason why the files are deleted is because when we log inodes other then the fsync target inode, we ignore their last_unlink_trans value and leave the log without enough information to later replay the rename operations. So we need to look at the last_unlink_trans values and fallback to a transaction commit if they are greater than the id of the last committed transaction. So fix this by looking at the last_unlink_trans values and fallback to transaction commits when needed. Also, when logging other inodes (for case 1 we logged descendants of the fsync target inode while for case 2 we logged ascendants) we need to care about concurrent tasks updating the last_unlink_trans of inodes we are logging (which was already an existing problem in check_parent_dirs_for_sync()). Since we can not acquire their inode mutex (vfs' struct inode ->i_mutex), as that causes deadlocks with other concurrent operations that acquire the i_mutex of 2 inodes (other fsyncs or renames for example), we need to serialize on the log_mutex of the inode we are logging. A task setting a new value for an inode's last_unlink_trans must acquire the inode's log_mutex and it must do this update before doing the actual unlink operation (which is already the case except when deleting a snapshot). Conversely the task logging the inode must first log the inode and then check the inode's last_unlink_trans value while holding its log_mutex, as if its value is not greater then the id of the last committed transaction it means it logged a safe state of the inode's items, while if its value is not smaller then the id of the last committed transaction it means the inode state it has logged might not be safe (the concurrent task might have just updated last_unlink_trans but hasn't done yet the unlink operation) and therefore a transaction commit must be done. Test cases for xfstests follow in separate patches. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
If we delete a snapshot, fsync its parent directory and crash/power fail before the next transaction commit, on the next mount when we attempt to replay the log tree of the root containing the parent directory we will fail and prevent the filesystem from mounting, which is solvable by wiping out the log trees with the btrfs-zero-log tool but very inconvenient as we will lose any data and metadata fsynced before the parent directory was fsynced. For example: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/testdir $ btrfs subvolume snapshot /mnt /mnt/testdir/snap $ btrfs subvolume delete /mnt/testdir/snap $ xfs_io -c "fsync" /mnt/testdir < crash / power failure and reboot > $ mount /dev/sdc /mnt mount: mount(2) failed: No such file or directory And in dmesg/syslog we get the following message and trace: [192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257 [192066.363010] ------------[ cut here ]------------ [192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]() [192066.367250] BTRFS: Transaction aborted (error -2) [192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G W 4.4.0-rc6-btrfs-next-20+ #1 [192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [192066.380889] 0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8 [192066.382561] ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe [192066.384191] ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710 [192066.385827] Call Trace: [192066.386373] [<ffffffff81257570>] dump_stack+0x4e/0x79 [192066.387387] [<ffffffff8104ec07>] warn_slowpath_common+0x99/0xb2 [192066.388429] [<ffffffffa039dc2c>] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs] [192066.389236] [<ffffffff8104ec68>] warn_slowpath_fmt+0x48/0x50 [192066.389884] [<ffffffffa039dc2c>] __btrfs_unlink_inode+0x17a/0x354 [btrfs] [192066.390621] [<ffffffff81184b55>] ? iput+0xb0/0x266 [192066.391200] [<ffffffffa039ea25>] btrfs_unlink_inode+0x1c/0x3d [btrfs] [192066.391930] [<ffffffffa03ca623>] check_item_in_log+0x1fe/0x29b [btrfs] [192066.392715] [<ffffffffa03ca827>] replay_dir_deletes+0x167/0x1cf [btrfs] [192066.393510] [<ffffffffa03cccc7>] replay_one_buffer+0x417/0x570 [btrfs] [192066.394241] [<ffffffffa03ca164>] walk_up_log_tree+0x10e/0x1dc [btrfs] [192066.394958] [<ffffffffa03cac72>] walk_log_tree+0xa5/0x190 [btrfs] [192066.395628] [<ffffffffa03ce8b8>] btrfs_recover_log_trees+0x239/0x32c [btrfs] [192066.396790] [<ffffffffa03cc8b0>] ? replay_one_extent+0x50a/0x50a [btrfs] [192066.397891] [<ffffffffa0394041>] open_ctree+0x1d8b/0x2167 [btrfs] [192066.398897] [<ffffffffa03706e1>] btrfs_mount+0x5ef/0x729 [btrfs] [192066.399823] [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf [192066.400739] [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3 [192066.401700] [<ffffffff811714b9>] mount_fs+0x67/0x131 [192066.402482] [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde [192066.403930] [<ffffffffa03702bd>] btrfs_mount+0x1cb/0x729 [btrfs] [192066.404831] [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf [192066.405726] [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3 [192066.406621] [<ffffffff811714b9>] mount_fs+0x67/0x131 [192066.407401] [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde [192066.408247] [<ffffffff8118ae36>] do_mount+0x893/0x9d2 [192066.409047] [<ffffffff8113009b>] ? strndup_user+0x3f/0x8c [192066.409842] [<ffffffff8118b187>] SyS_mount+0x75/0xa1 [192066.410621] [<ffffffff8147e517>] entry_SYSCALL_64_fastpath+0x12/0x6b [192066.411572] ---[ end trace 2de42126c1e0a0f0 ]--- [192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry [192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree) [192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30 [192066.444613] BTRFS: open_ctree failed This happens because when we are replaying the log and processing the directory entry pointing to the snapshot in the subvolume tree, we treat its btrfs_dir_item item as having a location with a key type matching BTRFS_INODE_ITEM_KEY, which is wrong because the type matches BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the object id refers to a root number and not to an inode in the root containing the parent directory. So fix this by triggering a transaction commit if an fsync against the parent directory is requested after deleting a snapshot. This is the simplest approach for a rare use case. Some alternative that avoids the transaction commit would require more code to explicitly delete the snapshot at log replay time (factoring out common code from ioctl.c: btrfs_ioctl_snap_destroy()), special care at fsync time to remove the log tree of the snapshot's root from the log root of the root of tree roots, amongst other steps. A test case for xfstests that triggers the issue follows. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_dm_target flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create a snapshot at the root of our filesystem (mount point path), delete it, # fsync the mount point path, crash and mount to replay the log. This should # succeed and after the filesystem is mounted the snapshot should not be visible # anymore. _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1 _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT _flakey_drop_and_remount [ -e $SCRATCH_MNT/snap1 ] && \ echo "Snapshot snap1 still exists after log replay" # Similar scenario as above, but this time the snapshot is created inside a # directory and not directly under the root (mount point path). mkdir $SCRATCH_MNT/testdir _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2 _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir _flakey_drop_and_remount [ -e $SCRATCH_MNT/testdir/snap2 ] && \ echo "Snapshot snap2 still exists after log replay" _unmount_flakey echo "Silence is golden" status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Chris Mason authored
Merge tag 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6 Btrfs patchsets for 4.6
-
- 28 Feb, 2016 8 commits
-
-
Linus Torvalds authored
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull perf fixes from Thomas Gleixner: "A rather largish series of 12 patches addressing a maze of race conditions in the perf core code from Peter Zijlstra" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Robustify task_function_call() perf: Fix scaling vs. perf_install_in_context() perf: Fix scaling vs. perf_event_enable() perf: Fix scaling vs. perf_event_enable_on_exec() perf: Fix ctx time tracking by introducing EVENT_TIME perf: Cure event->pending_disable race perf: Fix race between event install and jump_labels perf: Fix cloning perf: Only update context time when active perf: Allow perf_release() with !event->ctx perf: Do not double free perf: Close install vs. exit race
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull x86 fixes from Thomas Gleixner: "This update contains: - Hopefully the last ASM CLAC fixups - A fix for the Quark family related to the IMR lock which makes kexec work again - A off-by-one fix in the MPX code. Ironic, isn't it? - A fix for X86_PAE which addresses once more an unsigned long vs phys_addr_t hickup" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mpx: Fix off-by-one comparison with nr_registers x86/mm: Fix slow_virt_to_phys() for X86_PAE again x86/entry/compat: Add missing CLAC to entry_INT80_32 x86/entry/32: Add an ASM_CLAC to entry_SYSENTER_32 x86/platform/intel/quark: Change the kernel's IMR lock bit to false
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull scheduler fixlet from Thomas Gleixner: "A trivial printk typo fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix trivial typo in printk() message
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull irq fixes from Thomas Gleixner: "Four small fixes for irqchip drivers: - Add missing low level irq handler initialization on mxs, so interrupts can acutally be delivered - Add a missing barrier to the GIC driver - Two fixes for the GIC-V3-ITS driver, addressing a double EOI write and a cache flush beyond the actual region" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3: Add missing barrier to 32bit version of gic_read_iar() irqchip/mxs: Add missing set_handle_irq() irqchip/gicv3-its: Avoid cache flush beyond ITS_BASERn memory size irqchip/gic-v3-its: Fix double ICC_EOIR write for LPI in EOImode==1
-
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/stagingLinus Torvalds authored
Pull staging/android fix from Greg KH: "Here is one patch, for the android binder driver, to resolve a reported problem. Turns out it has been around for a while (since 3.15), so it is good to finally get it resolved. It has been in linux-next for a while with no reported issues" * tag 'staging-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: drivers: android: correct the size of struct binder_uintptr_t for BC_DEAD_BINDER_DONE
-
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usbLinus Torvalds authored
Pull USB fixes from Greg KH: "Here are a few USB fixes for 4.5-rc6 They fix a reported bug for some USB 3 devices by reverting the recent patch, a MAINTAINERS change for some drivers, some new device ids, and of course, the usual bunch of USB gadget driver fixes. All have been in linux-next for a while with no reported issues" * tag 'usb-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: MAINTAINERS: drop OMAP USB and MUSB maintainership usb: musb: fix DMA for host mode usb: phy: msm: Trigger USB state detection work in DRD mode usb: gadget: net2280: fix endpoint max packet for super speed connections usb: gadget: gadgetfs: unregister gadget only if it got successfully registered usb: gadget: remove driver from pending list on probe error Revert "usb: hub: do not clear BOS field during reset device" usb: chipidea: fix return value check in ci_hdrc_pci_probe() usb: chipidea: error on overflow for port_test_write USB: option: add "4G LTE usb-modem U901" USB: cp210x: add IDs for GE B650V3 and B850V3 boards USB: option: add support for SIM7100E usb: musb: Fix DMA desired mode for Mentor DMA engine usb: gadget: fsl_qe_udc: fix IS_ERR_VALUE usage usb: dwc2: USB_DWC2 should depend on HAS_DMA usb: dwc2: host: fix the data toggle error in full speed descriptor dma usb: dwc2: host: fix logical omissions in dwc2_process_non_isoc_desc usb: dwc3: Fix assignment of EP transfer resources usb: dwc2: Add extra delay when forcing dr_mode
-
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds authored
Pull vfs fixes from Al Viro. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: do_last(): ELOOP failure exit should be done after leaving RCU mode should_follow_link(): validate ->d_seq after having decided to follow namei: ->d_inode of a pinned dentry is stable only for positives do_last(): don't let a bogus return value from ->open() et.al. to confuse us fs: return -EOPNOTSUPP if clone is not supported hpfs: don't truncate the file when delete fails
-