- 14 Feb, 2015 4 commits
-
-
Josef Bacik authored
We have this weird dance where we always inc outstanding_extents when we do a O_DIRECT write, even if we allocate the entire range. To get around this we also drop the metadata space if we successfully write. This is an unnecessary dance, we only need to jack up outstanding_extents if we don't satisfy the entire range request in get_blocks_direct, otherwise we are good using our original reservation. So drop the unconditional inc and the drop of the metadata space that we have for the unconditional inc. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
Btrfs will report NO_SPACE when we create and remove files for several times, and we can't write to filesystem until mount it again. Steps to reproduce: 1: Create a single-dev btrfs fs with default option 2: Write a file into it to take up most fs space 3: Delete above file 4: Wait about 100s to let chunk removed 5: goto 2 Script is like following: #!/bin/bash # Recommend 1.2G space, too large disk will make test slow DEV="/dev/sda16" MNT="/mnt/tmp" dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2 file_size_m=$((dev_size * 75 / 100 / 1024 / 1024)) echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev" for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done echo "mkfs $DEV" mkfs.btrfs -f "$DEV" >/dev/null || exit 2 echo "mount $DEV $MNT" mount "$DEV" "$MNT" || exit 2 for ((loop_i = 0; loop_i < 20; loop_i++)); do echo echo "loop $loop_i" echo "dd file..." cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m") "${cmd[@]}" 2>/dev/null || { # NO_SPACE error triggered echo "dd failed: ${cmd[*]}" exit 1 } echo "rm file..." rm -f "$MNT"/file0 || exit 2 for ((i = 0; i < 10; i++)); do df "$MNT" | tail -1 sleep 10 done done Reason: It is triggered by commit: 47ab2a6c which is used to remove empty block groups automatically, but the reason is not in that patch. Code before works well because btrfs don't need to create and delete chunks so many times with high complexity. Above bug is caused by many reason, any of them can trigger it. Reason1: When we remove some continuous chunks but leave other chunks after, these disk space should be used by chunk-recreating, but in current code, only first create will successed. Fixed by Forrest Liu <forrestl@synology.com> in: Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole Reason2: contains_pending_extent() return wrong value in calculation. Fixed by Forrest Liu <forrestl@synology.com> in: Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole Reason3: btrfs_check_data_free_space() try to commit transaction and retry allocating chunk when the first allocating failed, but space_info->full is set in first allocating, and prevent second allocating in retry. Fixed in this patch by clear space_info->full in commit transaction. Tested for severial times by above script. Changelog v3->v4: use light weight int instead of atomic_t to record have_remove_bgs in transaction, suggested by: Josef Bacik <jbacik@fb.com> Changelog v2->v3: v2 fixed the bug by adding more commit-transaction, but we only need to reclaim space when we are really have no space for new chunk, noticed by: Filipe David Manana <fdmanana@gmail.com> Actually, our code already have this type of commit-and-retry, we only need to make it working with removed-bgs. v3 fixed the bug with above way. Changelog v1->v2: v1 will introduce a new bug when delete and create chunk in same disk space in same transaction, noticed by: Filipe David Manana <fdmanana@gmail.com> V2 fix this bug by commit transaction after remove block grops. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Suggested-by: Filipe David Manana <fdmanana@gmail.com> Suggested-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
My previous patch "Btrfs: fix scrub race leading to use-after-free" introduced the possibility to sleep in an atomic context, which happens when the scrub_lock mutex is held at the time scrub_pending_bio_dec() is called - this function can be called under an atomic context. Chris ran into this in a debug kernel which gave the following trace: [ 1928.950319] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:621 [ 1928.967334] in_atomic(): 1, irqs_disabled(): 0, pid: 149670, name: fsstress [ 1928.981324] INFO: lockdep is turned off. [ 1928.989244] CPU: 24 PID: 149670 Comm: fsstress Tainted: G W 3.19.0-rc7-mason+ #41 [ 1929.006418] Hardware name: ZTSYSTEMS Echo Ridge T4 /A9DRPF-10D, BIOS 1.07 05/10/2012 [ 1929.022207] ffffffff81a22cf8 ffff881076e03b78 ffffffff816b8dd9 ffff881076e03b78 [ 1929.037267] ffff880d8e828710 ffff881076e03ba8 ffffffff810856c4 ffff881076e03bc8 [ 1929.052315] 0000000000000000 000000000000026d ffffffff81a22cf8 ffff881076e03bd8 [ 1929.067381] Call Trace: [ 1929.072344] <IRQ> [<ffffffff816b8dd9>] dump_stack+0x4f/0x6e [ 1929.083968] [<ffffffff810856c4>] ___might_sleep+0x174/0x230 [ 1929.095352] [<ffffffff810857d2>] __might_sleep+0x52/0x90 [ 1929.106223] [<ffffffff816bb68f>] mutex_lock_nested+0x2f/0x3b0 [ 1929.117951] [<ffffffff810ab37d>] ? trace_hardirqs_on+0xd/0x10 [ 1929.129708] [<ffffffffa05dc838>] scrub_pending_bio_dec+0x38/0x70 [btrfs] [ 1929.143370] [<ffffffffa05dd0e0>] scrub_parity_bio_endio+0x50/0x70 [btrfs] [ 1929.157191] [<ffffffff812fa603>] bio_endio+0x53/0xa0 [ 1929.167382] [<ffffffffa05f96bc>] rbio_orig_end_io+0x7c/0xa0 [btrfs] [ 1929.180161] [<ffffffffa05f97ba>] raid_write_parity_end_io+0x5a/0x80 [btrfs] [ 1929.194318] [<ffffffff812fa603>] bio_endio+0x53/0xa0 [ 1929.204496] [<ffffffff8130401b>] blk_update_request+0x1eb/0x450 [ 1929.216569] [<ffffffff81096e58>] ? trigger_load_balance+0x78/0x500 [ 1929.229176] [<ffffffff8144c74d>] scsi_end_request+0x3d/0x1f0 [ 1929.240740] [<ffffffff8144ccac>] scsi_io_completion+0xac/0x5b0 [ 1929.252654] [<ffffffff81441c50>] scsi_finish_command+0xf0/0x150 [ 1929.264725] [<ffffffff8144d317>] scsi_softirq_done+0x147/0x170 [ 1929.276635] [<ffffffff8130ace6>] blk_done_softirq+0x86/0xa0 [ 1929.288014] [<ffffffff8105d92e>] __do_softirq+0xde/0x600 [ 1929.298885] [<ffffffff8105df6d>] irq_exit+0xbd/0xd0 (...) Fix this by using a reference count on the scrub context structure instead of locking the scrub_lock mutex. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
We try to lock a mutex while the current task state is not TASK_RUNNING, which results in the following warning when CONFIG_DEBUG_LOCK_ALLOC=y: [30736.772501] ------------[ cut here ]------------ [30736.774545] WARNING: CPU: 9 PID: 19972 at kernel/sched/core.c:7300 __might_sleep+0x8b/0xa8() [30736.783453] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff8107499b>] prepare_to_wait+0x43/0x89 [30736.786261] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc psmouse parport pcspkr microcode serio_raw evdev processor thermal_sys i2c_piix4 i2c_core button ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring e1000 virtio scsi_mod [30736.794323] CPU: 9 PID: 19972 Comm: fsstress Not tainted 3.19.0-rc7-btrfs-next-5+ #1 [30736.795821] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [30736.798788] 0000000000000009 ffff88042743fbd8 ffffffff814248ed ffff88043d32f2d8 [30736.800504] ffff88042743fc28 ffff88042743fc18 ffffffff81045338 0000000000000001 [30736.802131] ffffffff81064514 ffffffff817c52d1 000000000000026d 0000000000000000 [30736.803676] Call Trace: [30736.804256] [<ffffffff814248ed>] dump_stack+0x4c/0x65 [30736.805245] [<ffffffff81045338>] warn_slowpath_common+0xa1/0xbb [30736.806360] [<ffffffff81064514>] ? __might_sleep+0x8b/0xa8 [30736.807391] [<ffffffff81045398>] warn_slowpath_fmt+0x46/0x48 [30736.808511] [<ffffffff8107499b>] ? prepare_to_wait+0x43/0x89 [30736.809620] [<ffffffff8107499b>] ? prepare_to_wait+0x43/0x89 [30736.810691] [<ffffffff81064514>] __might_sleep+0x8b/0xa8 [30736.811703] [<ffffffff81426eaf>] mutex_lock_nested+0x2f/0x3a0 [30736.812889] [<ffffffff8107bfa1>] ? trace_hardirqs_on_caller+0x18f/0x1ab [30736.814138] [<ffffffff8107bfca>] ? trace_hardirqs_on+0xd/0xf [30736.819878] [<ffffffffa038cfff>] wait_for_writer.isra.12+0x91/0xaa [btrfs] [30736.821260] [<ffffffff810748bd>] ? signal_pending_state+0x31/0x31 [30736.822410] [<ffffffffa0391f0a>] btrfs_sync_log+0x160/0x947 [btrfs] [30736.823574] [<ffffffff8107bfa1>] ? trace_hardirqs_on_caller+0x18f/0x1ab [30736.824847] [<ffffffff8107bfca>] ? trace_hardirqs_on+0xd/0xf [30736.825972] [<ffffffffa036e555>] btrfs_sync_file+0x2b0/0x319 [btrfs] [30736.827684] [<ffffffff8117901a>] vfs_fsync_range+0x21/0x23 [30736.828932] [<ffffffff81179038>] vfs_fsync+0x1c/0x1e [30736.829917] [<ffffffff8117928b>] do_fsync+0x34/0x4e [30736.830862] [<ffffffff811794b3>] SyS_fsync+0x10/0x14 [30736.831819] [<ffffffff8142a512>] system_call_fastpath+0x12/0x17 [30736.832982] ---[ end trace c0b57df60d32ae5c ]--- Fix this my acquiring the mutex after calling finish_wait(), which sets the task's state to TASK_RUNNING. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
-
- 03 Feb, 2015 13 commits
-
-
Satoru Takeuchi authored
"notused" is not necessary. Set 1 to the first entry is enough. Signed-off-by: Takeuchi Satoru <takeuchi_satoru@jp.fujitsu.com Cc: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
Gui Hecheng authored
o removed an unecessary INIT_LIST_HEAD after LIST_HEAD o merge a declare & INIT_LIST_HEAD pair into one LIST_HEAD Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
Shaohua Li authored
Below test will fail currently: mkfs.ext4 -F /dev/sda btrfs-convert /dev/sda mount /dev/sda /mnt btrfs device add -f /dev/sdb /mnt btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt The reason is there are some block groups with usage 0, but the whole disk hasn't free space to allocate new chunk, so we even can't set such block group readonly. This patch deletes the chunk allocation when setting block group ro. For META, we already have reserve. But for SYSTEM, we don't have, so the check_system_chunk is still required. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Naohiro Aota authored
After submit_one_bio(), `bio' can go away. However submit_extent_page() leave `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM). It will cause invalid paging request when submit_extent_page() is called next time. I reproduced ENOMEM case with the following script (need CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS). #!/bin/bash dmesgout=dmesg.txt start=100000 end=300000 step=1000 # btrfs options device=/dev/vdb1 directory=/mnt/btrfs # fault-injection options percent=100 times=3 mkdir -p $directory || exit 1 mount -o compress $device $directory || exit 1 rm -f $directory/file || exit 1 dd if=/dev/zero of=$directory/file bs=1M count=512 || exit 1 for interval in `seq $start $step $end`; do dmesg -C echo 1 > /proc/sys/vm/drop_caches sync export FAILCMD_TYPE=fail_page_alloc ./failcmd.sh -p $percent -t $times -i $interval \ --ignore-gfp-highmem=N --ignore-gfp-wait=N --min-order=0 \ -- \ cat $directory/file > /dev/null dmesg > ${dmesgout} if grep -q BUG: ${dmesgout}; then cat ${dmesgout} exit 1 fi done umount $directory exit 0 Signed-off-by: Naohiro Aota <naota@elisp.net> Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got the following trace: [68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50 [68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122 [68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060 [68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo [68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4 [68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs] [68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000 [68127.807663] RIP: 0010:[<ffffffff8107da31>] [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122 [68127.807663] RSP: 0018:ffff8803f097fbb8 EFLAGS: 00010093 [68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854 [68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001 [68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001 [68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390 [68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00 [68127.807663] FS: 0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000 [68127.807663] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0 [68127.807663] Stack: [68127.807663] ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000 [68127.807663] ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 000000000034f314 [68127.807663] ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50 [68127.807663] Call Trace: [68127.807663] [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55 [68127.807663] [<ffffffff81072948>] ? __wake_up+0x22/0x4b [68127.807663] [<ffffffff81072948>] __wake_up+0x22/0x4b [68127.807663] [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs] [68127.807663] [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs] [68127.807663] [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28 [68127.807663] [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9 [68127.807663] [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs] [68127.807663] [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs] [68127.807663] [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6 [68127.807663] [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf [68127.807663] [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8 [68127.807663] [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219 [68127.807663] [<ffffffff8105cd75>] kthread+0xdb/0xe3 [68127.807663] [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67 [68127.807663] [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0 [68127.807663] [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67 [68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75 [68127.807663] RIP [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122 [68127.807663] RSP <ffff8803f097fbb8> [68127.807663] CR2: ffff8803f8947a50 [68127.807663] ---[ end trace d7045aac00a66cd8 ]--- This is due to a race that can happen in a very tiny time window and is illustrated by the following sequence diagram: CPU 1 CPU 2 btrfs_scrub_dev() scrub_bio_end_io_worker() scrub_pending_bio_dec() atomic_dec(&sctx->bios_in_flight) wait sctx->bios_in_flight == 0 wait sctx->workers_pending == 0 mutex_lock(&fs_info->scrub_lock) (...) mutex_lock(&fs_info->scrub_lock) scrub_free_ctx(sctx) kfree(sctx) wake_up(&sctx->list_wait) __wake_up() spin_lock_irqsave(&sctx->list_wait->lock, flags) Another variation of this scenario that results in the same use-after-free issue is: CPU 1 CPU 2 btrfs_scrub_dev() wait sctx->bios_in_flight == 0 scrub_bio_end_io_worker() scrub_pending_bio_dec() __wake_up(&sctx->list_wait) spin_lock_irqsave(&sctx->list_wait->lock, flags) default_wake_function() wake up task at CPU 2 wait sctx->workers_pending == 0 mutex_lock(&fs_info->scrub_lock) (...) mutex_lock(&fs_info->scrub_lock) scrub_free_ctx(sctx) kfree(sctx) spin_unlock_irqrestore(&sctx->list_wait->lock, flags) Fix this by holding the scrub lock while doing the wakeup. This isn't a recent regression, the issue as been around since the scrub feature was added (2011, commit a2de733c). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
If we failed during initialization of sysfs, we weren't unregistering the top level btrfs sysfs entry nor the debugfs stuff. Not unregistering the top level sysfs entry makes future attempts to reload the btrfs module impossible and the following is reported in dmesg: [ 2246.451296] WARNING: CPU: 3 PID: 10999 at fs/sysfs/dir.c:486 sysfs_warn_dup+0x91/0xb0() [ 2246.451298] sysfs: cannot create duplicate filename '/fs/btrfs' [ 2246.451298] Modules linked in: btrfs(+) raid6_pq xor bnep rfcomm bluetooth binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc parport_pc parport psmouse serio_raw pcspkr evbug i2c_piix4 e1000 floppy [last unloaded: btrfs] [ 2246.451310] CPU: 3 PID: 10999 Comm: modprobe Tainted: G W 3.13.0-fdm-btrfs-next-24+ #7 [ 2246.451311] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 2246.451312] 0000000000000009 ffff8800d353fa08 ffffffff816f1da6 0000000000000410 [ 2246.451314] ffff8800d353fa58 ffff8800d353fa48 ffffffff8104a32c ffff88020821a290 [ 2246.451316] ffff88020821a290 ffff88020821a290 ffff8802148f0000 ffff8800d353fb80 [ 2246.451318] Call Trace: [ 2246.451322] [<ffffffff816f1da6>] dump_stack+0x4e/0x68 [ 2246.451324] [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0 [ 2246.451325] [<ffffffff8104a416>] warn_slowpath_fmt+0x46/0x50 [ 2246.451328] [<ffffffff81367dc5>] ? strlcat+0x65/0x90 (....) This fixes the following change: btrfs: add simple debugfs interface commit 1bae3098Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
Committing a transaction can race with automatic removal of empty block groups (cleaner kthread), leading to a BUG_ON() in the transaction commit code while running btrfs_finish_extent_commit(). The following sequence diagram shows how it can happen: CPU 1 CPU 2 btrfs_commit_transaction() fs_info->running_transaction = NULL btrfs_finish_extent_commit() find_first_extent_bit() -> found range for block group X in fs_info->freed_extents[] btrfs_delete_unused_bgs() -> found block group X Removed block group X's range from fs_info->freed_extents[] btrfs_remove_chunk() btrfs_remove_block_group(bg X) unpin_extent_range(bg X range) btrfs_lookup_block_group(bg X) -> returns NULL -> BUG_ON() The trace that results from the BUG_ON() is: [48665.187808] ------------[ cut here ]------------ [48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675! [48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC [48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode [48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1 [48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000 [48665.197388] RIP: 0010:[<ffffffffa0350d05>] [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs] [48665.197388] RSP: 0018:ffff8801b56a7b88 EFLAGS: 00010246 [48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8 [48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0 [48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000 [48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000 [48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000 [48665.197388] FS: 0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000 [48665.197388] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0 [48665.197388] Stack: [48665.197388] ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff [48665.197388] ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0 [48665.197388] ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227 [48665.197388] Call Trace: [48665.197388] [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs] [48665.197388] [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs] [48665.197388] [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs] [48665.197388] [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33 [48665.197388] [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs] [48665.197388] [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab [48665.197388] [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab [48665.197388] [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf [48665.197388] [<ffffffff8105a55b>] worker_thread+0x210/0x2d0 [48665.197388] [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3 [48665.197388] [<ffffffff8105e5c0>] kthread+0xef/0xf7 [48665.197388] [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39 [48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad [48665.197388] [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0 [48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad [48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b [48665.197388] RIP [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs] [48665.197388] RSP <ffff8801b56a7b88> [48665.272246] ---[ end trace b9c6ab9957521376 ]--- Fix this by ensuring that unpining the block group's range in btrfs_finish_extent_commit() is done in a synchronized fashion with removing the block group's range from freed_extents[] in btrfs_delete_unused_bgs() This race got introduced with the change: Btrfs: remove empty block groups automatically commit 47ab2a6cSigned-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
David Sterba authored
Verify that the sys_array has enough bytes to read the next item. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
David Sterba authored
There's a pointer to buffer, integer offset and offset passed as pointer, try to find matching names for them. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
David Sterba authored
Verify that possible minimum and maximum size is set, validity of contents is checked in btrfs_read_sys_array. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
David Sterba authored
I received a few crafted images from Jiri, all got through the recently added superblock checks. The lower bounds checks for num_devices and sector/node -sizes were missing and caused a crash during mount. Tools for symbolic code execution were used to prepare the images contents. Reported-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
chandan r authored
This patch adds a new member to the 'struct btrfs_inode' structure to hold the file creation time. Signed-off-by: chandan <chandanrmail@gmail.com> [refreshed, removed btrfs_inode_otime] Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
David Sterba authored
They just opencode taking address of the timespec member. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
- 22 Jan, 2015 23 commits
-
-
chandan authored
btrfs_alloc_tree_block() returns an extent buffer on which a blocked lock has been taken. Hence assign the appropriate value to path->locks[level]. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Anand Jain authored
There isn't any real use of following members of struct btrfs_root so delete them. struct kobject root_kobj; struct completion kobj_unregister; Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
Yang Dongsheng authored
In function qgroup_excl_accounting(), we need to WARN when qg->excl is less than what we want to free, same to child and parents. But currently, for parent qgroup, the WARN_ON() is located after freeing qg->excl. It will WARN out even we free it normally. This patch move this WARN_ON() before freeing qg->excl. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Liu Bo authored
"run_most" is not used anymore. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
refs is better than ref_count to record a struct's ref count. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Suggested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
So we can check raid56 with: (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) instead of long: (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
Corrent code use many kinds of "clever" way to determine operation target's raid type, as: raid_map != NULL or raid_map[MAX_NR] == RAID[56]_Q_STRIPE To make code easy to maintenance, this patch put raid type into bbio, and we can always get raid type from bbio with a "stupid" way. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
scrub_setup_recheck_block() have many arguments but most of them can be get from one of them, we can remove them to make code clean. Some other cleanup for that function also included in this patch. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
The code are similar, combine them to make code clean and easy to maintenance. Some lost condition are also completed with benefit of this combination. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
Btrfs: Separate finding-right-mirror and writing-to-target's process in scrub_handle_errored_block() In corrent code, code of finding-right-mirror and writing-to-target are mixed in logic, if we find a right mirror but failed in writing to target, it will treat as "hadn't found right block", and fill the target with sblock_bad. Actually, "failed in writing to target" does not mean "source block is wrong", this patch separate above two condition in logic, and do some cleanup to make code clean. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
Use break instead of useless loop should be more suitable in this case. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
1: Remove no-need DEFINE_WAIT(wait) 2: Add likely() for BTRFS_FS_STATE_DEV_REPLACING condition 3: Use while loop instead of goto Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
It is always 1 in this place, because !1 case was already jumped out in previous code. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
if (sctx->is_dev_replace && !is_metadata && !have_csum) { ... goto nodatasum_case; } ... nodatasum_case: WARN_ON(sctx->is_dev_replace); In above code, nodatasum_case marker should be moved after WARN_ON(). Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag to keep btrfs_bio's memory in raid56 recovery implement. 2: free function for bbio will make code clean and flexible, plus forced data type checking in compile. Changelog v1->v2: Rename following by David Sterba's suggestion: put_btrfs_bio() -> btrfs_put_bio() get_btrfs_bio() -> btrfs_get_bio() bbio->ref_count -> bbio->refs Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
It can make code more simple and clear, we need not care about free bbio and raid_map together. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
It can avoid complex calculation of real stripes in sort, moreover, we can clean up code of sorting tgtdev_map because it will be in order initially. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
We add the number of stripes on target devices into bbio->num_stripes if we are under device replacement, and we just sort the raid_map of those stripes that not on the target devices, so if when we need real raid_map, we need skip the stripes on the target devices. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
If we have an inode with a large number of hard links, some of which may be extrefs, turn a regular ref into an extref, fsync the inode and then replay the fsync log (after a crash/reboot), we can endup with an fsync log that makes the replay code always fail with -EOVERFLOW when processing the inode's references. This is easy to reproduce with the test case I made for xfstests. Its steps are the following: _scratch_mkfs "-O extref" >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create a test file with 3001 hard links. This number is large enough to # make btrfs start using extrefs at some point even if the fs has the maximum # possible leaf/node size (64Kb). echo "hello world" > $SCRATCH_MNT/foo for i in `seq 1 3000`; do ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i` done # Make sure all metadata and data are durably persisted. sync # Now remove one link, add a new one with a new name, add another new one with # the same name as the one we just removed and fsync the inode. rm -f $SCRATCH_MNT/foo_link_0001 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001 rm -f $SCRATCH_MNT/foo_link_0002 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Simulate a crash/power loss. This makes sure the next mount # will see an fsync log and will replay that log. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Check that the number of hard links is correct, we are able to remove all # the hard links and read the file's data. This is just to verify we don't # get stale file handle errors (due to dangling directory index entries that # point to inodes that no longer exist). echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)" [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing" for ((i = 1; i <= 3003; i++)); do name=foo_link_`printf "%04d" $i` if [ $i -eq 2 ]; then [ -f $SCRATCH_MNT/$name ] && echo "Link $name found" else [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing" fi done rm -f $SCRATCH_MNT/foo_link_* cat $SCRATCH_MNT/foo rm -f $SCRATCH_MNT/foo status=0 exit The fix is simply to correct the overflow condition when overwriting a reference item because it was wrong, trying to increase the item in the fs/subvol tree by an impossible amount. Also ensure that we don't insert one normal ref and one ext ref for the same dentry - this happened because processing a dir index entry from the parent in the log happened when the normal ref item was full, which made the logic insert an extref and later when the normal ref had enough room, it would be inserted again when processing the ref item from the child inode in the log. This issue has been present since the introduction of the extrefs feature (2012). A test case for xfstests follows soon. This test only passes if the previous patch titled "Btrfs: fix fsync when extend references are added to an inode" is applied too. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
If we added an extended reference to an inode and fsync'ed it, the log replay code would make our inode have an incorrect link count, which was lower then the expected/correct count. This resulted in stale directory index entries after deleting some of the hard links, and any access to the dangling directory entries resulted in -ESTALE errors because the entries pointed to inode items that don't exist anymore. This is easy to reproduce with the test case I made for xfstests, and the bulk of that test is: _scratch_mkfs "-O extref" >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create a test file with 3001 hard links. This number is large enough to # make btrfs start using extrefs at some point even if the fs has the maximum # possible leaf/node size (64Kb). echo "hello world" > $SCRATCH_MNT/foo for i in `seq 1 3000`; do ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i` done # Make sure all metadata and data are durably persisted. sync # Add one more link to the inode that ends up being a btrfs extref and fsync # the inode. ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Simulate a crash/power loss. This makes sure the next mount # will see an fsync log and will replay that log. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Now after the fsync log replay btrfs left our inode with a wrong link count N, # which was smaller than the correct link count M (N < M). # So after removing N hard links, the remaining M - N directory entries were # still visible to user space but it was impossible to do anything with them # because they pointed to an inode that didn't exist anymore. This resulted in # stale file handle errors (-ESTALE) when accessing those dentries for example. # # So remove all hard links except the first one and then attempt to read the # file, to verify we don't get an -ESTALE error when accessing the inodel # # The btrfs fsck tool also detected the incorrect inode link count and it # reported an error message like the following: # # root 5 inode 257 errors 2001, no inode item, link count wrong # unresolved ref dir 256 index 2978 namelen 13 name foo_link_2976 filetype 1 errors 4, no inode ref # # The fstests framework automatically calls fsck after a test is run, so we # don't need to call fsck explicitly here. rm -f $SCRATCH_MNT/foo_link_* cat $SCRATCH_MNT/foo status=0 exit So make sure an fsync always flushes the delayed inode item, so that the fsync log contains it (needed in order to trigger the link count fixup code) and fix the extref counting function, which always return -ENOENT to its caller (and made it assume there were always 0 extrefs). This issue has been present since the introduction of the extrefs feature (2012). A test case for xfstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
If we have an inode (file) with a link count greater than 1, remove one of its hard links, fsync the inode, power fail/crash and then replay the fsync log on the next mount, we end up getting the parent directory's metadata inconsistent - its i_size still reflects the deleted hard link and has dangling index entries (with no matching inode reference entries). This prevents the directory from ever being deletable, as its i_size can never decrease to BTRFS_EMPTY_DIR_SIZE even if all of its children inodes are deleted, and the dangling index entries can never be removed (as they point to an inode that does not exist anymore). This is easy to reproduce with the following excerpt from the test case for xfstests that I just made: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create a test file with 2 hard links in the same directory. mkdir -p $SCRATCH_MNT/a/b echo "hello world" > $SCRATCH_MNT/a/b/foo ln $SCRATCH_MNT/a/b/foo $SCRATCH_MNT/a/b/bar # Make sure all metadata and data are durably persisted. sync # Now remove one of the hard links and fsync the inode. rm -f $SCRATCH_MNT/a/b/bar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/b/foo # Simulate a crash/power loss. This makes sure the next mount # will see an fsync log and will replay that log. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Remove the last hard link of the file and attempt to remove its parent # directory - this failed in btrfs because the fsync log and replay code # didn't decrement the parent directory's i_size and left dangling directory # index entries - this made the btrfs rmdir implementation always fail with # the error -ENOTEMPTY. # # The dangling directory index entries were visible to user space, but it was # impossible to do anything on them (unlink, open, read, write, stat, etc) # because the inode they pointed to did not exist anymore. # # The parent directory's metadata inconsistency (stale index entries) was # also detected by btrfs' fsck tool, which is run automatically by the fstests # framework when the test finishes. The error message reported by fsck was: # # root 5 inode 259 errors 2001, no inode item, link count wrong # unresolved ref dir 258 index 3 namelen 3 name bar filetype 1 errors 4, no inode ref # rm -f $SCRATCH_MNT/a/b/* rmdir $SCRATCH_MNT/a/b rmdir $SCRATCH_MNT/a To fix this just make sure that after an unlink, if the inode is fsync'ed, he parent inode is fully logged in the fsync log. A test case for xfstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
Very often our extent buffer's header generation doesn't match the current transaction's id or it is also referenced by other trees (snapshots), so we don't need the corresponding block group cache object. Therefore only search for it if we are going to use it, so we avoid an unnecessary search in the block groups rbtree (and acquiring and releasing its spinlock). Freeing a tree block is performed when COWing or deleting a node/leaf, which implies we are holding the node/leaf's parent node lock, therefore reducing the amount of time spent when freeing a tree block helps reducing the amount of time we are holding the parent node's lock. For example, for a run of xfstests/generic/083, the block group cache object was needed only 682 times for a total of 226691 calls to free a tree block. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-