1. 23 Jun, 2016 6 commits
    • Wang Xiaoguang's avatar
      btrfs: fix disk_i_size update bug when fallocate() fails · c0d2f610
      Wang Xiaoguang authored
      When doing truncate operation, btrfs_setsize() will first call
      truncate_setsize() to set new inode->i_size, but if later
      btrfs_truncate() fails, btrfs_setsize() will call
      "i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
      inmemory inode size, now bug occurs. It's because for truncate
      case btrfs_ordered_update_i_size() directly uses inode->i_size
      to update BTRFS_I(inode)->disk_i_size, indeed we should use the
      "offset" argument to update disk_i_size. Here is the call graph:
      ==>btrfs_truncate()
      ====>btrfs_truncate_inode_items()
      ======>btrfs_ordered_update_i_size(inode, last_size, NULL);
      Here btrfs_ordered_update_i_size()'s offset argument is last_size.
      
      And below test case can reveal this bug:
      
      dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
      dev=$(losetup --show -f fs.img)
      mkdir -p /mnt/mntpoint
      mkfs.btrfs  -f $dev
      mount $dev /mnt/mntpoint
      cd /mnt/mntpoint
      
      echo "workdir is: /mnt/mntpoint"
      blocksize=$((128 * 1024))
      dd if=/dev/zero of=testfile bs=$blocksize count=1
      sync
      count=$((17*1024*1024*1024/blocksize))
      echo "file size is:" $((count*blocksize))
      for ((i = 1; i <= $count; i++)); do
      	i=$((i + 1))
      	dst_offset=$((blocksize * i))
      	xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
      		testfile > /dev/null
      done
      sync
      
      truncate --size 0 testfile
      ls -l testfile
      du -sh testfile
      exit
      
      In this case, truncate operation will fail for enospc reason and
      "du -sh testfile" returns value greater than 0, but testfile's
      size is 0, we need to reflect correct inode->i_size.
      Signed-off-by: default avatarWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c0d2f610
    • Liu Bo's avatar
      Btrfs: fix error handling in map_private_extent_buffer · 415b35a5
      Liu Bo authored
      map_private_extent_buffer() can return -EINVAL in two different cases,
      1. when the requested contents span two pages if nodesize is larger
         than pagesize,
      2. when it detects something insane.
      
      The 2nd one used to be only a WARN_ON(1), and we decided to return a error
      to callers, but we didn't fix up all its callers, which will be
      addressed by this patch.
      
      Without this, btrfs may end up with 'general protection', ie.
      reading invalid memory.
      Reported-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      415b35a5
    • Wei Yongjun's avatar
      Btrfs: fix error return code in btrfs_init_test_fs() · 04e1b65a
      Wei Yongjun authored
      Fix to return a negative error code from the kern_mount() error handling
      case instead of 0(ret is set to 0 by register_filesystem), as done
      elsewhere in this function.
      Signed-off-by: default avatarWei Yongjun <yongjun_wei@trendmicro.com.cn>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      04e1b65a
    • Josef Bacik's avatar
      Btrfs: don't do nocow check unless we have to · c6887cd1
      Josef Bacik authored
      Before we write into prealloc/nocow space we have to make sure that there are no
      references to the extents we are writing into, which means checking the extent
      tree and csum tree in the case of nocow.  So we don't want to do the nocow dance
      unless we can't reserve data space, since it's a serious drag on performance.
      With the following sequence
      
      fallocate -l10737418240 /mnt/btrfs-test/file
      cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
      fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
      	--end_fsync=1
      
      we get the worst case scenario where we have to fall back on to doing the check
      anyway.
      
      Without this patch
      lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
      write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec
      
      With this patch
      lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
      write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec
      
      We get twice the throughput, half of the runtime, and half of the average
      latency.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      [ PAGE_CACHE_ removal related fixups ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c6887cd1
    • Chris Mason's avatar
      btrfs: fix deadlock in delayed_ref_async_start · 0f873eca
      Chris Mason authored
      "Btrfs: track transid for delayed ref flushing" was deadlocking on
      btrfs_attach_transaction because its not safe to call from the async
      delayed ref start code.  This commit brings back btrfs_join_transaction
      instead and checks for a blocked commit.
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      0f873eca
    • Josef Bacik's avatar
      Btrfs: track transid for delayed ref flushing · 31b9655f
      Josef Bacik authored
      Using the offwakecputime bpf script I noticed most of our time was spent waiting
      on the delayed ref throttling.  This is what is supposed to happen, but
      sometimes the transaction can commit and then we're waiting for throttling that
      doesn't matter anymore.  So change this stuff to be a little smarter by tracking
      the transid we were in when we initiated the throttling.  If the transaction we
      get is different then we can just bail out.  This resulted in a 50% speedup in
      my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
      over the entire run (which is about 30 minutes).  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      31b9655f
  2. 17 Jun, 2016 9 commits
  3. 08 Jun, 2016 2 commits
  4. 06 Jun, 2016 9 commits
  5. 03 Jun, 2016 1 commit
    • Chris Mason's avatar
      Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent · 8dff9c85
      Chris Mason authored
      When dealing with inline extents, btrfs_get_extent will incorrectly try
      to insert a duplicate extent_map.  The dup hits -EEXIST from
      add_extent_map, but then we try to merge with the existing one and end
      up trying to insert a zero length extent_map.
      
      This actually works most of the time, except when there are extent maps
      past the end of the inline extent.  rocksdb will trigger this sometimes
      because it preallocates an extent and then truncates down.
      
      Josef made a script to trigger with xfs_io:
      
      	#!/bin/bash
      
      	xfs_io -f -c "pwrite 0 1000" inline
      	xfs_io -c "falloc -k 4k 1M" inline
      	xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
      	xfs_io -c "fadvise -d 0 1000" inline
      	cat inline
      
      You'll get EIOs trying to read inline after this because add_extent_map
      is returning EEXIST
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8dff9c85
  6. 02 Jun, 2016 4 commits
  7. 31 May, 2016 2 commits
    • Josef Bacik's avatar
      Btrfs: end transaction if we abort when creating uuid root · 65d4f4c1
      Josef Bacik authored
      We still need to call btrfs_end_transaction if we call btrfs_abort_transaction,
      otherwise we hang and make me super grumpy.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      65d4f4c1
    • Filipe Manana's avatar
      Btrfs: fix race between device replace and read repair · b5de8d0d
      Filipe Manana authored
      While we are finishing a device replace operation we can have a concurrent
      task trying to do a read repair operation, in which case it will call
      btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
      points to the source device of the device replace operation. This allows
      for the read repair task to dereference the stripe's device pointer after
      the device replace operation has freed the source device, resulting in
      an invalid memory access. This is similar to the problem solved by my
      previous patch in the same series and named "Btrfs: fix race between
      device replace and discard".
      
      So fix this by surrounding the call to btrfs_map_block() and the code
      that uses the returned struct btrfs_bio with calls to
      btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
      proper serialization with the finishing phase of the device replace
      operation.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      b5de8d0d
  8. 30 May, 2016 7 commits
    • Filipe Manana's avatar
      Btrfs: fix race between device replace and discard · 2999241d
      Filipe Manana authored
      While we are finishing a device replace operation, we can make a discard
      operation (fs mounted with -o discard) do an invalid memory access like
      the one reported by the following trace:
      
      [ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP
      [ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport
      processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci
      virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1
      [ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000
      [ 3206.388595] RIP: 0010:[<ffffffff8124d233>]  [<ffffffff8124d233>] blkdev_issue_discard+0x5c/0x2a7
      [ 3206.388595] RSP: 0018:ffff880171b9bb80  EFLAGS: 00010246
      [ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000
      [ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48
      [ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001
      [ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8
      [ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b
      [ 3206.388595] FS:  00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000
      [ 3206.388595] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0
      [ 3206.388595] Stack:
      [ 3206.388595]  0000000000004878 0000000000000000 0000000002400040 0000000000000000
      [ 3206.388595]  0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0
      [ 3206.388595]  ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0
      [ 3206.388595] Call Trace:
      [ 3206.388595]  [<ffffffffa042899e>] btrfs_issue_discard+0x12f/0x143 [btrfs]
      [ 3206.388595]  [<ffffffffa042899e>] ? btrfs_issue_discard+0x12f/0x143 [btrfs]
      [ 3206.388595]  [<ffffffffa042e862>] btrfs_discard_extent+0x87/0xde [btrfs]
      [ 3206.388595]  [<ffffffffa04303b5>] btrfs_finish_extent_commit+0xb2/0x1df [btrfs]
      [ 3206.388595]  [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
      [ 3206.388595]  [<ffffffffa04464c4>] btrfs_commit_transaction+0x7fc/0x980 [btrfs]
      [ 3206.388595]  [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
      [ 3206.388595]  [<ffffffffa0459af6>] btrfs_sync_file+0x38f/0x428 [btrfs]
      [ 3206.388595]  [<ffffffff811a8292>] vfs_fsync_range+0x8c/0x9e
      [ 3206.388595]  [<ffffffff811a82c0>] vfs_fsync+0x1c/0x1e
      [ 3206.388595]  [<ffffffff811a8417>] do_fsync+0x31/0x4a
      [ 3206.388595]  [<ffffffff811a8637>] SyS_fsync+0x10/0x14
      [ 3206.388595]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [ 3206.388595]  [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
      [ 3206.388595]  [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa
      
      This happens because when we call btrfs_map_block() from
      btrfs_discard_extent() to get a btrfs_bio structure, the device replace
      operation has not finished yet, but before we use the device of one of the
      stripes from the returned btrfs_bio structure, the device object is freed.
      
      This is illustrated by the following diagram.
      
                  CPU 1                                                  CPU 2
      
       btrfs_dev_replace_start()
      
       (...)
      
       btrfs_dev_replace_finishing()
      
         btrfs_start_transaction()
         btrfs_commit_transaction()
      
         (...)
      
                                                                  btrfs_sync_file()
                                                                    btrfs_start_transaction()
      
                                                                    (...)
      
                                                                    btrfs_commit_transaction()
                                                                      btrfs_finish_extent_commit()
                                                                        btrfs_discard_extent()
                                                                          btrfs_map_block()
                                                                            --> returns a struct btrfs_bio
                                                                                with a stripe that has a
                                                                                device field pointing to
                                                                                source device of the replace
                                                                                operation (the device that
                                                                                is being replaced)
      
         mutex_lock(&uuid_mutex)
         mutex_lock(&fs_info->fs_devices->device_list_mutex)
         mutex_lock(&fs_info->chunk_mutex)
      
         btrfs_dev_replace_update_device_in_mapping_tree()
           --> iterates the mapping tree and for each
               extent map that has a stripe pointing to
               the source device, it updates the stripe
               to point to the target device instead
      
         btrfs_rm_dev_replace_blocked()
           --> waits for fs_info->bio_counter to go down to 0
      
         btrfs_rm_dev_replace_remove_srcdev()
           --> removes source device from the list of devices
      
         mutex_unlock(&fs_info->chunk_mutex)
         mutex_unlock(&fs_info->fs_devices->device_list_mutex)
         mutex_unlock(&uuid_mutex)
      
         btrfs_rm_dev_replace_free_srcdev()
           --> frees the source device
      
                                                                          --> iterates over all stripes
                                                                              of the returned struct
                                                                              btrfs_bio
                                                                          --> for each stripe it
                                                                              dereferences its device
                                                                              pointer
                                                                              --> it ends up finding a
                                                                                  pointer to the device
                                                                                  used as the source
                                                                                  device for the replace
                                                                                  operation and that was
                                                                                  already freed
      
      So fix this by surrounding the call to btrfs_map_block(), and the code
      that uses the returned struct btrfs_bio, with calls to
      btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that
      the finishing phase of the device replace operation blocks until the
      the bio counter decreases to zero before it frees the source device.
      This is the same approach we do at btrfs_map_bio() for example.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      2999241d
    • Filipe Manana's avatar
      Btrfs: fix race between device replace and chunk allocation · 22ab04e8
      Filipe Manana authored
      While iterating and copying extents from the source device, the device
      replace code keeps adjusting a left cursor that is used to make sure that
      once we finish processing a device extent, any future writes to extents
      from the corresponding block group will get into both the source and
      target devices. This left cursor is also used for resuming the device
      replace operation at mount time.
      
      However using this left cursor to decide whether writes go into both
      devices or only the source device is not enough to guarantee we don't
      miss copying extents into the target device. There are two cases where
      the current approach fails. The first one is related to when there are
      holes in the device and they get allocated for new block groups while
      the device replace operation is iterating the device extents (more on
      this explained below). The second one is that when that loop over the
      device extents finishes, we start dellaloc, wait for all ordered extents
      and then commit the current transaction, we might have got new block
      groups allocated that are now using a device extent that has an offset
      greater then or equals to the value of the left cursor, in which case
      writes to extents belonging to these new block groups will get issued
      only to the source device.
      
      For the first case where the current approach of using a left cursor
      fails, consider the source device currently has the following layout:
      
        [ extent bg A ] [ hole, unallocated space ] [extent bg B ]
        3Gb             4Gb                         5Gb
      
      While we are iterating the device extents from the source device using
      the commit root of the device tree, the following happens:
      
              CPU 1                                            CPU 2
      
                            <we are at transaction N>
      
        scrub_enumerate_chunks()
          --> searches the device tree for
              extents belonging to the source
              device using the device tree's
              commit root
          --> 1st iteration finds extent belonging to
              block group A
      
              --> sets block group A to RO mode
                  (btrfs_inc_block_group_ro)
      
              --> sets cursor left to found_key.offset
                  which is 3Gb
      
              --> scrub_chunk() starts
                  copies all allocated extents from
                  block group's A stripe at source
                  device into target device
      
                                                                 btrfs_alloc_chunk()
                                                                   --> allocates device extent
                                                                       in the range [4Gb, 5Gb[
                                                                       from the source device for
                                                                       a new block group C
      
                                                                 extent allocated from block
                                                                 group C for a direct IO,
                                                                 buffered write or btree node/leaf
      
                                                                 extent is written to, perhaps
                                                                 in response to a writepages()
                                                                 call from the VM or directly
                                                                 through direct IO
      
                                                                 the write is made only against
                                                                 the source device and not against
                                                                 the target device because the
                                                                 extent's offset is in the interval
                                                                 [4Gb, 5Gb[ which is larger then
                                                                 the value of cursor_left (3Gb)
      
              --> scrub_chunks() finishes
      
              --> updates left cursor from 3Gb to
                  4Gb
      
              --> btrfs_dec_block_group_ro() sets
                  block group A back to RW mode
      
                                   <we are still at transaction N>
      
          --> 2nd iteration finds extent belonging to
              block group B - it did not find the new
              extent in the range [4Gb, 5Gb[ for block
              group C because we are using the device
              tree's commit root or even because the
              block group's items are not all yet
              inserted in the respective btrees, that is,
              the block group is still attached to some
              transaction handle's new_bgs list and
              btrfs_create_pending_block_groups() was
              not called yet against that transaction
              handle, so the device extent items were
              not yet inserted into the devices tree
      
                                   <we are still at transaction N>
      
              --> so we end not copying anything from the newly
                  allocated device extent from the source device
                  to the target device
      
      So fix this by making __btrfs_map_block() always redirect writes to the
      target device as well, independently of the left cursor's value. With
      this change the left cursor is now used only for the purpose of tracking
      progress and allow a mount operation to resume a device replace.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      22ab04e8
    • Filipe Manana's avatar
      Btrfs: fix race setting block group back to RW mode during device replace · 1a1a8b73
      Filipe Manana authored
      After it finishes processing a device extent, the device replace code sets
      back the block group to RW mode and then after that it sets the left cursor
      to match the logical end address of the block group, so that future writes
      into extents belonging to the block group go both the source (old) and
      target (new) devices. However from the moment we turn the block group
      back to RW mode we have a short time window, that lasts until we update
      the left cursor's value, where extents can be allocated from the block
      group and written to, in which case they will not be copied/written to
      the target (new) device. Fix this by updating the left cursor's value
      before turning the block group back to RW mode.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      1a1a8b73
    • Filipe Manana's avatar
      Btrfs: fix unprotected assignment of the left cursor for device replace · 81e87a73
      Filipe Manana authored
      We were assigning new values to fields of the device replace object
      without holding the respective lock after processing each device extent.
      This is important for the left cursor field which can be accessed by a
      concurrent task running __btrfs_map_block (which, correctly, takes the
      device replace lock).
      So change these fields while holding the device replace lock.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      81e87a73
    • Filipe Manana's avatar
      Btrfs: fix race setting block group readonly during device replace · f0e9b7d6
      Filipe Manana authored
      When we do a device replace, for each device extent we find from the
      source device, we set the corresponding block group to readonly mode to
      prevent writes into it from happening while we are copying the device
      extent from the source to the target device. However just before we set
      the block group to readonly mode some concurrent task might have already
      allocated an extent from it or decided it could perform a nocow write
      into one of its extents, which can make the device replace process to
      miss copying an extent since it uses the extent tree's commit root to
      search for extents and only once it finishes searching for all extents
      belonging to the block group it does set the left cursor to the logical
      end address of the block group - this is a problem if the respective
      ordered extents finish while we are searching for extents using the
      extent tree's commit root and no transaction commit happens while we
      are iterating the tree, since it's the delayed references created by the
      ordered extents (when they complete) that insert the extent items into
      the extent tree (using the non-commit root of course).
      Example:
      
                CPU 1                                            CPU 2
      
       btrfs_dev_replace_start()
         btrfs_scrub_dev()
           scrub_enumerate_chunks()
             --> finds device extent belonging
                 to block group X
      
                                     <transaction N starts>
      
                                                            starts buffered write
                                                            against some inode
      
                                                            writepages is run against
                                                            that inode forcing dellaloc
                                                            to run
      
                                                            btrfs_writepages()
                                                              extent_writepages()
                                                                extent_write_cache_pages()
                                                                  __extent_writepage()
                                                                    writepage_delalloc()
                                                                      run_delalloc_range()
                                                                        cow_file_range()
                                                                          btrfs_reserve_extent()
                                                                            --> allocates an extent
                                                                                from block group X
                                                                                (which is not yet
                                                                                 in RO mode)
                                                                          btrfs_add_ordered_extent()
                                                                            --> creates ordered extent Y
                                                              flush_epd_write_bio()
                                                                --> bio against the extent from
                                                                    block group X is submitted
      
             btrfs_inc_block_group_ro(bg X)
               --> sets block group X to readonly
      
             scrub_chunk(bg X)
               scrub_stripe(device extent from srcdev)
                 --> keeps searching for extent items
                     belonging to the block group using
                     the extent tree's commit root
                 --> it never blocks due to
                     fs_info->scrub_pause_req as no
                     one tries to commit transaction N
                 --> copies all extents found from the
                     source device into the target device
                 --> finishes search loop
      
                                                              bio completes
      
                                                              ordered extent Y completes
                                                              and creates delayed data
                                                              reference which will add an
                                                              extent item to the extent
                                                              tree when run (typically
                                                              at transaction commit time)
      
                                                                --> so the task doing the
                                                                    scrub/device replace
                                                                    at CPU 1 misses this
                                                                    and does not copy this
                                                                    extent into the new/target
                                                                    device
      
             btrfs_dec_block_group_ro(bg X)
               --> turns block group X back to RW mode
      
             dev_replace->cursor_left is set to the
             logical end offset of block group X
      
      So fix this by waiting for all cow and nocow writes after setting a block
      group to readonly mode.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      f0e9b7d6
    • Filipe Manana's avatar
      Btrfs: fix race between device replace and block group removal · 57ba4cb8
      Filipe Manana authored
      When it's finishing, the device replace code iterates all extent maps
      representing block group and for each one that has a stripe that refers
      to the source device, it replaces its device with the target device.
      However when it replaces the source device with the target device it,
      the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID),
      only after its ID is changed to match the one from the source device.
      This leads to races with the chunk removal code that can temporarly see
      a device with an ID of 0ULL and then attempt to use that ID to remove
      items from the device tree and fail, causing a transaction abort:
      
      [ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished
      [ 9238.594377] ------------[ cut here ]------------
      [ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs]
      [ 9238.594403] BTRFS: Transaction aborted (error 1)
      [ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core se
      rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod fl
      oppy [last unloaded: btrfs]
      [ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1
      [ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 9238.594421]  0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0
      [ 9238.594422]  0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18
      [ 9238.594423]  0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0
      [ 9238.594424] Call Trace:
      [ 9238.594428]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
      [ 9238.594430]  [<ffffffff81052b14>] __warn+0xc2/0xdd
      [ 9238.594432]  [<ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53
      [ 9238.594434]  [<ffffffff8116c311>] ? kmem_cache_free+0x128/0x188
      [ 9238.594450]  [<ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
      [ 9238.594452]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
      [ 9238.594464]  [<ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs]
      [ 9238.594476]  [<ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs]
      [ 9238.594489]  [<ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs]
      [ 9238.594490]  [<ffffffff8106f403>] kthread+0xd4/0xdc
      [ 9238.594494]  [<ffffffff8149e242>] ret_from_fork+0x22/0x40
      [ 9238.594495]  [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286
      [ 9238.594496] ---[ end trace 183efbe50275f059 ]---
      
      The sequence of steps leading to this is like the following:
      
                    CPU 1                                           CPU 2
      
       btrfs_dev_replace_finishing()
      
         at this point
         dev_replace->tgtdev->devid ==
         BTRFS_DEV_REPLACE_DEVID (0ULL)
      
         ...
      
         btrfs_start_transaction()
         btrfs_commit_transaction()
      
                                                           btrfs_delete_unused_bgs()
                                                             btrfs_remove_chunk()
      
                                                               looks up for the extent map
                                                               corresponding to the chunk
      
                                                               lock_chunks() (chunk_mutex)
                                                               check_system_chunk()
                                                               unlock_chunks() (chunk_mutex)
      
         locks fs_info->chunk_mutex
      
         btrfs_dev_replace_update_device_in_mapping_tree()
           --> iterates fs_info->mapping_tree and
               replaces the device in every extent
               map's map->stripes[] with
               dev_replace->tgtdev, which still has
               an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
      
                                                               iterates over all stripes from
                                                               the extent map
      
                                                                 --> calls btrfs_free_dev_extent()
                                                                     passing it the target device
                                                                     that still has an ID of 0ULL
      
                                                                 --> btrfs_free_dev_extent() fails
                                                                   --> aborts current transaction
      
         finishes setting up the target device,
         namely it sets tgtdev->devid to the value
         of srcdev->devid (which is necessarily > 0)
      
         frees the srcdev
      
         unlocks fs_info->chunk_mutex
      
      So fix this by taking the device list mutex while processing the stripes
      for the chunk's extent map. This is similar to the race between device
      replace and block group creation that was fixed by commit 50460e37
      ("Btrfs: fix race when finishing dev replace leading to transaction abort").
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      57ba4cb8
    • Filipe Manana's avatar
      Btrfs: fix race between readahead and device replace/removal · ce7791ff
      Filipe Manana authored
      The list of devices is protected by the device_list_mutex and the device
      replace code, in its finishing phase correctly takes that mutex before
      removing the source device from that list. However the readahead code was
      iterating that list without acquiring the respective mutex leading to
      crashes later on due to invalid memory accesses:
      
      [125671.831036] general protection fault: 0000 [#1] PREEMPT SMP
      [125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport
      processor ser
      [125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: G        W       4.6.0-rc7-btrfs-next-29+ #1
      [125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs]
      [125671.834973] task: ffff8801ac520540 ti: ffff8801ac918000 task.ti: ffff8801ac918000
      [125671.834973] RIP: 0010:[<ffffffff81270479>]  [<ffffffff81270479>] __radix_tree_lookup+0x6a/0x105
      [125671.834973] RSP: 0018:ffff8801ac91bc28  EFLAGS: 00010206
      [125671.834973] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6a RCX: 0000000000000000
      [125671.834973] RDX: 0000000000000000 RSI: 00000000000c1bff RDI: ffff88002ebd62a8
      [125671.834973] RBP: ffff8801ac91bc70 R08: 0000000000000001 R09: 0000000000000000
      [125671.834973] R10: ffff8801ac91bc70 R11: 0000000000000000 R12: ffff88002ebd62a8
      [125671.834973] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000c1bff
      [125671.834973] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000
      [125671.834973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [125671.834973] CR2: 000000000073cae4 CR3: 00000000b7723000 CR4: 00000000000006e0
      [125671.834973] Stack:
      [125671.834973]  0000000000000000 ffff8801422d5600 ffff8802286bbc00 0000000000000000
      [125671.834973]  0000000000000001 ffff8802286bbc00 00000000000c1bff 0000000000000000
      [125671.834973]  ffff88002e639eb8 ffff8801ac91bc80 ffffffff81270541 ffff8801ac91bcb0
      [125671.834973] Call Trace:
      [125671.834973]  [<ffffffff81270541>] radix_tree_lookup+0xd/0xf
      [125671.834973]  [<ffffffffa04ae6a6>] reada_peer_zones_set_lock+0x3e/0x60 [btrfs]
      [125671.834973]  [<ffffffffa04ae8b9>] reada_pick_zone+0x29/0x103 [btrfs]
      [125671.834973]  [<ffffffffa04af42f>] reada_start_machine_worker+0x129/0x2d3 [btrfs]
      [125671.834973]  [<ffffffffa04880be>] btrfs_scrubparity_helper+0x185/0x3aa [btrfs]
      [125671.834973]  [<ffffffffa0488341>] btrfs_readahead_helper+0xe/0x10 [btrfs]
      [125671.834973]  [<ffffffff81069691>] process_one_work+0x271/0x4e9
      [125671.834973]  [<ffffffff81069dda>] worker_thread+0x1eb/0x2c9
      [125671.834973]  [<ffffffff81069bef>] ? rescuer_thread+0x2b3/0x2b3
      [125671.834973]  [<ffffffff8106f403>] kthread+0xd4/0xdc
      [125671.834973]  [<ffffffff8149e242>] ret_from_fork+0x22/0x40
      [125671.834973]  [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286
      
      So fix this by taking the device_list_mutex in the readahead code. We
      can't use here the lighter approach of using a rcu_read_lock() and
      rcu_read_unlock() pair together with a list_for_each_entry_rcu() call
      because we end up doing calls to sleeping functions (kzalloc()) in the
      respective code path.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      ce7791ff