1. 19 Jan, 2017 1 commit
    • Wang Xiaoguang's avatar
      btrfs: fix false enospc error when truncating heavily reflinked file · 47b5d646
      Wang Xiaoguang authored
      Below test script can reveal this bug:
          dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
          dev=$(losetup --show -f fs.img)
          mkdir -p /mnt/mntpoint
          mkfs.btrfs  -f $dev
          mount $dev /mnt/mntpoint
          cd /mnt/mntpoint
      
          echo "workdir is: /mnt/mntpoint"
          blocksize=$((128 * 1024))
          dd if=/dev/zero of=testfile bs=$blocksize count=1
          sync
          count=$((17*1024*1024*1024/blocksize))
          echo "file size is:" $((count*blocksize))
          for ((i = 1; i <= $count; i++)); do
              dst_offset=$((blocksize * i))
              xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
                      testfile > /dev/null
          done
          sync
          truncate --size 0 testfile
      
      The last truncate operation will fail for ENOSPC reason, but indeed
      it should not fail.
      
      In btrfs_truncate(), we use a temporary block_rsv to do truncate
      operation. With every btrfs_truncate_inode_items() call, we migrate space
      to this block_rsv, but forget to cleanup previous reservation, which
      will make this block_rsv's reserved bytes keep growing, and this reserved
      space will only be released in the end of btrfs_truncate(), this metadata
      leak will impact other's metadata reservation. In this case, it's
      "btrfs_start_transaction(root, 2);" fails for enospc error, which make
      this truncate operation fail.
      
      Call btrfs_block_rsv_release() to fix this bug.
      Signed-off-by: default avatarWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      47b5d646
  2. 11 Jan, 2017 1 commit
  3. 09 Jan, 2017 4 commits
  4. 04 Jan, 2017 1 commit
  5. 03 Jan, 2017 5 commits
  6. 19 Dec, 2016 1 commit
  7. 13 Dec, 2016 2 commits
    • Maxim Patlasov's avatar
      btrfs: limit async_work allocation and worker func duration · 2939e1a8
      Maxim Patlasov authored
      Problem statement: unprivileged user who has read-write access to more than
      one btrfs subvolume may easily consume all kernel memory (eventually
      triggering oom-killer).
      
      Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):
      
      [root@kteam1 ~]# cat prep.sh
      
      DEV=/dev/sdb
      mkfs.btrfs -f $DEV
      mount $DEV /mnt
      for i in `seq 1 16`
      do
      	mkdir /mnt/$i
      	btrfs subvolume create /mnt/SV_$i
      	ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
      	mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
      	chmod a+rwx /mnt/$i
      done
      
      [root@kteam1 ~]# sh prep.sh
      
      [maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done
      
      [root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
      kmalloc-128        10144  10144    128   32    1 : tunables    0    0    0 : slabdata    317    317      0
      kmalloc-128       9992352 9992352    128   32    1 : tunables    0    0    0 : slabdata 312261 312261      0
      kmalloc-128       24226752 24226752    128   32    1 : tunables    0    0    0 : slabdata 757086 757086      0
      kmalloc-128       42754240 42754240    128   32    1 : tunables    0    0    0 : slabdata 1336070 1336070      0
      
      The huge numbers above come from insane number of async_work-s allocated
      and queued by btrfs_wq_run_delayed_node.
      
      The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
      works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
      worker func (btrfs_async_run_delayed_root) processes at least
      BTRFS_DELAYED_BATCH items (if they are present in the list). So, the machinery
      works as expected while the list is almost empty. As soon as it is getting
      bigger, worker func starts to process more than one item at a time, it takes
      longer, and the chances to have async_works queued more than needed is getting
      higher.
      
      The problem above is worsened by another flaw of delayed-inode implementation:
      if async_work was queued in a throttling branch (number of items >=
      BTRFS_DELAYED_WRITEBACK), corresponding worker func won't quit until
      the number of items < BTRFS_DELAYED_BACKGROUND / 2. So, it is possible that
      the func occupies CPU infinitely (up to 30sec in my experiments): while the
      func is trying to drain the list, the user activity may add more and more
      items to the list.
      
      The patch fixes both problems in straightforward way: refuse queuing too
      many works in btrfs_wq_run_delayed_node and bail out of worker func if
      at least BTRFS_DELAYED_WRITEBACK items are processed.
      
      Changed in v2: remove support of thresh == NO_THRESHOLD.
      Signed-off-by: default avatarMaxim Patlasov <mpatlasov@virtuozzo.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Cc: stable@vger.kernel.org # v3.15+
      2939e1a8
    • Chris Mason's avatar
      Merge branch 'for-chris-4.10' of... · 5f52a2c5
      Chris Mason authored
      Merge branch 'for-chris-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.10
      
      Patches queued up by Filipe:
      
      The most important change is still the fix for the extent tree
      corruption that happens due to balance when qgroups are enabled (a
      regression introduced in 4.7 by a fix for a regression from the last
      qgroups rework). This has been hitting SLE and openSUSE users and QA
      very badly, where transactions keep getting aborted when running
      delayed references leaving the root filesystem in RO mode and nearly
      unusable.  There are fixes here that allow us to run xfstests again
      with the integrity checker enabled, which has been impossible since 4.8
      (apparently I'm the only one running xfstests with the integrity
      checker enabled, which is useful to validate dirtied leafs, like
      checking if there are keys out of order, etc).  The rest are just some
      trivial fixes, most of them tagged for stable, and two cleanups.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      5f52a2c5
  8. 11 Dec, 2016 1 commit
  9. 09 Dec, 2016 1 commit
  10. 06 Dec, 2016 20 commits
  11. 30 Nov, 2016 3 commits
    • Robbie Ko's avatar
      Btrfs: fix tree search logic when replaying directory entry deletes · 2a7bf53f
      Robbie Ko authored
      If a log tree has a layout like the following:
      
      leaf N:
              ...
              item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8
                      dir log end 1275809046
      leaf N + 1:
              item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8
                      dir log end 18446744073709551615
              ...
      
      When we pass the value 1275809046 + 1 as the parameter start_ret to the
      function tree-log.c:find_dir_range() (done by replay_dir_deletes()), we
      end up with path->slots[0] having the value 239 (points to the last item
      of leaf N, item 240). Because the dir log item in that position has an
      offset value smaller than *start_ret (1275809046 + 1) we need to move on
      to the next leaf, however the logic for that is wrong since it compares
      the current slot to the number of items in the leaf, which is smaller
      and therefore we don't lookup for the next leaf but instead we set the
      slot to point to an item that does not exist, at slot 240, and we later
      operate on that slot which has unexpected content or in the worst case
      can result in an invalid memory access (accessing beyond the last page
      of leaf N's extent buffer).
      
      So fix the logic that checks when we need to lookup at the next leaf
      by first incrementing the slot and only after to check if that slot
      is beyond the last item of the current leaf.
      Signed-off-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Fixes: e02119d5 (Btrfs: Add a write ahead tree log to optimize synchronous operations)
      Cc: stable@vger.kernel.org  # 2.6.29+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      [Modified changelog for clarity and correctness]
      2a7bf53f
    • Robbie Ko's avatar
      Btrfs: fix deadlock caused by fsync when logging directory entries · ec125cfb
      Robbie Ko authored
      While logging new directory entries, at tree-log.c:log_new_dir_dentries(),
      after we call btrfs_search_forward() we get a leaf with a read lock on it,
      and without unlocking that leaf we can end up calling btrfs_iget() to get
      an inode pointer. The later (btrfs_iget()) can end up doing a read-only
      search on the same tree again, if the inode is not in memory already, which
      ends up causing a deadlock if some other task in the meanwhile started a
      write search on the tree and is attempting to write lock the same leaf
      that btrfs_search_forward() locked while holding write locks on upper
      levels of the tree blocking the read search from btrfs_iget(). In this
      scenario we get a deadlock.
      
      So fix this by releasing the search path before calling btrfs_iget() at
      tree-log.c:log_new_dir_dentries().
      
      Example trace of such deadlock:
      
      [ 4077.478852] kworker/u24:10  D ffff88107fc90640     0 14431      2 0x00000000
      [ 4077.486752] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
      [ 4077.494346]  ffff880ffa56bad0 0000000000000046 0000000000009000 ffff880ffa56bfd8
      [ 4077.502629]  ffff880ffa56bfd8 ffff881016ce21c0 ffffffffa06ecb26 ffff88101a5d6138
      [ 4077.510915]  ffff880ebb5173b0 ffff880ffa56baf8 ffff880ebb517410 ffff881016ce21c0
      [ 4077.519202] Call Trace:
      [ 4077.528752]  [<ffffffffa06ed5ed>] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs]
      [ 4077.536049]  [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30
      [ 4077.542574]  [<ffffffffa068cc1f>] ? btrfs_search_slot+0x79f/0xb10 [btrfs]
      [ 4077.550171]  [<ffffffffa06a5073>] ? btrfs_lookup_file_extent+0x33/0x40 [btrfs]
      [ 4077.558252]  [<ffffffffa06c600b>] ? __btrfs_drop_extents+0x13b/0xdf0 [btrfs]
      [ 4077.566140]  [<ffffffffa06fc9e2>] ? add_delayed_data_ref+0xe2/0x150 [btrfs]
      [ 4077.573928]  [<ffffffffa06fd629>] ? btrfs_add_delayed_data_ref+0x149/0x1d0 [btrfs]
      [ 4077.582399]  [<ffffffffa06cf3c0>] ? __set_extent_bit+0x4c0/0x5c0 [btrfs]
      [ 4077.589896]  [<ffffffffa06b4a64>] ? insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs]
      [ 4077.599632]  [<ffffffffa06b206d>] ? start_transaction+0x8d/0x470 [btrfs]
      [ 4077.607134]  [<ffffffffa06bab57>] ? btrfs_finish_ordered_io+0x2e7/0x600 [btrfs]
      [ 4077.615329]  [<ffffffff8104cbc2>] ? process_one_work+0x142/0x3d0
      [ 4077.622043]  [<ffffffff8104d729>] ? worker_thread+0x109/0x3b0
      [ 4077.628459]  [<ffffffff8104d620>] ? manage_workers.isra.26+0x270/0x270
      [ 4077.635759]  [<ffffffff81052b0f>] ? kthread+0xaf/0xc0
      [ 4077.641404]  [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110
      [ 4077.648696]  [<ffffffff814a9ac8>] ? ret_from_fork+0x58/0x90
      [ 4077.654926]  [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110
      
      [ 4078.358087] kworker/u24:15  D ffff88107fcd0640     0 14436      2 0x00000000
      [ 4078.365981] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
      [ 4078.373574]  ffff880ffa57fad0 0000000000000046 0000000000009000 ffff880ffa57ffd8
      [ 4078.381864]  ffff880ffa57ffd8 ffff88103004d0a0 ffffffffa06ecb26 ffff88101a5d6138
      [ 4078.390163]  ffff880fbeffc298 ffff880ffa57faf8 ffff880fbeffc2f8 ffff88103004d0a0
      [ 4078.398466] Call Trace:
      [ 4078.408019]  [<ffffffffa06ed5ed>] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs]
      [ 4078.415322]  [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30
      [ 4078.421844]  [<ffffffffa068cc1f>] ? btrfs_search_slot+0x79f/0xb10 [btrfs]
      [ 4078.429438]  [<ffffffffa06a5073>] ? btrfs_lookup_file_extent+0x33/0x40 [btrfs]
      [ 4078.437518]  [<ffffffffa06c600b>] ? __btrfs_drop_extents+0x13b/0xdf0 [btrfs]
      [ 4078.445404]  [<ffffffffa06fc9e2>] ? add_delayed_data_ref+0xe2/0x150 [btrfs]
      [ 4078.453194]  [<ffffffffa06fd629>] ? btrfs_add_delayed_data_ref+0x149/0x1d0 [btrfs]
      [ 4078.461663]  [<ffffffffa06cf3c0>] ? __set_extent_bit+0x4c0/0x5c0 [btrfs]
      [ 4078.469161]  [<ffffffffa06b4a64>] ? insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs]
      [ 4078.478893]  [<ffffffffa06b206d>] ? start_transaction+0x8d/0x470 [btrfs]
      [ 4078.486388]  [<ffffffffa06bab57>] ? btrfs_finish_ordered_io+0x2e7/0x600 [btrfs]
      [ 4078.494561]  [<ffffffff8104cbc2>] ? process_one_work+0x142/0x3d0
      [ 4078.501278]  [<ffffffff8104a507>] ? pwq_activate_delayed_work+0x27/0x40
      [ 4078.508673]  [<ffffffff8104d729>] ? worker_thread+0x109/0x3b0
      [ 4078.515098]  [<ffffffff8104d620>] ? manage_workers.isra.26+0x270/0x270
      [ 4078.522396]  [<ffffffff81052b0f>] ? kthread+0xaf/0xc0
      [ 4078.528032]  [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110
      [ 4078.535325]  [<ffffffff814a9ac8>] ? ret_from_fork+0x58/0x90
      [ 4078.541552]  [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110
      
      [ 4079.355824] user-space-program D ffff88107fd30640     0 32020      1 0x00000000
      [ 4079.363716]  ffff880eae8eba10 0000000000000086 0000000000009000 ffff880eae8ebfd8
      [ 4079.372003]  ffff880eae8ebfd8 ffff881016c162c0 ffffffffa06ecb26 ffff88101a5d6138
      [ 4079.380294]  ffff880fbed4b4c8 ffff880eae8eba38 ffff880fbed4b528 ffff881016c162c0
      [ 4079.388586] Call Trace:
      [ 4079.398134]  [<ffffffffa06ed595>] ? btrfs_tree_lock+0x85/0x2f0 [btrfs]
      [ 4079.405431]  [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30
      [ 4079.411955]  [<ffffffffa06876fb>] ? btrfs_lock_root_node+0x2b/0x40 [btrfs]
      [ 4079.419644]  [<ffffffffa068ce83>] ? btrfs_search_slot+0xa03/0xb10 [btrfs]
      [ 4079.427237]  [<ffffffffa06aba52>] ? btrfs_buffer_uptodate+0x52/0x70 [btrfs]
      [ 4079.435041]  [<ffffffffa0689b60>] ? generic_bin_search.constprop.38+0x80/0x190 [btrfs]
      [ 4079.443897]  [<ffffffffa068ea44>] ? btrfs_insert_empty_items+0x74/0xd0 [btrfs]
      [ 4079.451975]  [<ffffffffa072c443>] ? copy_items+0x128/0x850 [btrfs]
      [ 4079.458890]  [<ffffffffa072da10>] ? btrfs_log_inode+0x629/0xbf3 [btrfs]
      [ 4079.466292]  [<ffffffffa06f34a1>] ? btrfs_log_inode_parent+0xc61/0xf30 [btrfs]
      [ 4079.474373]  [<ffffffffa06f45a9>] ? btrfs_log_dentry_safe+0x59/0x80 [btrfs]
      [ 4079.482161]  [<ffffffffa06c298d>] ? btrfs_sync_file+0x20d/0x330 [btrfs]
      [ 4079.489558]  [<ffffffff8112777c>] ? do_fsync+0x4c/0x80
      [ 4079.495300]  [<ffffffff81127a0a>] ? SyS_fdatasync+0xa/0x10
      [ 4079.501422]  [<ffffffff814a9b72>] ? system_call_fastpath+0x16/0x1b
      
      [ 4079.508334] user-space-program D ffff88107fc30640     0 32021      1 0x00000004
      [ 4079.516226]  ffff880eae8efbf8 0000000000000086 0000000000009000 ffff880eae8effd8
      [ 4079.524513]  ffff880eae8effd8 ffff881030279610 ffffffffa06ecb26 ffff88101a5d6138
      [ 4079.532802]  ffff880ebb671d88 ffff880eae8efc20 ffff880ebb671de8 ffff881030279610
      [ 4079.541092] Call Trace:
      [ 4079.550642]  [<ffffffffa06ed595>] ? btrfs_tree_lock+0x85/0x2f0 [btrfs]
      [ 4079.557941]  [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30
      [ 4079.564463]  [<ffffffffa068cc1f>] ? btrfs_search_slot+0x79f/0xb10 [btrfs]
      [ 4079.572058]  [<ffffffffa06bb7d8>] ? btrfs_truncate_inode_items+0x168/0xb90 [btrfs]
      [ 4079.580526]  [<ffffffffa06b04be>] ? join_transaction.isra.15+0x1e/0x3a0 [btrfs]
      [ 4079.588701]  [<ffffffffa06b206d>] ? start_transaction+0x8d/0x470 [btrfs]
      [ 4079.596196]  [<ffffffffa0690ac6>] ? block_rsv_add_bytes+0x16/0x50 [btrfs]
      [ 4079.603789]  [<ffffffffa06bc2e9>] ? btrfs_truncate+0xe9/0x2e0 [btrfs]
      [ 4079.610994]  [<ffffffffa06bd00b>] ? btrfs_setattr+0x30b/0x410 [btrfs]
      [ 4079.618197]  [<ffffffff81117c1c>] ? notify_change+0x1dc/0x680
      [ 4079.624625]  [<ffffffff8123c8a4>] ? aa_path_perm+0xd4/0x160
      [ 4079.630854]  [<ffffffff810f4fcb>] ? do_truncate+0x5b/0x90
      [ 4079.636889]  [<ffffffff810f59fa>] ? do_sys_ftruncate.constprop.15+0x10a/0x160
      [ 4079.644869]  [<ffffffff8110d87b>] ? SyS_fcntl+0x5b/0x570
      [ 4079.650805]  [<ffffffff814a9b72>] ? system_call_fastpath+0x16/0x1b
      
      [ 4080.410607] user-space-program D ffff88107fc70640     0 32028  12639 0x00000004
      [ 4080.418489]  ffff880eaeccbbe0 0000000000000086 0000000000009000 ffff880eaeccbfd8
      [ 4080.426778]  ffff880eaeccbfd8 ffff880f317ef1e0 ffffffffa06ecb26 ffff88101a5d6138
      [ 4080.435067]  ffff880ef7e93928 ffff880f317ef1e0 ffff880eaeccbc08 ffff880f317ef1e0
      [ 4080.443353] Call Trace:
      [ 4080.452920]  [<ffffffffa06ed15d>] ? btrfs_tree_read_lock+0xdd/0x190 [btrfs]
      [ 4080.460703]  [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30
      [ 4080.467225]  [<ffffffffa06876bb>] ? btrfs_read_lock_root_node+0x2b/0x40 [btrfs]
      [ 4080.475400]  [<ffffffffa068cc81>] ? btrfs_search_slot+0x801/0xb10 [btrfs]
      [ 4080.482994]  [<ffffffffa06b2df0>] ? btrfs_clean_one_deleted_snapshot+0xe0/0xe0 [btrfs]
      [ 4080.491857]  [<ffffffffa06a70a6>] ? btrfs_lookup_inode+0x26/0x90 [btrfs]
      [ 4080.499353]  [<ffffffff810ec42f>] ? kmem_cache_alloc+0xaf/0xc0
      [ 4080.505879]  [<ffffffffa06bd905>] ? btrfs_iget+0xd5/0x5d0 [btrfs]
      [ 4080.512696]  [<ffffffffa06caf04>] ? btrfs_get_token_64+0x104/0x120 [btrfs]
      [ 4080.520387]  [<ffffffffa06f341f>] ? btrfs_log_inode_parent+0xbdf/0xf30 [btrfs]
      [ 4080.528469]  [<ffffffffa06f45a9>] ? btrfs_log_dentry_safe+0x59/0x80 [btrfs]
      [ 4080.536258]  [<ffffffffa06c298d>] ? btrfs_sync_file+0x20d/0x330 [btrfs]
      [ 4080.543657]  [<ffffffff8112777c>] ? do_fsync+0x4c/0x80
      [ 4080.549399]  [<ffffffff81127a0a>] ? SyS_fdatasync+0xa/0x10
      [ 4080.555534]  [<ffffffff814a9b72>] ? system_call_fastpath+0x16/0x1b
      Signed-off-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Fixes: 2f2ff0ee (Btrfs: fix metadata inconsistencies after directory fsync)
      Cc: stable@vger.kernel.org # 4.1+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      [Modified changelog for clarity and correctness]
      ec125cfb
    • Robbie Ko's avatar
      Btrfs: fix enospc in hole punching · 2cdaf447
      Robbie Ko authored
      The hole punching can result in adding new leafs (and as a consequence
      new nodes) to the tree because when we find file extent items that span
      beyond the hole range we may end up not deleting them (just adjusting
      them, reducing their range by reducing their length or increasing their
      offset field) and add new file extent items representing holes.
      
      So after splitting a leaf (therefore creating a new one) to insert a new
      file extent item representing a hole, a new node might be added to each
      level of the tree in the worst case scenario (since there's a new key
      and every parent node was full).
      
      For example if a file has an extent item representing the range 0 to 64Mb
      and we punch a hole in the range 1Mb to 20Mb, the existing extent item is
      duplicated and one of the copies is adjusted to represent the range 0 to
      1Mb, the other copy adjusted to represent the range 20Mb to 64Mb, and a
      new file extent item representing a hole in the range 1Mb to 20Mb is
      inserted.
      
      Fix this by using btrfs_calc_trans_metadata_size() instead of
      btrfs_calc_trunc_metadata_size(), so that enough metadata space is
      reserved for the worst possible case.
      Signed-off-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      [Modified changelog for clarity and correctness]
      2cdaf447