  1. 09 Jul, 2020 1 commit
  2. 20 Jun, 2020 1 commit
  3. 04 Jan, 2020 2 commits
  4. 21 Dec, 2019 1 commit
    • btrfs: record all roots for rename exchange on a subvol · b2d65356
      Josef Bacik authored
      commit 3e174099 upstream.
      
      Testing with the new fsstress support for subvolumes uncovered a pretty
      bad problem with rename exchange on subvolumes.  We're modifying two
      different subvolumes, but we only start the transaction on one of them,
      so the other one is not added to the dirty root list.  This is caught by
      btrfs_cow_block() with a warning because the root has not been updated,
      however if we do not modify this root again we'll end up pointing at an
      invalid root because the root item is never updated.
      
      Fix this by making sure we add the destination root to the trans list,
      the same as we do with normal renames.  This fixes the corruption.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 21 Nov, 2018 1 commit
    • Btrfs: fix cur_offset in the error case for nocow · 3fe6b9aa
      Robbie Ko authored
      commit 506481b2 upstream.
      
      When the cow_file_range fails, the related resources are unlocked
      according to the range [start..end), so the unlock cannot be repeated in
      run_delalloc_nocow.
      
      In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is
      not updated correctly, so move the cur_offset update before
      cow_file_range.
      
        kernel BUG at mm/page-writeback.c:2663!
        Internal error: Oops - BUG: 0 [#1] SMP
        CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
        Hardware name: Realtek_RTD1296 (DT)
        Workqueue: writeback wb_workfn (flush-btrfs-1)
        task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000
        PC is at clear_page_dirty_for_io+0x1bc/0x1e8
        LR is at clear_page_dirty_for_io+0x14/0x1e8
        pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145
        sp : ffffffc02e9af4f0
        Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020)
        Call trace:
        [<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8
        [<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
        [<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs]
        [<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs]
        [<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
        [<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs]
        [<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
        [<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs]
        [<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs]
        [<ffffffc00033d758>] do_writepages+0x30/0x88
        [<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198
        [<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0
        [<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0
        [<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0
        [<ffffffc0003bb0f0>] wb_workfn+0x150/0x250
        [<ffffffc0002b0014>] process_one_work+0x1dc/0x388
        [<ffffffc0002b02f0>] worker_thread+0x130/0x500
        [<ffffffc0002b6344>] kthread+0x10c/0x110
        [<ffffffc000284590>] ret_from_fork+0x10/0x40
        Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000)
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 13 Nov, 2018 1 commit
  7. 10 Nov, 2018 1 commit
  8. 03 Jul, 2018 2 commits
  9. 30 May, 2018 1 commit
    • do d_instantiate/unlock_new_inode combinations safely · 2d2d3f1e
      Al Viro authored
      commit 1e2e547a upstream.
      
      For anything NFS-exported we do _not_ want to unlock new inode
      before it has grown an alias; original set of fixes got the
      ordering right, but missed the nasty complication in case of
      lockdep being enabled - unlock_new_inode() does
      	lockdep_annotate_inode_mutex_key(inode)
      which can only be done before anyone gets a chance to touch
      ->i_mutex.  Unfortunately, flipping the order and doing
      unlock_new_inode() before d_instantiate() opens a window when
      mkdir can race with open-by-fhandle on a guessed fhandle, leading
      to multiple aliases for a directory inode and all the breakage
      that follows from that.
      
      	Correct solution: a new primitive (d_instantiate_new())
      combining these two in the right order - lockdep annotate, then
      d_instantiate(), then the rest of unlock_new_inode().  All
      combinations of d_instantiate() with unlock_new_inode() should
      be converted to that.
      
      Cc: stable@kernel.org	# 2.6.29 and later
      Tested-by: Mike Marshall <hubcap@omnibond.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  10. 24 Mar, 2018 1 commit
    • Btrfs: fix incorrect space accounting after failure to insert inline extent · fd35ded5
      Filipe Manana authored
      [ Upstream commit 1c81ba23 ]
      
      When using compression, if we fail to insert an inline extent we
      incorrectly end up attempting to free the reserved data space twice,
      once through extent_clear_unlock_delalloc(), because we pass it the
      flag EXTENT_DO_ACCOUNTING, and once through a direct call to
      btrfs_free_reserved_data_space_noquota(). This results in a trace
      like the following:
      
      [  834.576240] ------------[ cut here ]------------
      [  834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
      [  834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2
      [  834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
      [  834.596103] Call Trace:
      [  834.596103]  dump_stack+0x67/0x90
      [  834.596103]  __warn+0xc2/0xdd
      [  834.596103]  warn_slowpath_null+0x1d/0x1f
      [  834.596103]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  834.596103]  compress_file_range.constprop.42+0x2fa/0x3fc [btrfs]
      [  834.596103]  ? submit_compressed_extents+0x3a7/0x3a7 [btrfs]
      [  834.596103]  async_cow_start+0x32/0x4d [btrfs]
      [  834.596103]  btrfs_scrubparity_helper+0x187/0x3e7 [btrfs]
      [  834.596103]  btrfs_delalloc_helper+0xe/0x10 [btrfs]
      [  834.596103]  process_one_work+0x273/0x4e4
      [  834.596103]  worker_thread+0x1eb/0x2ca
      [  834.596103]  ? rescuer_thread+0x2b6/0x2b6
      [  834.596103]  kthread+0x100/0x108
      [  834.596103]  ? __list_del_entry+0x22/0x22
      [  834.596103]  ret_from_fork+0x2e/0x40
      [  834.611656] ---[ end trace 719902fe6bdef08f ]---
      
      So fix this by not calling directly btrfs_free_reserved_data_space_noquota()
      if an error happened.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  11. 22 Feb, 2018 2 commits
  12. 17 Feb, 2018 1 commit
  13. 20 Dec, 2017 1 commit
    • btrfs: add missing memset while reading compressed inline extents · 8f60ef94
      Zygo Blaxell authored
      [ Upstream commit e1699d2d ]
      
      This is a story about 4 distinct (and very old) btrfs bugs.
      
      Commit c8b97818 ("Btrfs: Add zlib compression support") added
      three data corruption bugs for inline extents (bugs #1-3).
      
      Commit 93c82d57 ("Btrfs: zero page past end of inline file items")
      fixed bug #1:  uncompressed inline extents followed by a hole and more
      extents could get non-zero data in the hole as they were read.  The fix
      was to add a memset in btrfs_get_extent to zero out the hole.
      
      Commit 166ae5a4 ("btrfs: fix inline compressed read err corruption")
      fixed bug #2:  compressed inline extents which contained non-zero bytes
      might be replaced with zero bytes in some cases.  This patch removed an
      unhelpful memset from uncompress_inline, but the case where memset is
      required was missed.
      
      There is also a memset in the decompression code, but this only covers
      decompressed data that is shorter than the ram_bytes from the extent
      ref record.  This memset doesn't cover the region between the end of the
      decompressed data and the end of the page.  It has also moved around a
      few times over the years, so there's no single patch to refer to.
      
      This patch fixes bug #3:  compressed inline extents followed by a hole
      and more extents could get non-zero data in the hole as they were read
      (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
      The fix is the same:  zero out the hole in the compressed case too,
      by putting a memset back in uncompress_inline, but this time with
      correct parameters.
      
      The last and oldest bug, bug #0, is the cause of the offending inline
      extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless quirk
      of behavior somewhere in the btrfs write code.  In a few special cases,
      an inline extent and hole are allowed to persist where they normally
      would be combined with later extents in the file.
      
      A fast reproducer for bug #0 is presented below.  A few offending extents
      are also created in the wild during large rsync transfers with the -S
      flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
      will produce a handful of offending files as well.  Once an offending
      file is created, it can present different content to userspace each
      time it is read.
      
      Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
      kernel back to v3.5 has this behavior.  There are fossil records of this
      bug's effects in commits all the way back to v2.6.32.  I have no reason
      to believe bug #0 wasn't present at the beginning of btrfs compression
      support in v2.6.29, but I can't easily test kernels that old to be sure.
      
      It is not clear whether bug #0 is worth fixing.  A fix would likely
      require injecting extra reads into currently write-only paths, and most
      of the exceptional cases caused by bug #0 are already handled now.
      
      Whether we like them or not, bug #0's inline extents followed by holes
      are part of the btrfs de-facto disk format now, and we need to be able
      to read them without data corruption or an infoleak.  So enough about
      bug #0, let's get back to bug #3 (this patch).
      
      An example of on-disk structure leading to data corruption found in
      the wild:
      
              item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
                      inode generation 50 transid 50 size 47424 nbytes 49141
                      block group 0 mode 100644 links 1 uid 0 gid 0
                      rdev 0 flags 0x0(none)
              item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
                      inode ref index 3 namelen 10 name: DB_File.so
              item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
                      inline extent data size 1341 ram 4085 compress(zlib)
              item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
                      extent data disk byte 5367308288 nr 20480
                      extent data offset 0 nr 45056 ram 45056
                      extent compression(zlib)
      
      Different data appears in userspace during each read of the 11 bytes
      between 4085 and 4096.  The extent in item 63 is not long enough to
      fill the first page of the file, so a memset is required to fill the
      space between item 63 (ending at 4085) and item 64 (beginning at 4096)
      with zero.
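
The zero-fill described above can be sketched in user space. This is a minimal illustration, not the kernel's uncompress_inline(): it assumes 4 KiB pages, and the names `SKETCH_PAGE_SIZE` and `fill_inline_page` are hypothetical. It copies the decompressed inline data into a page buffer and then zeroes everything from the end of that data to the end of the page, so stale bytes never reach the reader.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/*
 * Copy already-decompressed inline data into a page buffer and zero
 * the remainder of the page. Returns the number of bytes zeroed.
 */
static size_t fill_inline_page(unsigned char *page,
                               const unsigned char *data,
                               size_t ram_bytes)
{
    /* Never copy more than one page worth of data. */
    size_t copied = ram_bytes < SKETCH_PAGE_SIZE ? ram_bytes : SKETCH_PAGE_SIZE;

    memcpy(page, data, copied);

    /* Without this memset the tail keeps whatever the page held before. */
    memset(page + copied, 0, SKETCH_PAGE_SIZE - copied);

    return SKETCH_PAGE_SIZE - copied;
}
```

For the extent in item 63 (ram 4085), a call with `ram_bytes = 4085` zeroes the 11 bytes up to offset 4096.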
      
      Here is a reproducer from Liu Bo, which demonstrates another method
      of creating the same inline extent and hole pattern:
      
      Using 'page_poison=on' kernel command line (or enable
      CONFIG_PAGE_POISONING) run the following:
      
      	# touch foo
      	# chattr +c foo
      	# xfs_io -f -c "pwrite -W 0 1000" foo
      	# xfs_io -f -c "falloc 4 8188" foo
      	# od -x foo
      	# echo 3 >/proc/sys/vm/drop_caches
      	# od -x foo
      
      This produces the following on my box:
      
      Correct output:  file contains 1000 data bytes followed
      by zeros:
      
      	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
      	*
      	0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000
      	0001760 0000 0000 0000 0000 0000 0000 0000 0000
      	*
      	0020000
      
      Actual output:  the data after the first 1000 bytes
      will be different each run:
      
      	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
      	*
      	0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
      	0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
      	0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
      	(...)
      
      Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  14. 08 Oct, 2017 1 commit
  15. 07 Aug, 2017 1 commit
  16. 05 Jul, 2017 2 commits
    • Btrfs: fix truncate down when no_holes feature is enabled · c3eab85f
      Liu Bo authored
      [ Upstream commit 91298eec ]
      
      For such a file mapping,
      
      [0-4k][hole][8k-12k]
      
      In NO_HOLES mode, we don't have the [hole] extent any more.
      Commit c1aa4575 ("Btrfs: fix shrinking truncate when the no_holes feature is enabled")
       fixed disk isize not being updated in NO_HOLES mode when data is not flushed.
      
      However, even if data has been flushed, we can still have trouble
      in updating disk isize since we updated disk isize to 'start' of
      the last evicted extent.
      
      Reviewed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: Fix deadlock between direct IO and fast fsync · e8b5068b
      Chandan Rajendra authored
      [ Upstream commit 97dcdea0 ]
      
      The following deadlock is seen when executing generic/113 test,
      
       ---------------------------------------------------------+----------------------------------------------------
        Direct I/O task                                           Fast fsync task
       ---------------------------------------------------------+----------------------------------------------------
        btrfs_direct_IO
          __blockdev_direct_IO
           do_blockdev_direct_IO
            do_direct_IO
             btrfs_get_blocks_direct
              while (blocks need to be written)
               get_more_blocks (first iteration)
                btrfs_get_blocks_direct
                 btrfs_create_dio_extent
                   down_read(&BTRFS_I(inode)->dio_sem)
                   Create and add extent map and ordered extent
                   up_read(&BTRFS_I(inode)->dio_sem)
                                                                  btrfs_sync_file
                                                                    btrfs_log_dentry_safe
                                                                     btrfs_log_inode_parent
                                                                      btrfs_log_inode
                                                                       btrfs_log_changed_extents
                                                                        down_write(&BTRFS_I(inode)->dio_sem)
                                                                         Collect new extent maps and ordered extents
                                                                          wait for ordered extent completion
               get_more_blocks (second iteration)
                btrfs_get_blocks_direct
                 btrfs_create_dio_extent
                   down_read(&BTRFS_I(inode)->dio_sem)
       --------------------------------------------------------------------------------------------------------------
      
      In the above description, Btrfs direct I/O code path has not yet started
      submitting bios for file range covered by the initial ordered
      extent. Meanwhile, the fast fsync task obtains the write semaphore and
      waits for I/O on the ordered extent to get completed. However, the
      Direct I/O task is now blocked on obtaining the read semaphore.
      
      To resolve the deadlock, this commit modifies the Direct I/O code path
      to obtain the read semaphore before invoking
      __blockdev_direct_IO(). The semaphore is then given up after
      __blockdev_direct_IO() returns. This allows the Direct I/O code to
      complete I/O on all the ordered extents it creates.
      
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  17. 14 Jun, 2017 1 commit
    • btrfs: use correct types for page indices in btrfs_page_exists_in_range · 4d15ab90
      David Sterba authored
      commit cc2b702c upstream.
      
      Variables start_idx and end_idx are supposed to hold a page index
      derived from the file offsets. The int type is not the right one though,
      offsets larger than 1 << 44 will get silently trimmed off the high bits.
      (1 << 44 is 16TiB)
      
      What can go wrong, if start is below the boundary and end gets trimmed:
      - if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
      - the final check "if (page->index <= end_idx)" will unexpectedly fail
      
      The function will return false, ie. "there's no page in the range",
      although there is at least one.
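
The truncation described above can be reproduced in user space. The following is a minimal sketch, not the kernel code: it assumes 4 KiB pages, and the helper names (`page_index_wide`, `page_index_narrow`, `page_in_range`) are hypothetical. It uses `uint32_t` rather than `int` for the narrow index so that the loss of the high bits is well defined in standard C.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Page index held in a type wide enough for any 64-bit file offset. */
static uint64_t page_index_wide(uint64_t file_offset)
{
    return file_offset >> SKETCH_PAGE_SHIFT;
}

/* Page index forced through 32 bits: the high bits are silently dropped. */
static uint32_t page_index_narrow(uint64_t file_offset)
{
    return (uint32_t)(file_offset >> SKETCH_PAGE_SHIFT);
}

/* Range check in the spirit of "page->index <= end_idx". */
static int page_in_range(uint64_t page_index, uint64_t start_idx,
                         uint64_t end_idx)
{
    return page_index >= start_idx && page_index <= end_idx;
}
```

For a range that straddles the 16TiB boundary, the narrow end index wraps to 0, so a page that really lies inside the range fails the final check and the range is reported as empty.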
      
      btrfs_page_exists_in_range is used to prevent races in:
      
      * in hole punching, where we make sure there are not pages in the
        truncated range, otherwise we'll wait for them to finish and redo
        truncation, but we're going to replace the pages with holes anyway so
        the only problem is the intermediate state
      
      * lock_extent_direct: we want to make sure there are no pages before we
        lock and start DIO, to prevent stale data reads
      
      For practical occurrence of the bug, there are several constraints.  The
      file must be quite large, the affected range must cross the 16TiB
      boundary and the internal state of the file pages and pending operations
      must match.  Also, we must not have started any ordered data in the
      range, otherwise we don't even reach the buggy function check.
      
      DIO locking tries hard in several places to avoid deadlocks with
      buffered IO and avoids waiting for ranges. The worst consequence seems
      to be stale data read.
      
      CC: Liu Bo <bo.li.liu@oracle.com>
      Fixes: fc4adbff ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  18. 01 Feb, 2017 3 commits
  19. 24 Oct, 2016 2 commits
    • btrfs: pass correct args to btrfs_async_run_delayed_refs() · dd4b857a
      Wang Xiaoguang authored
      In btrfs_truncate_inode_items()->btrfs_async_run_delayed_refs(), we
      swap the arg2 and arg3 wrongly, fix this.
      
      This bug only affects asynchronous delayed ref handling when we truncate inodes.
      delayed_ref_async_start() contains the following code:
      
          trans = btrfs_join_transaction(async->root);
          if (trans->transid > async->transid)
              goto end;
          ret = btrfs_run_delayed_refs(trans, async->root, async->count);
      
      From this code, we can see that the bug only influences whether we handle
      delayed refs at all, or how many delayed refs we handle. This may impact
      performance, but will not result in missing delayed refs, since all delayed
      refs are handled in btrfs_commit_transaction().
      
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Prevent qgroup->reserved from going subzero · 0b34c261
      Goldwyn Rodrigues authored
      While free'ing qgroup->reserved resources, we must check if
      the page has not been invalidated by a truncate operation
      by checking if the page is still dirty before reducing the
      qgroup resources. Resources in such a case are free'd when
      the entire extent is released by delayed_ref.
      
      This fixes a double accounting while releasing resources
      in case of truncating a file, reproduced by the following testcase.
      
      SCRATCH_DEV=/dev/vdb
      SCRATCH_MNT=/mnt
      mkfs.btrfs -f $SCRATCH_DEV
      mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT
      cd $SCRATCH_MNT
      btrfs quota enable $SCRATCH_MNT
      btrfs subvolume create a
      btrfs qgroup limit 500m a $SCRATCH_MNT
      sync
      for c in {1..15}; do
      dd if=/dev/zero  bs=1M count=40 of=$SCRATCH_MNT/a/file;
      done
      
      sleep 10
      sync
      sleep 5
      
      touch $SCRATCH_MNT/a/newfile
      
      echo "Removing file"
      rm $SCRATCH_MNT/a/file
      
      Fixes: b9d0b389 ("btrfs: Add handler for invalidate page")
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  20. 10 Oct, 2016 1 commit
    • [btrfs] fix check_direct_IO() for non-iovec iterators · cd27e455
      Al Viro authored
      looking for duplicate ->iov_base makes sense only for
      iovec-backed iterators; for kvec-backed ones it's pointless,
      for bvec-backed ones it's pointless and broken on 32bit (we
      walk through an array of struct bio_vec accessing them as if
      they were struct iovec; works by accident on 64bit, but on
      32bit it'll blow up) and for pipe-backed ones it's pointless
      and ends up oopsing.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  21. 08 Oct, 2016 1 commit
  22. 28 Sep, 2016 1 commit
  23. 27 Sep, 2016 1 commit
  24. 26 Sep, 2016 4 commits
  25. 22 Sep, 2016 1 commit
  26. 16 Sep, 2016 1 commit
  27. 14 Sep, 2016 1 commit
  28. 25 Aug, 2016 1 commit
    • btrfs: update btrfs_space_info's bytes_may_use timely · 18513091
      Wang Xiaoguang authored
      This patch can fix some false ENOSPC errors, below test script can
      reproduce one false ENOSPC error:
      	#!/bin/bash
      	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
      	dev=$(losetup --show -f fs.img)
      	mkfs.btrfs -f -M $dev
      	mkdir /tmp/mntpoint
      	mount $dev /tmp/mntpoint
      	cd /tmp/mntpoint
      	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
      
      Above script will fail for ENOSPC reason, but indeed fs still has free
      space to satisfy this request. Please see call graph:
      btrfs_fallocate()
      |-> btrfs_alloc_data_chunk_ondemand()
      |   bytes_may_use += 64M
      |-> btrfs_prealloc_file_range()
          |-> btrfs_reserve_extent()
              |-> btrfs_add_reserved_bytes()
              |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
              |   change bytes_may_use, and bytes_reserved += 64M. Now
              |   bytes_may_use + bytes_reserved == 128M, which is greater
              |   than btrfs_space_info's total_bytes, false enospc occurs.
              |   Note, the bytes_may_use decrease operation will be done in
              |   end of btrfs_fallocate(), which is too late.
      
      Here is another simple case for buffered write:
                          CPU 1              |              CPU 2
                                             |
      |-> cow_file_range()                   |-> __btrfs_buffered_write()
          |-> btrfs_reserve_extent()         |   |
          |                                  |   |
          |                                  |   |
          |    .....                         |   |-> btrfs_check_data_free_space()
          |                                  |
          |                                  |
          |-> extent_clear_unlock_delalloc() |
      
      In CPU 1, btrfs_reserve_extent()->find_free_extent()->
      btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
      operation will be delayed to be done in extent_clear_unlock_delalloc().
      Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
      btrfs_check_data_free_space() tries to reserve 100MB data space.
      If
      	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
      		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
      		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
      btrfs_check_data_free_space() will try to allocate a new data chunk or call
      btrfs_start_delalloc_roots(), or commit current transaction in order to
      reserve some free space, obviously a lot of work. But indeed it's not
      necessary as long as decreasing bytes_may_use timely, we still have
      free space, decreasing 128M from bytes_may_use.
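
The accounting problem above can be modelled with a few counters. This is a simplified sketch under stated assumptions, not btrfs code: `struct space_info_sketch` and the helpers `free_bytes`, `add_reserved_bytes` and `can_reserve` are hypothetical names, and only the fields named in the inequality above are modelled. The `update_may_use` flag switches between the old behaviour (bytes_may_use is released later) and the behaviour this patch describes (bytes_may_use is released when the extent is reserved).

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MIB (1024ull * 1024ull)

/* Hypothetical, simplified stand-in for the space_info counters. */
struct space_info_sketch {
    uint64_t total_bytes;
    uint64_t bytes_used;
    uint64_t bytes_reserved;
    uint64_t bytes_pinned;
    uint64_t bytes_readonly;
    uint64_t bytes_may_use;
};

/* Space left once every kind of claim has been subtracted. */
static uint64_t free_bytes(const struct space_info_sketch *s)
{
    uint64_t claimed = s->bytes_used + s->bytes_reserved + s->bytes_pinned +
                       s->bytes_readonly + s->bytes_may_use;

    return claimed >= s->total_bytes ? 0 : s->total_bytes - claimed;
}

/*
 * Turn a reservation into a reserved extent. When update_may_use is
 * true, the matching bytes_may_use is dropped at the same time, so the
 * same bytes are not counted twice.
 */
static void add_reserved_bytes(struct space_info_sketch *s,
                               uint64_t num_bytes, bool update_may_use)
{
    s->bytes_reserved += num_bytes;
    if (update_may_use)
        s->bytes_may_use -= num_bytes < s->bytes_may_use ?
                            num_bytes : s->bytes_may_use;
}

/* Would a new reservation of num_bytes fit right now? */
static bool can_reserve(const struct space_info_sketch *s, uint64_t num_bytes)
{
    return num_bytes <= free_bytes(s);
}
```

With 256 MiB total and 128 MiB already in bytes_may_use, reserving a 128 MiB extent without updating bytes_may_use leaves no apparent free space, so a 100 MiB request looks like ENOSPC; with the update, 128 MiB remains and the request fits.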
      
      To fix this issue, this patch chooses to update bytes_may_use for both
      data and metadata in btrfs_add_reserved_bytes(). For compress path, real
      extent length may not be equal to file content length, so introduce a
      ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
      btrfs_add_reserved_bytes(), it's because bytes_may_use is increased by
      file content length. Then compress path can update bytes_may_use
      correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
      and RESERVE_FREE.
      
      As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
      run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
      PREALLOC, we also need to update bytes_may_use, but can not pass
      EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
      here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
      to update btrfs_space_info's bytes_may_use.
      
      Meanwhile __btrfs_prealloc_file_range() will call
      btrfs_free_reserved_data_space() internally for both successful and failed
      paths, so btrfs_prealloc_file_range()'s callers do not need to call
      btrfs_free_reserved_data_space() any more.
      
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  29. 07 Aug, 2016 1 commit
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe authored
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
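
Why a raw assignment to the old field goes wrong can be shown with a small model. This is a sketch only: the bit widths (`SKETCH_OP_BITS`, `SKETCH_OP_SHIFT`), the struct `bio_sketch` and the helpers below are hypothetical and are not taken from the kernel headers. The only property borrowed from the description above is that the op code lives in the high bits and the flags in the low bits of one field.

```c
#include <stdint.h>

#define SKETCH_OP_BITS   3u                      /* assumption */
#define SKETCH_OP_SHIFT  (32u - SKETCH_OP_BITS)  /* op sits in the top bits */
#define SKETCH_FLAG_MASK ((1u << SKETCH_OP_SHIFT) - 1u)
#define SKETCH_OP_MASK   ((1u << SKETCH_OP_BITS) - 1u)

/* Hypothetical stand-in for a bio with the renamed member. */
struct bio_sketch {
    uint32_t bi_opf;
};

/* Pack op and flags into the single field. */
static void set_op_attrs(struct bio_sketch *bio, uint32_t op, uint32_t flags)
{
    bio->bi_opf = ((op & SKETCH_OP_MASK) << SKETCH_OP_SHIFT) |
                  (flags & SKETCH_FLAG_MASK);
}

/* Extract the op code from the high bits. */
static uint32_t get_op(const struct bio_sketch *bio)
{
    return bio->bi_opf >> SKETCH_OP_SHIFT;
}

/* Extract the flags from the low bits. */
static uint32_t get_flags(const struct bio_sketch *bio)
{
    return bio->bi_opf & SKETCH_FLAG_MASK;
}
```

In this model, old-style code that stores a small integer such as 1 directly into the field, intending it as an op code, ends up with op 0 and a stray flag bit. Renaming the member turns that silent misread into a compile error for any code still using the old name.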
      
      No intended functional changes in this commit.
      
      Signed-off-by: Jens Axboe <axboe@fb.com>
  30. 01 Aug, 2016 1 commit
    • Btrfs: improve performance on fsync against new inode after rename/unlink · 44f714da
      Filipe Manana authored
      With commit 56f23fdb ("Btrfs: fix file/data loss caused by fsync after
      rename and new inode") we got a simple fix for a functional issue when the
      following sequence of actions is done:
      
        at transaction N
        create file A at directory D
        at transaction N + M (where M >= 1)
        move/rename existing file A from directory D to directory E
        create a new file named A at directory D
        fsync the new file
        power fail
      
      The solution was to simply detect such a scenario and fall back to a full
      transaction commit when we detect it. However this turned out to have a
      significant impact on throughput (and a bit on latency too) for benchmarks
      using the dbench tool, which simulates real workloads from smbd (Samba)
      servers. For example on a test vm (with a debug kernel):
      
      Unpatched:
      Throughput 19.1572 MB/sec  32 clients  32 procs  max_latency=1005.229 ms
      
      Patched:
      Throughput 23.7015 MB/sec  32 clients  32 procs  max_latency=809.206 ms
      
      The patched results (this patch is applied) are similar to the results of
      a kernel with the commit 56f23fdb ("Btrfs: fix file/data loss caused
      by fsync after rename and new inode") reverted.
      
      This change avoids the fallback to a transaction commit and instead makes
      sure all the names of the conflicting inode (the one that had a name in a
      past transaction that matches the name of the new file in the same parent
      directory) are logged so that at log replay time we lose neither the
      new file nor the old file, and the old file gets the name it was renamed
      to.
      
      This also ends up avoiding a full transaction commit for a similar case
      that involves an unlink instead of a rename of the old file:
      
        at transaction N
        create file A at directory D
        at transaction N + M (where M >= 1)
        remove file A
        create a new file named A at directory D
        fsync the new file
        power fail
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>