1. 23 Aug, 2021 40 commits
    • Qu Wenruo's avatar
      btrfs: subpage: fix writeback which does not have ordered extent · cc1d0d93
      Qu Wenruo authored
      [BUG]
      When running fsstress with subpage RW support, there are random
      BUG_ON()s triggered with the following trace:
      
       kernel BUG at fs/btrfs/file-item.c:667!
       Internal error: Oops - BUG: 0 [#1] SMP
       CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
       Hardware name: Radxa ROCK Pi 4B (DT)
       Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
       pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
       pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
       lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
       Call trace:
        btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
        btrfs_submit_bio_start+0x20/0x30 [btrfs]
        run_one_async_start+0x28/0x44 [btrfs]
        btrfs_work_helper+0x128/0x1b4 [btrfs]
        process_one_work+0x22c/0x430
        worker_thread+0x70/0x3a0
        kthread+0x13c/0x140
        ret_from_fork+0x10/0x30
      
      [CAUSE]
      Above BUG_ON() means there is some bio range which doesn't have ordered
      extent, which indeed is worth a BUG_ON().
      
      Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
      subpage dirty bitmap to record which range is dirty and should be
      written back.
      
      This means, if we submit bio for a subpage range, we do not only need to
      clear page dirty, but also need to clear subpage dirty bits.
      
      In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
      any range we submit a bio.
      
      But there is loophole, if we hit a range which is beyond i_size, we just
      call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
      then break out, without clearing the subpage dirty.
      
      This means, if we hit above branch, the subpage dirty bits are still
      there, if other range of the page get dirtied and we need to writeback
      that page again, we will submit bio for the old range, leaving a wild
      bio range which doesn't have ordered extent.
      
      [FIX]
      Fix it by always calling btrfs_page_clear_dirty() in
      __extent_writepage_io().
      
      Also to avoid such problem from happening again, add a new assert,
      btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
      dirty bits are cleared before exiting __extent_writepage_io().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cc1d0d93
    • Qu Wenruo's avatar
      btrfs: make relocate_one_page() handle subpage case · c2832898
      Qu Wenruo authored
      For subpage case, one page of data reloc inode can contain several file
      extents, like this:
      
      |<--- File extent A --->| FE B | FE C |<--- File extent D -->|
      		|<--------- Page --------->|
      
      We can no longer use PAGE_SIZE directly for various operations.
      
      This patch will relocate_one_page() to handle subpage case by:
      - Iterating through all extents of a cluster when marking pages
        When marking pages dirty and delalloc, we need to check the cluster
        extent boundary.
        Now we introduce a loop to go extent by extent of a page, until we
        either finished the last extent, or reach the page end.
      
        By this, regular sectorsize == PAGE_SIZE can still work as usual, since
        we will do that loop only once.
      
      - Iteration start from max(page_start, extent_start)
        Since we can have the following case:
      			| FE B | FE C |<--- File extent D -->|
      		|<--------- Page --------->|
        Thus we can't always start from page_start, but do a
        max(page_start, extent_start)
      
      - Iteration end when the cluster is exhausted
        Similar to previous case, the last file extent can end before the page
        end:
      |<--- File extent A --->| FE B | FE C |
      		|<--------- Page --------->|
        In this case, we need to manually exit the loop after we have finished
        the last extent of the cluster.
      
      - Reserve metadata space for each extent range
        Since now we can hit multiple ranges in one page, we should reserve
        metadata for each range, not simply PAGE_SIZE.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c2832898
    • Qu Wenruo's avatar
      btrfs: reloc: factor out relocation page read and dirty part · f47960f4
      Qu Wenruo authored
      In function relocate_file_extent_cluster(), we have a big loop for
      marking all involved page delalloc.
      
      That part is long enough to be contained in one function, so this patch
      will move that code chunk into a new function, relocate_one_page().
      
      This also provides enough space for later subpage work.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f47960f4
    • Qu Wenruo's avatar
      btrfs: rework lzo_decompress_bio() to make it subpage compatible · a6e66e6f
      Qu Wenruo authored
      For the initial subpage support, although we won't support compressed
      write, we still need to support compressed read.
      
      But for lzo_decompress_bio() it has several problems:
      
      - The abuse of PAGE_SIZE for boundary detection
        For subpage case, we should follow sectorsize to detect the padding
        zeros.
        Using PAGE_SIZE will cause subpage compress read to skip certain
        bytes, and causing read error.
      
      - Too many helper variables
        There are half a dozen helper variables, which is only making things
        harder to read
      
      This patch will rework lzo_decompress_bio() to make it work for subpage:
      
      - Use sectorsize to do boundary check, while still use PAGE_SIZE for
        page switching
        This allows us to have the same on-disk format for 4K sectorsize fs,
        while take advantage of larger page size.
      
      - Use two main cursors
        Only @cur_in and @cur_out is utilized as the main cursor.
        The helper variables will only be declared inside the loop, and only 2
        helper variables needed.
      
      - Introduce a helper function to copy compressed segment payload
        Introduce a new helper, copy_compressed_segment(), to copy a
        compressed segment to workspace buffer.
        This function will handle the page switching.
      
      Now the net result is, with all the excessive comments and new helper
      function, the refactored code is still smaller, and easier to read.
      
      For other decompression code, they have no special padding rule, thus no
      need to bother for initial subpage support, but will be refactored to
      the same style later.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a6e66e6f
    • Qu Wenruo's avatar
      btrfs: rework btrfs_decompress_buf2page() · 1c3dc173
      Qu Wenruo authored
      There are several bugs inside the function btrfs_decompress_buf2page()
      
      - @start_byte doesn't take bvec.bv_offset into consideration
        Thus it can't handle case where the target range is not page aligned.
      
      - Too many helper variables
        There are tons of helper variables, @buf_offset, @current_buf_start,
        @start_byte, @prev_start_byte, @working_bytes, @bytes.
        This hurts anyone who wants to read the function.
      
      - No obvious main cursor for the iteartion
        A new problem caused by previous problem.
      
      - Comments for parameter list makes no sense
        Like @buf_start is the offset to @buf, or offset inside the full
        decompressed extent? (Spoiler alert, the later case)
        And @total_out acts more like @buf_start + @size_of_buf.
      
        The worst is @disk_start.
        The real meaning of it is the file offset of the full decompressed
        extent.
      
      This patch will rework the whole function by:
      
      - Add a proper comment with ASCII art to explain the parameter list
      
      - Rework parameter list
        The old @buf_start is renamed to @decompressed, to show how many bytes
        are already decompressed inside the full decompressed extent.
        The old @total_out is replaced by @buf_len, which is the decompressed
        data size.
        For old @disk_start and @bio, just pass @compressed_bio in.
      
      - Use single main cursor
        The main cursor will be @cur_file_offset, to show what's the current
        file offset.
        Other helper variables will be declared inside the main loop, and only
        minimal amount of helper variables:
        * offset_inside_decompressed_buf:	The only real helper
        * copy_start_file_offset:		File offset we start memcpy
        * bvec_file_offset:			File offset of current bvec
      
      Even with all these extensive comments, the final function is still
      smaller than the original function, which is definitely a win.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1c3dc173
    • Qu Wenruo's avatar
      btrfs: grab correct extent map for subpage compressed extent read · 557023ea
      Qu Wenruo authored
      [BUG]
      When subpage compressed read write support is enabled, btrfs/038 always
      fails with EIO.
      
      A simplified script can easily trigger the problem:
      
        mkfs.btrfs -f -s 4k $dev
        mount $dev $mnt -o compress=lzo
      
        xfs_io -f -c "truncate 118811" $mnt/foo
        xfs_io -c "pwrite -S 0x0d -b 39987 92267 39987" $mnt/foo > /dev/null
      
        sync
        btrfs subvolume snapshot -r $mnt $mnt/mysnap1
      
        xfs_io -c "pwrite -S 0x3e -b 80000 200000 80000" $mnt/foo > /dev/null
        sync
      
        xfs_io -c "pwrite -S 0xdc -b 10000 250000 10000" $mnt/foo > /dev/null
        xfs_io -c "pwrite -S 0xff -b 10000 300000 10000" $mnt/foo > /dev/null
      
        sync
        btrfs subvolume snapshot -r $mnt $mnt/mysnap2
      
        cat $mnt/mysnap2/foo
        # Above cat will fail due to EIO
      
      [CAUSE]
      The problem is in btrfs_submit_compressed_read().
      
      When it tries to grab the extent map of the read range, it uses the
      following call:
      
      	em = lookup_extent_mapping(em_tree,
        				   page_offset(bio_first_page_all(bio)),
      				   fs_info->sectorsize);
      
      The problem is in the page_offset(bio_first_page_all(bio)) part.
      
      The offending inode has the following file extent layout
      
              item 10 key (257 EXTENT_DATA 131072) itemoff 15639 itemsize 53
                      generation 8 type 1 (regular)
                      extent data disk byte 13680640 nr 4096
                      extent data offset 0 nr 4096 ram 4096
                      extent compression 0 (none)
              item 11 key (257 EXTENT_DATA 135168) itemoff 15586 itemsize 53
                      generation 8 type 1 (regular)
                      extent data disk byte 0 nr 0
              item 12 key (257 EXTENT_DATA 196608) itemoff 15533 itemsize 53
                      generation 8 type 1 (regular)
                      extent data disk byte 13676544 nr 4096
                      extent data offset 0 nr 53248 ram 86016
                      extent compression 2 (lzo)
      
      And the bio passed in has the following parameters:
      
      page_offset(bio_first_page_all(bio))	= 131072
      bio_first_bvec_all(bio)->bv_offset	= 65536
      
      If we use page_offset(bio_first_page_all(bio) without adding bv_offset,
      we will get an extent map for file offset 131072, not 196608.
      
      This means we read uncompressed data from disk, and later decompression
      will definitely fail.
      
      [FIX]
      Take bv_offset into consideration when trying to grab an extent map.
      
      And add an ASSERT() to ensure we're really getting a compressed extent.
      
      Thankfully this won't affect anything but subpage, thus we only need to
      ensure this patch get merged before we enabled basic subpage support.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      557023ea
    • Qu Wenruo's avatar
      btrfs: disable compressed readahead for subpage · ca62e85d
      Qu Wenruo authored
      For current subpage support, we only support 64K page size with 4K
      sector size.
      
      This makes compressed readahead less effective, as maximum compressed
      extent size is only 128K, 2x the page size.
      
      On the other hand, the function add_ra_bio_pages() is still assuming
      sectorsize == PAGE_SIZE, and code change may affect 4K page size
      systems.
      
      So for now, let's disable subpage compressed readahead for now.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca62e85d
    • Qu Wenruo's avatar
      btrfs: subpage: check if there are compressed extents inside one page · 3670e645
      Qu Wenruo authored
      [BUG]
      When testing experimental subpage compressed write support, it hits a
      NULL pointer dereference inside read path:
      
       Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
       pc : __pi_memcmp+0x28/0x1ec
       lr : check_data_csum+0xd0/0x274 [btrfs]
       Call trace:
        __pi_memcmp+0x28/0x1ec
        btrfs_verify_data_csum+0xf4/0x244 [btrfs]
        end_bio_extent_readpage+0x1d0/0x6b0 [btrfs]
        bio_endio+0x15c/0x1dc
        end_workqueue_fn+0x44/0x64 [btrfs]
        btrfs_work_helper+0x74/0x250 [btrfs]
        process_one_work+0x1d4/0x47c
        worker_thread+0x180/0x400
        kthread+0x11c/0x120
        ret_from_fork+0x10/0x30
       Code: 54000261 d100044c d343fd8c f8408403 (f8408424)
       ---[ end trace 9e2c59f33ea40866 ]---
      
      [CAUSE]
      When reading two compressed extents inside the same page, like the
      following layout, we trigger above crash:
      
      	0	32K	64K
      	|-------|\\\\\\\|
      	     |	     \- Compressed extent (A)
      	     \--------- Compressed extent (B)
      
      For compressed read, we don't need to populate its io_bio->csum, as we
      rely on compressed_bio->csum to verify the compressed data, and then
      copy the decompressed to inode pages.
      
      Normally btrfs_verify_data_csum() skip such page by checking and
      clearing its PageChecked flag
      
      But since that flag is still for the full page, when endio for inode
      page range [0, 32K) gets executed, it clears PageChecked flag for the
      full page.
      
      Then when endio for inode page range [32K, 64K) gets executed, since the
      page no longer has PageChecked flag, it just continues checking, even
      though io_bio->csum is NULL.
      
      [FIX]
      Thankfully there are only two users of PageChecked bit:
      
      - Cow fixup
        Since subpage has its own way to trace page dirty (dirty_bitmap) and
        ordered bit (ordered_bitmap), it should never trigger cow fixup.
      
      - Compressed read
        We can distinguish such read by just checking io_bio->csum.
      
      So just check io_bio->csum before doing the verification to avoid such
      NULL pointer dereference.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3670e645
    • Qu Wenruo's avatar
      btrfs: reset this_bio_flag to avoid inheriting old flags · 4c37a793
      Qu Wenruo authored
      In btrfs_do_readpage(), we never reset @this_bio_flag after we hit a
      compressed extent.
      
      This is fine, as for PAGE_SIZE == sectorsize case, we can only have one
      sector for one page, thus @this_bio_flag will only be set at most once.
      
      But for subpage case, after hitting a compressed extent, @this_bio_flag
      will always have EXTENT_BIO_COMPRESSED bit, even we're reading a regular
      extent.
      
      This will lead to various read errors, and causing new ASSERT() in
      incoming subpage patches, which adds more strict check in
      btrfs_submit_compressed_read().
      
      Fix it by declaring @this_bio_flag inside the main loop and reset its
      value for each iteration.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c37a793
    • David Sterba's avatar
      btrfs: constify and cleanup variables in comparators · 214cc184
      David Sterba authored
      Comparators just read the data and thus get const parameters. This
      should be also preserved by the local variables, update all comparators
      passed to sort or bsearch.
      
      Cleanups:
      
      - unnecessary casts are dropped
      - btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern
        and 'inline' is dropped as the function address is taken
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      214cc184
    • David Sterba's avatar
      btrfs: simplify data stripe calculation helpers · d58ede8d
      David Sterba authored
      There are two helpers doing the same calculations based on nparity and
      ncopies. calc_data_stripes can be simplified into one expression, so far
      we don't have profile with both copies and parity, so there's no
      effective change. calc_stripe_length should reuse the helper and not
      repeat the same calculation.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d58ede8d
    • David Sterba's avatar
      btrfs: merge alloc_device helpers · fe4f46d4
      David Sterba authored
      The device allocation is split to two functions, but one just calls the
      other and they're very far in the file. Merge them together.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fe4f46d4
    • David Sterba's avatar
      btrfs: uninline btrfs_bg_flags_to_raid_index · 500a44c9
      David Sterba authored
      The helper does a simple translation from block group flags to index to
      the btrfs_raid_array table. There's no apparent reason to inline the
      function, the translation happens usually once per function and is not
      called in a loop.
      
      Making it a proper function saves quite some binary code (x86_64,
      release config):
      
         text    data     bss     dec     hex filename
      1164011   19253   14912 1198176  124860 pre/btrfs.ko
      1161559   19253   14912 1195724  123ecc post/btrfs.ko
      
      DELTA: -2451
      
      Also add the const attribute as there are no side effects, this could
      help compiler to optimize a few things without the function body.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      500a44c9
    • David Sterba's avatar
      btrfs: tree-checker: add missing stripe checks for raid1c3/4 profiles · 6c154ba4
      David Sterba authored
      The stripe checks for raid1c3/raid1c4 are missing in the sequence in
      btrfs_check_chunk_valid.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6c154ba4
    • David Sterba's avatar
      btrfs: tree-checker: use table values for stripe checks · 0ac6e06b
      David Sterba authored
      There are hardcoded values in several checks regarding chunks and stripe
      constraints. We have that defined in the raid table and ought to use it.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ac6e06b
    • David Sterba's avatar
      btrfs: make btrfs_next_leaf static inline · 809d6902
      David Sterba authored
      btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it
      to header to avoid the function call overhead.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      809d6902
    • David Sterba's avatar
      btrfs: remove uptodate parameter from btrfs_dec_test_first_ordered_pending · f41b6ba9
      David Sterba authored
      In commit e65f152e ("btrfs: refactor how we finish ordered extent io
      for endio functions") there was last caller not using 1 for the uptodate
      parameter. Now there's only one, passing 1, so we can remove it and
      simplify the code.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f41b6ba9
    • David Sterba's avatar
      btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_ordered · 25c1252a
      David Sterba authored
      The uptodate parameter should be bool, change the type.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      25c1252a
    • Qu Wenruo's avatar
      btrfs: remove unused start and end parameters from btrfs_run_delalloc_range() · a129ffb8
      Qu Wenruo authored
      Since commit d75855b4 ("btrfs: Remove
      extent_io_ops::writepage_start_hook") removes the writepage_start_hook()
      and adds btrfs_writepage_cow_fixup() function, there is no need to
      follow the old hook parameters.
      
      Remove the @start and @end hook, since currently the fixup check is full
      page check, it doesn't need @start and @end hook.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a129ffb8
    • Marcos Paulo de Souza's avatar
      btrfs: introduce btrfs_lookup_match_dir · a7d1c5dc
      Marcos Paulo de Souza authored
      btrfs_search_slot is called in multiple places in dir-item.c to search
      for a dir entry, and then calling btrfs_match_dir_name to return a
      btrfs_dir_item.
      
      In order to reduce the number of callers of btrfs_search_slot, create a
      common function that looks for the dir key, and if found call
      btrfs_match_dir_item_name.
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a7d1c5dc
    • Marcos Paulo de Souza's avatar
      btrfs: remove unneeded return variable in btrfs_lookup_file_extent · f8ee80de
      Marcos Paulo de Souza authored
      We can return from btrfs_search_slot directly which also shows that it
      follows the same return value convention.
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f8ee80de
    • Marcos Paulo de Souza's avatar
      btrfs: use btrfs_next_leaf instead of btrfs_next_item when slots > nritems · ad9a9378
      Marcos Paulo de Souza authored
      After calling btrfs_search_slot is a common practice to check if the
      slot found isn't bigger than number of slots in the current leaf, and if
      so, search for the same key in the next leaf by calling btrfs_next_leaf,
      which calls btrfs_next_old_leaf to do the job.
      
      Calling btrfs_next_item in the same situation would end up in the same
      code flow, since
      
      * btrfs_next_item
        * btrfs_next_old_item
          * if slot >= nritems(curr_leaf)
            btrfs_next_old_leaf
      
      Change btrfs_verify_dev_extents and calculate_emulated_zone_size
      functions to use btrfs_next_leaf in the same situation.
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad9a9378
    • Filipe Manana's avatar
      btrfs: remove ignore_offset argument from btrfs_find_all_roots() · c7bcbb21
      Filipe Manana authored
      Currently all the callers of btrfs_find_all_roots() pass a value of false
      for its ignore_offset argument. This makes the argument pointless and we
      can remove it and make btrfs_find_all_roots() always pass false as the
      ignore_offset argument for btrfs_find_all_roots_safe(). So just do that.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c7bcbb21
    • Filipe Manana's avatar
      btrfs: avoid unnecessary lock and leaf splits when updating inode in the log · 2ac691d8
      Filipe Manana authored
      During a fast fsync, if we have already fsynced the file before and in the
      current transaction, we can make the inode item update more efficient and
      avoid acquiring a write lock on the leaf's parent.
      
      To update the inode item we are always using btrfs_insert_empty_item() to
      get a path pointing to the inode item, which calls btrfs_search_slot()
      with an "ins_len" argument of 'sizeof(struct btrfs_inode_item) +
      sizeof(struct btrfs_item)', and that always results in the search taking
      a write lock on the level 1 node that is the parent of the leaf that
      contains the inode item. This adds unnecessary lock contention on log
      trees when we have multiple fsyncs in parallel against inodes in the same
      subvolume, which has a very significant impact due to the fact that log
      trees are short lived and their height very rarely goes beyond level 2.
      
      Also, by using btrfs_insert_empty_item() when we need to update the inode
      item, we also end up splitting the leaf of the existing inode item when
      the leaf has an amount of free space smaller than the size of an inode
      item.
      
      Improve this by using btrfs_seach_slot(), with a 0 "ins_len" argument,
      when we know the inode item already exists in the log. This avoids these
      two inefficiencies.
      
      The following script, using fio, was used to perform the tests:
      
        $ cat fio-test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 4 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
          exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=randwrite
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
        echo "mount options: $MOUNT_OPTIONS"
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        fio /tmp/fio-job.ini
        umount $MNT
      
      The tests were done on a physical machine, with 12 cores, 64G of RAM,
      using a NVMEe device and using a non-debug kernel config (the default one
      from Debian). The summary line from fio is provided below for each test
      run.
      
      With 8 jobs, file size 256M, fsync frequency of 4 and a block size of 4K:
      
      Before: WRITE: bw=28.3MiB/s (29.7MB/s), 28.3MiB/s-28.3MiB/s (29.7MB/s-29.7MB/s), io=2048MiB (2147MB), run=72297-72297msec
      After:  WRITE: bw=28.7MiB/s (30.1MB/s), 28.7MiB/s-28.7MiB/s (30.1MB/s-30.1MB/s), io=2048MiB (2147MB), run=71411-71411msec
      
      +1.4% throughput, -1.2% runtime
      
      With 16 jobs, file size 256M, fsync frequency of 4 and a block size of 4K:
      
      Before: WRITE: bw=40.0MiB/s (42.0MB/s), 40.0MiB/s-40.0MiB/s (42.0MB/s-42.0MB/s), io=4096MiB (4295MB), run=99980-99980msec
      After:  WRITE: bw=40.9MiB/s (42.9MB/s), 40.9MiB/s-40.9MiB/s (42.9MB/s-42.9MB/s), io=4096MiB (4295MB), run=97933-97933msec
      
      +2.2% throughput, -2.1% runtime
      
      The changes are small but it's possible to be better on faster hardware as
      in the test machine used disk utilization was pretty much 100% during the
      whole time the tests were running (observed with 'iostat -xz 1').
      
      The tests also included the previous patch with the subject of:
      "btrfs: avoid unnecessary log mutex contention when syncing log".
      So they compared a branch without that patch and without this patch versus
      a branch with these two patches applied.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2ac691d8
    • Filipe Manana's avatar
      btrfs: remove unnecessary list head initialization when syncing log · e68107e5
      Filipe Manana authored
      One of the last steps of syncing the log is to remove all log contexts
      from the root's list of contexts, done at btrfs_remove_all_log_ctxs().
      There we iterate over all the contexts in the list and delete each one
      from the list, and after that we call INIT_LIST_HEAD() on the list. That
      is unnecessary since at that point the list is empty.
      
      So just remove the INIT_LIST_HEAD() call. It's not needed, increases code
      size (bloat-o-meter reported a delta of -122 for btrfs_sync_log() after
      this change) and increases two critical sections delimited by log mutexes.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e68107e5
    • Filipe Manana's avatar
      btrfs: avoid unnecessary log mutex contention when syncing log · e1a6d264
      Filipe Manana authored
      When syncing the log we acquire the root's log mutex just to update the
      root's last_log_commit. This is unnecessary because:
      
      1) At this point there can only be one task updating this value, which is
         the task committing the current log transaction. Any task that enters
         btrfs_sync_log() has to wait for the previous log transaction to commit
         and wait for the current log transaction to commit if someone else
         already started it (in this case it never reaches to the point of
         updating last_log_commit, as that is done by the committing task);
      
      2) All readers of the root's last_log_commit don't acquire the root's
         log mutex. This is to avoid blocking the readers, potentially for too
         long and because getting a stale value of last_log_commit does not
         cause any functional problem, in the worst case getting a stale value
         results in logging an inode unnecessarily. Plus it's actually very
         rare to get a stale value that results in unnecessarily logging the
         inode.
      
      So in order to avoid unnecessary contention on the root's log mutex,
      which is used for several different purposes, like starting/joining a
      log transaction and starting writeback of a log transaction, stop
      acquiring the log mutex for updating the root's last_log_commit.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e1a6d264
    • Filipe Manana's avatar
      btrfs: remove racy and unnecessary inode transaction update when using no-holes · cceaa89f
      Filipe Manana authored
      When using the NO_HOLES feature and expanding the size of an inode, we
      update the inode's last_trans, last_sub_trans and last_log_commit fields
      at maybe_insert_hole() so that a fsync does know that the inode needs to
      be logged (by making sure that btrfs_inode_in_log() returns false). This
      happens for expanding truncate operations, buffered writes, direct IO
      writes and when cloning extents to an offset greater than the inode's
      i_size.
      
      However the way we do it is racy, because in between setting the inode's
      last_sub_trans and last_log_commit fields, the log transaction ID that was
      assigned to last_sub_trans might be committed before we read the root's
      last_log_commit and assign that value to last_log_commit. If that happens
      it would make a future call to btrfs_inode_in_log() return true. This is
      a race that should be extremely unlikely to be hit in practice, and it is
      the same that was described by commit bc0939fc ("btrfs: fix race
      between marking inode needs to be logged and log syncing").
      
      The fix would simply be to set last_log_commit to the value we assigned
      to last_sub_trans minus 1, like it was done in that commit. However
      updating these two fields plus the last_trans field is pointless here
      because all the callers of btrfs_cont_expand() (which is the only
      caller of maybe_insert_hole()) always call btrfs_set_inode_last_trans()
      or btrfs_update_inode() after calling btrfs_cont_expand(). Calling either
      btrfs_set_inode_last_trans() or btrfs_update_inode() guarantees that the
      next fsync will log the inode, as it makes btrfs_inode_in_log() return
      false.
      
      So just remove the code that explicitly sets the inode's last_trans,
      last_sub_trans and last_log_commit fields.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cceaa89f
    • Filipe Manana's avatar
      btrfs: stop doing GFP_KERNEL memory allocations in the ref verify tool · 5a656c36
      Filipe Manana authored
      In commit 351cbf6e ("btrfs: use nofs allocations for running delayed
      items") we wrapped all btree updates when running delayed items with
      memalloc_nofs_save() and memalloc_nofs_restore(), due to a lock inversion
      detected by lockdep involving reclaim and the mutex of delayed nodes.
      
      The problem is because the ref verify tool does some memory allocations
      with GFP_KERNEL, which can trigger reclaim and reclaim can trigger inode
      eviction, which requires locking the mutex of an inode's delayed node.
      On the other hand the ref verify tool is called when allocating metadata
      extents as part of operations that modify a btree, which is a problem when
      running delayed nodes, where we do btree updates while holding the mutex
      of a delayed node. This is what caused the lockdep warning.
      
      Instead of wrapping every btree update when running delayed nodes, change
      the ref verify tool to never do GFP_KERNEL allocations, because:
      
      1) We get less repeated code, which at the moment does not even have a
         comment mentioning why we need to setup the NOFS context, which is a
         recommended good practice as mentioned at
         Documentation/core-api/gfp_mask-from-fs-io.rst
      
      2) The ref verify tool is something meant only for debugging and not
         something that should be enabled on non-debug / non-development
         kernels;
      
      3) We may have yet more places outside delayed-inode.c where we have
         similar problem: doing btree updates while holding some lock and
         then having the GFP_KERNEL memory allocations, from the ref verify
         tool, trigger reclaim and trying again to acquire the same lock
         through the reclaim path.
         Or we could get more such cases in the future, therefore this change
         prevents getting into similar cases when using the ref verify tool.
      
      Curiously most of the memory allocations done by the ref verify tool
      were already using GFP_NOFS, except a few ones for no apparent reason.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5a656c36
    • Filipe Manana's avatar
      btrfs: improve the batch insertion of delayed items · 506650dc
      Filipe Manana authored
      When we insert the delayed items of an inode, which corresponds to the
      directory index keys for a directory (key type BTRFS_DIR_INDEX_KEY), we
      do the following:
      
      1) Pick the first delayed item from the rbtree and insert it into the
         fs/subvolume btree, using btrfs_insert_empty_item() for that;
      
      2) Without releasing the path returned by btrfs_insert_empty_item(),
         keep collecting as many consecutive delayed items from the rbtree
         as possible, as long as each one's BTRFS_DIR_INDEX_KEY key is the
         immediate successor of the previously picked item and as long as
         they fit in the available space of the leaf the path points to;
      
      3) Then insert all the collected items into the leaf;
      
      4) Release the reserve metadata space for each collected item and
         release each item (implies deleting from the rbtree);
      
      5) Unlock the path.
      
      While this is much better than inserting items one by one, it can be
      improved in a few aspects:
      
      1) Instead of adding items based on the remaining free space of the
         leaf, collect as many items that can fit in a leaf and bulk insert
         them. This results in less and larger batches, reducing the total
         amount of time to insert the delayed items. For example when adding
         100K files to a directory, we ended up creating 1658 batches with
         very variable sizes ranging from 1 item to 118 items, on a filesystem
         with a node/leaf size of 16K. After this change, we end up with 839
         batches, with the vast majority of them having exactly 120 items;
      
      2) We do the search for more items to batch, by iterating the rbtree,
         while holding a write lock on the leaf;
      
      3) While still holding the leaf locked, we are releasing the reserved
         metadata for each item and then deleting each item, keeping a write
         lock on the leaf for longer than necessary. Releasing the delayed items
         one by one can take a significant amount of time, because deleting
         them from the rbtree can often be a bit slow when the deletion results
         in rebalancing the rbtree.
      
      So change this so that we try to create larger batches, with a total
      item size up to the maximum a leaf can support, and by unlocking the leaf
      immediately after inserting the items, releasing the reserved metadata
      space of each item and releasing each item without holding the write lock
      on the leaf.
      
      The following script that runs fs_mark was used to test this change:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-m single -d single"
        FILES=1000000
        THREADS=16
        FILE_SIZE=0
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t 16"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
      It was run on machine with 12 cores, 64G of ram, using a NVMe device and
      using a non-debug kernel config (Debian's default config).
      
      Results before this change:
      
      FSUse%        Count         Size    Files/sec         App Overhead
           1     16000000            0      76182.1             72223046
           3     32000000            0      62746.9             80776528
           5     48000000            0      77029.0             93022381
           6     64000000            0      73691.6             95251075
           8     80000000            0      66288.0             85089634
      
      Results after this change:
      
      FSUse%        Count         Size    Files/sec         App Overhead
           1     16000000            0      79049.5 (+3.7%)     69700824
           3     32000000            0      65248.9 (+3.9%)     80583693
           5     48000000            0      77991.4 (+1.2%)     90040908
           6     64000000            0      75096.8 (+1.9%)     89862241
           8     80000000            0      66926.8 (+1.0%)     84429169
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      506650dc
    • Qu Wenruo's avatar
      btrfs: rescue: allow ibadroots to skip bad extent tree when reading block group items · 2b29726c
      Qu Wenruo authored
      When extent tree gets corrupted, normally it's not extent tree root, but
      one toasted tree leaf/node.
      
      In that case, rescue=ibadroots mount option won't help as it can only
      handle the extent tree root corruption.
      
      This patch will enhance the behavior by:
      
      - Allow fill_dummy_bgs() to ignore -EEXIST error
      
        This means we may have some block group items read from disk, but
        then hit some error halfway.
      
      - Fallback to fill_dummy_bgs() if any error gets hit in
        btrfs_read_block_groups()
      
        Of course, this still needs rescue=ibadroots mount option.
      
      With that, rescue=ibadroots can handle extent tree corruption more
      gracefully and allow a better recover chance.
      Reported-by: default avatarZhenyu Wu <wuzy001@gmail.com>
      Link: https://www.spinics.net/lists/linux-btrfs/msg114424.htmlReviewed-by: default avatarSu Yue <l@damenly.su>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2b29726c
    • Marcos Paulo de Souza's avatar
      btrfs: pass NULL as trans to btrfs_search_slot if we only want to search · 6534c0c9
      Marcos Paulo de Souza authored
      Using a transaction in btrfs_search_slot is only useful when we are
      searching to add or modify the tree. When the function is used for
      searching, insert length and mod arguments are 0, there is no need to
      use a transaction.
      
      No functional changes, changing for consistency.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6534c0c9
    • Filipe Manana's avatar
      btrfs: continue readahead of siblings even if target node is in memory · 069a2e37
      Filipe Manana authored
      At reada_for_search(), when attempting to readahead a node or leaf's
      siblings, we skip the readahead of the siblings if the node/leaf is
      already in memory. That is probably fine for the READA_FORWARD and
      READA_BACK readahead types, as they are used on contexts where we
      end up reading some consecutive leaves, but usually not the whole btree.
      
      However for a READA_FORWARD_ALWAYS mode, currently only used for full
      send operations, it does not make sense to skip the readahead if the
      target node or leaf is already loaded in memory, since we know the caller
      is visiting every node and leaf of the btree in ascending order.
      
      So change the behaviour to not skip the readahead when the target node is
      already in memory and the readahead mode is READA_FORWARD_ALWAYS.
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk, with 32GiB of RAM and
      using a non-debug kernel config (Debian's default config).
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        file_count=2000000
        add_files $file_count 0 4
      
        echo
        echo "Creating snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
      The duration of the full send operations, in seconds, were the following:
      
      Before this change:  85 seconds
      After this change:   76 seconds (-11.2%)
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      069a2e37
    • David Sterba's avatar
      btrfs: check-integrity: drop kmap/kunmap for block pages · 5da38479
      David Sterba authored
      The pages in block_ctx have never been allocated from highmem (in
      btrfsic_read_block) so the mapping is pointless and can be removed.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5da38479
    • David Sterba's avatar
      btrfs: compression: drop kmap/kunmap from generic helpers · 4c2bf276
      David Sterba authored
      The pages in compressed_pages are not from highmem anymore so we can
      drop the mapping for checksum calculation and inline extent.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c2bf276
    • David Sterba's avatar
      btrfs: compression: drop kmap/kunmap from zstd · bbaf9715
      David Sterba authored
      As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is
      simply page_address and kunmap is a no-op.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbaf9715
    • David Sterba's avatar
      btrfs: compression: drop kmap/kunmap from zlib · 696ab562
      David Sterba authored
      As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is
      simply page_address and kunmap is a no-op.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      696ab562
    • David Sterba's avatar
      btrfs: compression: drop kmap/kunmap from lzo · 8c945d32
      David Sterba authored
      As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is
      simply page_address and kunmap is a no-op.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8c945d32
    • David Sterba's avatar
      btrfs: drop from __GFP_HIGHMEM all allocations · b0ee5e1e
      David Sterba authored
      The highmem flag is used for allocating pages for compression and for
      raid56 pages. The high memory makes sense on 32bit systems but is not
      without problems. On 64bit system's it's just another layer of wrappers.
      
      The time the pages are allocated for compression or raid56 is relatively
      short (about a transaction commit), so the pages are not blocked
      indefinitely. As the number of pages depends on the amount of data being
      written/read, there's a theoretical problem. A fast device on a 32bit
      system could use most of the low memory pool, while with the highmem
      allocation that would not happen. This was possibly the original idea
      long time ago, but nowadays we optimize for 64bit systems.
      
      This patch removes all usage of the __GFP_HIGHMEM flag for page
      allocation, the kmap/kunmap are still in place and will be removed in
      followup patches. Remaining is masking out the bit in
      alloc_extent_state and __lookup_free_space_inode, that can safely stay.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b0ee5e1e
    • Anand Jain's avatar
      btrfs: cleanup fs_devices pointer usage in btrfs_trim_fs · 23608d51
      Anand Jain authored
      Drop variable 'devices' (used only once) and add new variable for
      the fs_devices, so it is used at two locations within btrfs_trim_fs()
      function and also helps to access fs_devices->devices.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      23608d51
    • Marcos Paulo de Souza's avatar
      btrfs: remove max argument from generic_bin_search · 67d5e289
      Marcos Paulo de Souza authored
      Both callers use btrfs_header_nritems to feed the max argument.  Remove
      the argument and let generic_bin_search call it itself.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      67d5e289