26 Jun, 2023 (40 commits)
-
Theodore Ts'o authored
This was reported by a user who noticed that the mtime of a file backing a loopback device was getting bumped when the loopback device is mounted read-only. Note: this doesn't show up when doing a loopback mount of a file directly, via "mount -o ro /tmp/foo.img /mnt", since the loop device is set read-only when mount automatically creates the loop device. However, it is noticeable for a LUKS loop device like this:

% cryptsetup luksOpen /tmp/foo.img test
% mount -o ro /dev/loop0 /mnt ; umount /mnt

or, if LUKS is not in use, if the user manually creates the loop device like this:

% losetup /dev/loop0 /tmp/foo.img
% mount -o ro /dev/loop0 /mnt ; umount /mnt

The modified mtime causes rsync to do a rolling checksum scan of the file on the local and remote side, incrementally increasing the time to rsync the not-modified-but-touched image file.

Fixes: eee00237 ("ext4: commit super block if fs record error when journal record without error")
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/ZIauBR7YiV3rVAHL@glitch
Reported-by: Sean Greenslade <sean@seangreenslade.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
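As a rough illustration of the kind of guard involved, the sketch below simply skips writing the superblock when the file system is mounted read-only; the wrapper function name is hypothetical, while sb_rdonly() and ext4_commit_super() are existing kernel helpers.

	/* Hypothetical helper sketching the idea: never dirty the on-disk
	 * superblock (and hence the backing file of a loop device) when the
	 * file system is mounted read-only.
	 */
	static void ext4_maybe_commit_super(struct super_block *sb)
	{
		if (sb_rdonly(sb))	/* r/o mount: nothing should be written */
			return;
		ext4_commit_super(sb);	/* existing helper that writes the sb */
	}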
-
Zhang Yi authored
We got the NULL pointer dereference below while running the generic/475 I/O failure pressure test.

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 1 PID: 15600 Comm: fsstress Not tainted 6.4.0-rc5-xfstests-00055-gd3ab1bca26b4 #190
RIP: 0010:jbd2_journal_set_features+0x13d/0x430
...
Call Trace:
 <TASK>
 ? __die+0x23/0x60
 ? page_fault_oops+0xa4/0x170
 ? exc_page_fault+0x67/0x170
 ? asm_exc_page_fault+0x26/0x30
 ? jbd2_journal_set_features+0x13d/0x430
 jbd2_journal_revoke+0x47/0x1e0
 __ext4_forget+0xc3/0x1b0
 ext4_free_blocks+0x214/0x2f0
 ext4_free_branches+0xeb/0x270
 ext4_ind_truncate+0x2bf/0x320
 ext4_truncate+0x1e4/0x490
 ext4_handle_inode_extension+0x1bd/0x2a0
 ? iomap_dio_complete+0xaf/0x1d0

The root cause is that the journal superblock had failed to be written out due to I/O fault injection; its uptodate bit was cleared by end_buffer_write_sync() and had not yet been reset in jbd2_write_superblock(). This raced with journal_get_superblock()->bh_read(), and unfortunately the read I/O also failed, so the error handling in journal_fail_superblock() unexpectedly cleared journal->j_sb_buffer, finally leading to the NULL pointer dereference above. If the journal superblock has already been read and verified, there is no need to call bh_read() to read it again, even if it has since failed to be written out. So the fix is simply to move the buffer_verified(bh) check in front of bh_read(). Also remove a stale comment left in jbd2_journal_check_used_features().

Fixes: 51bacdba23d8 ("jbd2: factor out journal initialization from journal_get_superblock()")
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230616015547.3155195-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
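A minimal sketch of the reordering described above, inside journal_get_superblock(); buffer_verified() and bh_read() are the real helpers, the surrounding error handling is abbreviated.

	struct buffer_head *bh = journal->j_sb_buffer;
	int err;

	/* If the superblock was already read and verified, skip re-reading
	 * it: a failed writeback may have cleared the uptodate bit, and a
	 * failed re-read would tear down journal->j_sb_buffer underneath us.
	 */
	if (!buffer_verified(bh)) {
		err = bh_read(bh, 0);	/* (re)read the journal superblock */
		if (err < 0) {
			printk(KERN_ERR "JBD2: IO error reading journal superblock\n");
			goto out;
		}
	}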
-
Chao Yu authored
freeze_bdev() can fail for a number of reasons, so its return value needs to be checked before proceeding.

Fixes: 783d9485 ("ext4: add EXT4_IOC_GOINGDOWN ioctl")
Cc: stable@kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230606073203.1310389-1-chao@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
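A minimal sketch of the check, roughly as it would sit in ext4's shutdown ioctl path; the surrounding context is abbreviated and assumed, freeze_bdev()/thaw_bdev() are the real block-layer helpers.

	int ret;

	/* freeze_bdev() can fail (e.g. racing unmount, already-frozen fs,
	 * memory pressure), so propagate the error instead of assuming
	 * the freeze succeeded.
	 */
	ret = freeze_bdev(sb->s_bdev);
	if (ret)
		return ret;
	/* ... mark the file system as shut down ... */
	thaw_bdev(sb->s_bdev);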
-
Baokun Li authored
Rename ext4_quota_off_umount() to ext4_quotas_off(), and add a type parameter to replace the open-coded logic in ext4_enable_quotas().

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230327141630.156875-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Baokun Li authored
Yi found during a review of the patch "ext4: don't BUG on inconsistent journal feature" that when ext4_mark_recovery_complete() returns an error value, the error handling path does not turn off the enabled quotas, which triggers the following kmemleak:

================================================================
unreferenced object 0xffff8cf68678e7c0 (size 64):
  comm "mount", pid 746, jiffies 4294871231 (age 11.540s)
  hex dump (first 32 bytes):
    00 90 ef 82 f6 8c ff ff 00 00 00 00 41 01 00 00  ............A...
    c7 00 00 00 bd 00 00 00 0a 00 00 00 48 00 00 00  ............H...
  backtrace:
    [<00000000c561ef24>] __kmem_cache_alloc_node+0x4d4/0x880
    [<00000000d4e621d7>] kmalloc_trace+0x39/0x140
    [<00000000837eee74>] v2_read_file_info+0x18a/0x3a0
    [<0000000088f6c877>] dquot_load_quota_sb+0x2ed/0x770
    [<00000000340a4782>] dquot_load_quota_inode+0xc6/0x1c0
    [<0000000089a18bd5>] ext4_enable_quotas+0x17e/0x3a0 [ext4]
    [<000000003a0268fa>] __ext4_fill_super+0x3448/0x3910 [ext4]
    [<00000000b0f2a8a8>] ext4_fill_super+0x13d/0x340 [ext4]
    [<000000004a9489c4>] get_tree_bdev+0x1dc/0x370
    [<000000006e723bf1>] ext4_get_tree+0x1d/0x30 [ext4]
    [<00000000c7cb663d>] vfs_get_tree+0x31/0x160
    [<00000000320e1bed>] do_new_mount+0x1d5/0x480
    [<00000000c074654c>] path_mount+0x22e/0xbe0
    [<0000000003e97a8e>] do_mount+0x95/0xc0
    [<000000002f3d3736>] __x64_sys_mount+0xc4/0x160
    [<0000000027d2140c>] do_syscall_64+0x3f/0x90
================================================================

To solve this problem, we add a "failed_mount10" tag, and call ext4_quota_off_umount() under this tag to release the enabled quotas.

Fixes: 11215630 ("ext4: don't BUG on inconsistent journal feature")
Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230327141630.156875-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
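A rough sketch of the error-path shape described above; only failed_mount10 and ext4_quota_off_umount() come from the description, the label ordering and remaining cleanup steps are abbreviated assumptions.

	err = ext4_mark_recovery_complete(sb, es);
	if (err)
		goto failed_mount10;	/* quotas are already enabled here */
	/* ... rest of __ext4_fill_super() ... */

failed_mount10:
	ext4_quota_off_umount(sb);	/* release the enabled quotas */
failed_mount9:
	/* previously existing teardown continues from here */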
-
Zhang Yi authored
Update the documentation's description of the s_head field in the journal superblock.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20230322013353.1843306-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Zhang Yi authored
Always enable the 'JBD2_CYCLE_RECORD' journal option on ext4, letting jbd2 continue to record new journal transactions from the recovered journal head or from the checkpointed transactions of the previous mount.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230322013353.1843306-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Zhang Yi authored
For a newly mounted file system, the journal committing thread always records new transactions from the start of the journal area, regardless of whether the journal was clean or has just been recovered. So the logdump code in debugfs cannot dump continuous logs between mounts, which is a disadvantage when analyzing a corrupted file system image and locating file system inconsistency bugs.

If we get a corrupted file system on a running product and want to find out what has happened, besides looking at the system log, one effective way is to backtrack the journal log. But we may not always run e2fsck before each mount, and the default 'fsck -a' mode also cannot always catch all inconsistencies, so some inconsistencies can be carried over into the next mount until we detect them. By then, the transactions in the journal may be discontinuous and some relatively new transactions may already have been overwritten, which makes them hard to analyse. If we could record transactions continuously between mounts, we could acquire more useful information from the journal, like this:

 |Previous mount checkpointed/recovered logs|Current mount logs         |
 |{------}{---}{--------} ... {------}| ... |{======}{========}...000000|

Yes, the journal area is limited and cannot record everything; the problematic transaction may still be overwritten even if we do this, but it is still useful for fuzz tests and short-running products.

This patch saves the head blocknr in the superblock after flushing the journal or unmounting the file system, so that the next mount can continue to record new transactions behind it. This change is backward compatible because an old kernel does not care about the head blocknr of the journal. It is also fine if we mount a clean old image without a valid head blocknr; we fall back to setting it to s_first, just like before. Finally, for the case of mounting an unclean file system, we can also get the journal head easily after scanning/replaying the journal, and it will continue to record new transactions after the recovered ones.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230322013353.1843306-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Zhang Yi authored
journal->j_format_version is no longer used, remove it.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-7-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Zhang Yi authored
The current journal_get_superblock() couples journal superblock checking with partial journal initialization; factor the initialization part out of it to make things clearer.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-6-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Zhang Yi authored
We should only check and set extended features if the journal format version is 2. Currently we check the in-memory copy of the format version, 'journal->j_format_version', which relies on the field's initialization sequence; switching to the h_blocktype in the on-disk superblock is clearer.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-5-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
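A minimal sketch of checking the on-disk block type rather than the cached format version; the helper name is illustrative, while JBD2_SUPERBLOCK_V2 and the s_header.h_blocktype field are the real on-disk definitions.

	/* Feature fields only exist in a v2 journal superblock. */
	static int jbd2_format_support_feature(journal_t *journal)
	{
		return journal->j_superblock->s_header.h_blocktype ==
				cpu_to_be32(JBD2_SUPERBLOCK_V2);
	}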
-
Zhang Yi authored
The JBD2_HAS_[IN|RO_]COMPAT_FEATURE macros are no longer used, just remove them.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-4-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Zhihao Cheng authored
As discussed in [1], 'sbi->s_journal_bdev != sb->s_bdev' is always true if sbi->s_journal_bdev exists. The filesystem block device and the journal block device are both opened with 'FMODE_EXCL' mode, so these two devices can't be the same one. Therefore we can remove the redundant 'sbi->s_journal_bdev != sb->s_bdev' check when 'sbi->s_journal_bdev' exists.

[1] https://lore.kernel.org/lkml/f86584f6-3877-ff18-47a1-2efaa12d18b2@huawei.com/

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-3-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Zhihao Cheng authored
The following process makes ext4 load stale buffer heads from a previously failed mount during a new mount operation:

mount_bdev
 ext4_fill_super
 | ext4_load_and_init_journal
 |  ext4_load_journal
 |   jbd2_journal_load
 |    load_superblock
 |     journal_get_superblock
 |      set_buffer_verified(bh) // buffer head is verified
 |   jbd2_journal_recover // failed caused by EIO
 | goto failed_mount3a // skip 'sb->s_root' initialization
 deactivate_locked_super
  kill_block_super
   generic_shutdown_super
    if (sb->s_root)
    // false, skip ext4_put_super->invalidate_bdev->
    // invalidate_mapping_pages->mapping_evict_folio->
    // filemap_release_folio->try_to_free_buffers, which
    // cannot drop buffer head.
   blkdev_put
    blkdev_put_whole
     if (atomic_dec_and_test(&bdev->bd_openers))
     // false, systemd-udev happens to open the device. Then
     // blkdev_flush_mapping->kill_bdev->truncate_inode_pages->
     // truncate_inode_folio->truncate_cleanup_folio->
     // folio_invalidate->block_invalidate_folio->
     // filemap_release_folio->try_to_free_buffers will be skipped,
     // dropping buffer head is missed again.

Second mount:
ext4_fill_super
 ext4_load_and_init_journal
  ext4_load_journal
   ext4_get_journal
    jbd2_journal_init_inode
     journal_init_common
      bh = getblk_unmovable
      bh = __find_get_block // Found stale bh from the last failed mount
      journal->j_sb_buffer = bh
   jbd2_journal_load
    load_superblock
     journal_get_superblock
      if (buffer_verified(bh))
      // true, skip journal->j_format_version = 2, value stays 0
    jbd2_journal_recover
     do_one_pass
      next_log_block += count_tags(journal, bh)
      // According to journal_tag_bytes(), the 'tag_bytes' calculation is
      // affected by jbd2_has_feature_csum3(); jbd2_has_feature_csum3()
      // returns false because 'j->j_format_version >= 2' is not true,
      // so we get a wrong next_log_block. do_one_pass may then exit
      // early when it encounters a block without JBD2_MAGIC_NUMBER at
      // 'next_log_block'.

The filesystem is corrupted here: the journal is only partially replayed, and the new journal sequence number is actually already in use by the last mount.

invalidate_bdev() can drop all buffer heads even while racing with a bare read of the block device (e.g. by systemd-udev), so we can fix this by invalidating the bdev in the error handling path of __ext4_fill_super(). A reproducer can be found at [Link].

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217171
Fixes: 25ed6e8a ("jbd2: enable journal clients to enable v2 checksumming")
Cc: stable@vger.kernel.org # v3.5
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-2-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
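A minimal sketch of the cleanup the fix adds to the __ext4_fill_super() error path; invalidate_bdev() is the real helper, labels and the rest of the teardown are abbreviated.

failed_mount:
	/* ... existing teardown ... */

	/* Drop every buffer head read during this failed mount (including
	 * the journal superblock already marked buffer_verified), so a
	 * later mount cannot pick up stale buffers from the block device's
	 * page cache.
	 */
	invalidate_bdev(sb->s_bdev);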
-
Brian Foster authored
We've had reports of significant performance regressions for sub-block (unaligned) direct writes due to the added exclusivity restrictions in ext4. The purpose of the exclusivity requirement for unaligned direct writes is to avoid data corruption caused by unserialized partial block zeroing in the iomap dio layer across overlapping writes.

XFS has similar requirements for the same underlying reasons, yet doesn't suffer the extreme performance regression that ext4 does. The reason for this is that XFS utilizes IOMAP_DIO_OVERWRITE_ONLY mode, which allows for optimistic submission of concurrent unaligned I/O and kicks back writes that require partial block zeroing such that they can be submitted in a safe, exclusive context. Since ext4 already performs most of these checks pre-submission, it can support something similar without necessarily relying on the iomap flag and associated retry mechanism.

Update the dio write submission path to allow concurrent submission of unaligned direct writes that are purely overwrites and so will not require block zeroing. To improve readability of the various related checks, move the unaligned I/O handling down into ext4_dio_write_checks(), where the dio draining and force wait logic can immediately follow the locking requirement checks. Finally, the IOMAP_DIO_OVERWRITE_ONLY flag is set to enable a warning check as a precaution should the ext4 overwrite logic ever become inconsistent with the zeroing expectations of iomap dio.

The performance improvement of sub-block direct write I/O is shown in the following fio test on a 64xcpu guest vm:

Test: fio --name=test --ioengine=libaio --direct=1 --group_reporting \
	--overwrite=1 --thread --size=10G --filename=/mnt/fio \
	--readwrite=write --ramp_time=10s --runtime=60s --numjobs=8 \
	--blocksize=2k --iodepth=256 --allow_file_create=0

v6.2:           write: IOPS=4328, BW=8724KiB/s
v6.2 (patched): write: IOPS=801k, BW=1565MiB/s

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230314130759.642710-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
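A rough sketch of the submission decision described above; the local variable names are illustrative and the real checks in ext4_dio_write_checks() are more involved, but IOMAP_DIO_OVERWRITE_ONLY is the real iomap flag.

	unsigned int dio_flags = 0;

	if (unaligned_io && !overwrite) {
		/* Partial-block zeroing may be needed: fall back to the
		 * exclusive path (exclusive inode lock + inode_dio_wait()).
		 */
	} else if (unaligned_io) {
		/* Pure overwrite: no zeroing required, so allow concurrent
		 * submission, but ask iomap to warn if it would ever have
		 * to zero after all.
		 */
		dio_flags |= IOMAP_DIO_OVERWRITE_ONLY;
	}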
-
Theodore Ts'o authored
Line wrap and slightly clarify the comments describing mballoc's criteria. Define EXT4_MB_NUM_CRS as part of the enum, so that it will automatically get updated when criteria are added or removed. Also fix a potential uninitialized use of the 'cr' variable if CONFIG_EXT4_DEBUG is enabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Baokun Li authored
Now that ext4_es_insert_extent() returns void, the return value of ext4_zeroout_es() is also unnecessary, so make it return void too.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-13-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Baokun Li authored
Now ext4_es_insert_extent() never returns an error, so make it return void.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-12-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Baokun Li authored
Inserting a delayed extent can no longer fail, so the return value of ext4_es_insert_delayed_block() is no longer necessary; make it return void.

[ Fixed bug which caused system hangs during bigalloc test runs. See https://lore.kernel.org/r/20230612030405.GH1436857@mit.edu for more details. -- TYT ]

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-11-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Baokun Li authored
Now ext4_es_remove_extent() never fails, so make it return void.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-10-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Baokun Li authored
Similar to ext4_es_insert_delayed_block(), we use preallocations that do not fail to avoid inconsistencies, but we do not care about extent_status entries that do not have to be kept, and we return 0 even if memory allocation for such an entry fails.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Baokun Li authored
Similar to ext4_es_remove_extent(), we use a no-fail preallocation to avoid inconsistencies, except that here we may have to preallocate two extent_status entries.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Baokun Li authored
If __es_remove_extent() returns an error, it means that while splitting an extent, allocating an extent that must be kept failed, and returning the error directly would leave the extent status tree inconsistent. So we use GFP_NOFAIL to pre-allocate an extent_status and pass it to __es_remove_extent() to avoid this problem. In addition, since the allocation happens outside the i_es_lock, the extent status tree may change and the pre-allocated extent_status may no longer be needed, so we release the pre-allocated extent_status when es->es_len is not initialized.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Baokun Li authored
When splitting an extent, if the second extent cannot be dropped, we return -ENOMEM and use GFP_NOFAIL to preallocate an extent_status outside of i_es_lock and pass it to __es_remove_extent() to be used as the second extent. This ensures that __es_remove_extent() executes successfully, thus keeping the extent status tree consistent. If the second extent is droppable, we simply drop it and return 0. The retry is then no longer necessary, so remove it.

Now __es_remove_extent() will always remove what it should, and possibly more.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Baokun Li authored
Pass an extent_status pointer, prealloc, to __es_insert_extent(). If the pointer is non-null, it is used directly when a new extent_status is needed, to avoid memory allocation failures.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Baokun Li authored
Factor out __es_alloc_extent() and __es_free_extent(), which only allocate and free an extent_status, respectively. The ext4_es_alloc_extent() function is split into __es_alloc_extent() and ext4_es_init_extent(). In __es_alloc_extent() we allocate memory using GFP_KERNEL | __GFP_NOFAIL | __GFP_ZERO if the memory allocation cannot be allowed to fail; otherwise we use GFP_ATOMIC. ext4_es_init_extent() is then used to initialize the extent_status and update related variables after a successful allocation. This prepares for the use of pre-allocated extent_status later.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
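A minimal sketch of the allocation split described above, close to the description; ext4_es_cachep is assumed to be the existing extent_status slab cache, and kmem_cache_zalloc() supplies the __GFP_ZERO part of the no-fail allocation.

	static struct extent_status *__es_alloc_extent(bool nofail)
	{
		if (!nofail)
			return kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);

		/* Callers that must not fail (to keep the tree consistent)
		 * pre-allocate outside of i_es_lock with __GFP_NOFAIL.
		 */
		return kmem_cache_zalloc(ext4_es_cachep,
					 GFP_KERNEL | __GFP_NOFAIL);
	}

	static void __es_free_extent(struct extent_status *es)
	{
		kmem_cache_free(ext4_es_cachep, es);
	}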
-
Baokun Li authored
In the extent status tree, we have extents which we can just drop without issues and extents we must not drop - this depends on the extent's status: currently ext4_es_is_delayed() extents must stay, others may be dropped. Add a helper function to determine whether the current extent can be dropped, although only ext4_es_is_delayed() extents cannot be dropped currently.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Baokun Li authored
In our fault injection test, we create an ext4 file, migrate it to a non-extent based file, then punch a hole and finally trigger a WARN_ON in ext4_da_update_reserve_space():

EXT4-fs warning (device sda): ext4_da_update_reserve_space:369: ino 14, used 11 with only 10 reserved data blocks

When writing back a non-extent based file, if delalloc is enabled, the number of reserved blocks is subtracted from the number of blocks mapped by ext4_ind_map_blocks(), and the extent status tree is updated. We update the extent status tree by first removing the old extent_status and then inserting the new extent_status. If the block range we remove happens to be in an extent, then we need to allocate another extent_status with ext4_es_alloc_extent():

       use old    to remove    to add new
   |----------|------------|------------|
             old extent_status

The problem is that the allocation of the new extent_status failed due to fault injection, and __es_shrink() did not free up memory, resulting in a return of -ENOMEM. do_writepages() then retries after receiving -ENOMEM, we map the same extent again, and the number of reserved blocks is again subtracted from the number of blocks in that extent. Since the blocks in the same extent are subtracted twice, we end up triggering the WARN_ON at ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.

For a non-extent based file, we update the number of reserved blocks after ext4_ind_map_blocks() is executed, which causes a problem: ext4_ind_map_blocks() does not always create a block when called, but we always reduce the number of reserved blocks. So we move the logic for updating reserved blocks into ext4_ind_map_blocks() to ensure that the number of reserved blocks is updated only after we do succeed in allocating some new blocks.

Fixes: 5f634d06 ("ext4: Fix quota accounting error with fallocate")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
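A rough sketch of the relocated accounting inside ext4_ind_map_blocks(); the exact condition is an approximation based on the description, with EXT4_GET_BLOCKS_DELALLOC_RESERVE and ext4_da_update_reserve_space() being existing helpers.

	/* Credit the delalloc reservation only once blocks were actually
	 * allocated, so a retried mapping cannot subtract the same
	 * extent's reservation twice.
	 */
	if ((flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) && count > 0)
		ext4_da_update_reserve_space(inode, count, 1);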
-
Ojaswin Mujoo authored
mballoc criteria have historically been referred to by numbers like CR0, CR1...; however, this makes it confusing to understand what each criterion is about. Change these criteria from numbers to symbolic names and add relevant comments. While we are at it, also reformat and add some comments to ext4_seq_mb_stats_show() for better readability.

Additionally, define CR_FAST, which signifies the criteria below which we can make quicker decisions like:

* quitting early if (free blocks < requested len)
* avoiding scanning free extents smaller than the required len
* avoiding initializing the buddy cache and working with the existing cache
* limiting prefetches

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/a2dc6ec5aea5e5e68cf8e788c2a964ffead9c8b0.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Ojaswin Mujoo authored
CR1_5 aims to optimize allocations which can't be satisfied in CR1. The fact that we couldn't find a group in CR1 suggests that it would be difficult to find a continuous extent to completely satisfy our allocation. So before falling back to the slower CR2, in CR1.5 we proactively trim the preallocations so we can find a group with (free / fragments) big enough. This speeds up our allocation at the cost of slightly reduced preallocation.

The patch also adds a new sysfs tunable:

* /sys/fs/ext4/<partition>/mb_cr1_5_max_trim_order

This controls how much CR1.5 can trim a request before falling back to CR2. For example, for a request of order 7 and max trim order 2, CR1.5 can trim this request down to order 5.

Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/150fdf65c8e4cc4dba71e020ce0859bcf636a5ff.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Ojaswin Mujoo authored
Make the logic for searching the average fragment list of a given order reusable by abstracting it out into a different function. This will also avoid code duplication in upcoming patches. No functional changes.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/028c11d95b17ce0285f45456709a0ca922df1b83.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Ojaswin Mujoo authored
Before this patch, the call stack in ext4_run_li_request was as follows:

/*
 * nr = no. of BGs we want to fetch (=s_mb_prefetch)
 * prefetch_ios = no. of BGs not uptodate after
 *                ext4_read_block_bitmap_nowait()
 */
next_group = ext4_mb_prefetch(sb, group, nr, prefetch_ios);
ext4_mb_prefetch_fini(sb, next_group, prefetch_ios);

ext4_mb_prefetch_fini() will only try to initialize buddies for BGs in the range [next_group - prefetch_ios, next_group). This is incorrect since sometimes (prefetch_ios < nr), which causes ext4_mb_prefetch_fini() to incorrectly ignore some of the BGs that might need initialization. This issue is more notable now with the previous patch enabling "fetching" of BLOCK_UNINIT BGs, which are marked buffer_uptodate by default.

Fix this by passing nr to ext4_mb_prefetch_fini() instead of prefetch_ios so that it considers the right range of groups. Similarly, make sure we don't pass nr=0 to ext4_mb_prefetch_fini() in ext4_mb_regular_allocator(), since we might have prefetched BLOCK_UNINIT groups that would need buddy initialization.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/05e648ae04ec5b754207032823e9c1de9a54f87a.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Ojaswin Mujoo authored
Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO. As a consequence, we end up not initializing the buddy structures and CR0/1 lists for these BGs, even though this can be done without any disk IO overhead. Hence, don't skip such BGs during prefetch and prefetch_fini.

This improves the accuracy of CR0/1 allocation as, earlier, essentially empty BLOCK_UNINIT groups could be ignored by CR0/1 due to their buddy not being initialized, leading to slower CR2 allocations. With this patch CR0/1 will be able to discover these groups as well, thus improving performance.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/dc3130b8daf45ffe63d8a3c1edcf00eb8ba70e1f.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Ojaswin Mujoo authored
When we are inside ext4_mb_complex_scan_group() in CR1, we can be sure that this group has at least one big enough continuous free extent to satisfy our request because (free / fragments) > goal length. Hence, instead of wasting time looping over smaller free extents, only consider a free extent if we are sure that it has enough continuous free space to satisfy the goal length. This is particularly useful when scanning highly fragmented BGs in CR1 as, without this patch, the allocator might stop scanning early before reaching the big enough free extent (due to ac_found > mb_max_to_scan), which causes us to unnecessarily trim the request.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/a5473df4517c53ec940bc9b603ef83a547032a32.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
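A minimal sketch of the extra check in the CR1 complex-scan loop; the structure is simplified, the free-extent length would come from mb_find_extent() in the real code, and the field names are only approximate.

	/* In CR1, (free / fragments) >= goal, so a big enough extent exists
	 * somewhere in this group; do not burn the mb_max_to_scan budget on
	 * extents that are too small to satisfy the goal length anyway.
	 */
	if (ac->ac_criteria == CR1 && ex.fe_len < ac->ac_g_ex.fe_len)
		continue;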
-
Ojaswin Mujoo authored
Track the number of allocations where the length of blocks allocated is equal to the length of goal blocks (post normalization). This metric could be useful when making changes to the allocator logic in the future, as it gives us visibility into how often we trim our requests.

PS: ac_b_ex.fe_len might get modified due to preallocation efforts, hence we use ac_f_ex.fe_len instead, since we want to compare how much the allocator was actually able to find.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/343620e2be8a237239ea2613a7a866ee8607e973.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Ojaswin Mujoo authored
This gives better visibility into the number of extents scanned in each particular CR. For example, this information can be used to see how our block group scanning logic is performing when the BG is fragmented.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/55bb6d80f6e22ed2a5a830aa045572bdffc8b1b9.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Ojaswin Mujoo authored
Convert the allocation criteria to an enum so it is easier to maintain, and update the tracefiles to use the enum names. This change also makes it easier to insert new criteria in the future. There is no functional change in this patch.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/5d82fd467bdf70ea45bdaef810af3b146013946c.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
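A sketch of the shape such an enum might take; the member names and comments are illustrative, drawn from the CR0/CR1/CR1_5/CR2 naming used elsewhere in this series, with EXT4_MB_NUM_CRS being the trailing sentinel mentioned in a later commit above.

	enum criteria {
		CR0,		/* power-of-two aligned scan */
		CR1,		/* groups with (free / fragments) >= goal length */
		CR1_5,		/* like CR1, but with a trimmed goal length */
		CR2,		/* slower per-extent scan */
		CR3,		/* last resort, accept anything free */
		EXT4_MB_NUM_CRS	/* number of criteria, must stay last */
	};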
-
Ritesh Harjani authored
ext4_mb_stats and ext4_mb_max_to_scan are never used; we use sbi->s_mb_stats and sbi->s_mb_max_to_scan instead. Hence kill these extern declarations.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/928b3142062172533b6d1b5a94de94700590fef3.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Ritesh Harjani authored
There will be changes coming in future patches which will introduce a new criterion for block allocation. This removes the useless setting of ac_criteria. AFAIU, this might only be used to differentiate between whether preallocated blocks were used or the regular allocator was called for allocating blocks. Hence this also adds debug prints to identify what type of block allocation was done in ext4_mb_show_ac().

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1dbae05617519cb6202f1b299c9d1be3e7cda763.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Kemeng Shi authored
ext4_free_blocks_simple() needs the count in clusters, while ext4_free_blocks() accepts the count in blocks. Convert the count to clusters to fix the mismatch.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: stable@kernel.org
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-12-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
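A minimal sketch of the conversion at the ext4_free_blocks() call site; EXT4_NUM_B2C() is the existing block-to-cluster helper, while the surrounding fast-commit-replay condition is an assumption about where the call sits.

	/* ext4_free_blocks_simple() expects a count in clusters. */
	if (sbi->s_mount_state & EXT4_FC_REPLAY) {
		ext4_free_blocks_simple(inode, block,
					EXT4_NUM_B2C(sbi, count));
		return;
	}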
-