1. 01 Oct, 2022 18 commits
  2. 30 Sep, 2022 8 commits
  3. 29 Sep, 2022 4 commits
  4. 27 Sep, 2022 2 commits
    • ext4: minor defrag code improvements · d412df53
      Eric Whitney authored
      Modify the error returns for two file types that can't be defragged to
      more clearly communicate those restrictions to a caller.  When the
      defrag code is applied to swap files, return -ETXTBSY, and when applied
      to quota files, return -EOPNOTSUPP.  Move an extent tree search whose
      results are only occasionally required to the site always requiring them
      for improved efficiency.  Address a few typos.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Link: https://lore.kernel.org/r/20220722163910.268564-1-enwlinux@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
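As a rough illustration of the error returns described in "ext4: minor defrag code improvements", the sketch below maps a file type to the errno named in the commit message. The `file_kind` enum and `defrag_precheck()` are hypothetical names made up for this example; they are not the kernel's own code.

```c
#include <errno.h>

/* Hypothetical file classification, used only for this sketch. */
enum file_kind { FILE_REGULAR, FILE_SWAP, FILE_QUOTA };

/*
 * Return 0 if the file may be defragmented, or a negative errno that
 * tells the caller why it may not (values taken from the commit text).
 */
static int defrag_precheck(enum file_kind kind)
{
    switch (kind) {
    case FILE_SWAP:
        return -ETXTBSY;     /* swap files: busy */
    case FILE_QUOTA:
        return -EOPNOTSUPP;  /* quota files: operation not supported */
    default:
        return 0;
    }
}
```

A caller can then tell "this file is in use as swap" apart from "this file type is never supported" by inspecting the returned value.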
    • ext4: continue to expand file system when the target size doesn't reach · df3cb754
      Jerry Lee 李修賢 authored
      When expanding a file system from (16TiB-2MiB) to 18TiB, the operation
      exits early, which leaves resize2fs and the ext4 kernel driver reporting
      inconsistent results.
      
      === before ===
      ○ → resize2fs /dev/mapper/thin
      resize2fs 1.45.5 (07-Jan-2020)
      Filesystem at /dev/mapper/thin is mounted on /mnt/test; on-line resizing required
      old_desc_blocks = 2048, new_desc_blocks = 2304
      The filesystem on /dev/mapper/thin is now 4831837696 (4k) blocks long.
      
      [  865.186308] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
      [  912.091502] dm-4: detected capacity change from 34359738368 to 38654705664
      [  970.030550] dm-5: detected capacity change from 34359734272 to 38654701568
      [ 1000.012751] EXT4-fs (dm-5): resizing filesystem from 4294966784 to 4831837696 blocks
      [ 1000.012878] EXT4-fs (dm-5): resized filesystem to 4294967296
      
      === after ===
      [  129.104898] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
      [  143.773630] dm-4: detected capacity change from 34359738368 to 38654705664
      [  198.203246] dm-5: detected capacity change from 34359734272 to 38654701568
      [  207.918603] EXT4-fs (dm-5): resizing filesystem from 4294966784 to 4831837696 blocks
      [  207.918754] EXT4-fs (dm-5): resizing filesystem from 4294967296 to 4831837696 blocks
      [  207.918758] EXT4-fs (dm-5): Converting file system to meta_bg
      [  207.918790] EXT4-fs (dm-5): resizing filesystem from 4294967296 to 4831837696 blocks
      [  221.454050] EXT4-fs (dm-5): resized to 4658298880 blocks
      [  227.634613] EXT4-fs (dm-5): resized filesystem to 4831837696
      Signed-off-by: Jerry Lee <jerrylee@qnap.com>
      Link: https://lore.kernel.org/r/PU1PR04MB22635E739BD21150DC182AC6A18C9@PU1PR04MB2263.apcprd04.prod.outlook.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
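The block and sector counts in the logs for "ext4: continue to expand file system when the target size doesn't reach" can be checked with a little arithmetic. The sketch assumes the 4 KiB block size shown in the resize2fs output and 512-byte sectors for the device-mapper capacity lines; the helper names are made up for this example.

```c
#include <stdint.h>

#define FS_BLOCK_BYTES 4096ULL   /* "(4k) blocks" in the resize2fs output */
#define SECTOR_BYTES   512ULL    /* assumed unit of the dm capacity lines */
#define MIB            (1ULL << 20)
#define TIB            (1ULL << 40)

/* Convert a count of file system blocks to bytes. */
static uint64_t blocks_to_bytes(uint64_t blocks)
{
    return blocks * FS_BLOCK_BYTES;
}

/* Convert a count of device sectors to bytes. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
    return sectors * SECTOR_BYTES;
}
```

Under these assumptions, 4294966784 blocks is 16 TiB minus 2 MiB, 4294967296 blocks (where the "before" kernel log stops) is exactly 16 TiB, and the requested 4831837696 blocks matches the 38654701568-sector dm-5 device at 18 TiB minus 2 MiB.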
  5. 26 Sep, 2022 1 commit
  6. 22 Sep, 2022 7 commits
    • ext4: limit the number of retries after discarding preallocations blocks · 80fa46d6
      Theodore Ts'o authored
      This patch avoids threads live-locking for hours when a large number of
      threads are competing over the last few free extents as blocks are
      added to and removed from preallocation pools.  From our bug
      reporter:
      
         A reliable way of triggering this is to have multiple writers
         continuously write() to files when the filesystem is full, while
         small amounts of space are freed (e.g. by truncating a large file
         -1MiB at a time). In the local filesystem, this can be done by
         simply not checking the return code of write (0) and/or the error
         (ENOSPC) that is set. Over NFS with an async mount, even clients
         with proper error checking will behave this way since the Linux NFS
         client implementation will not propagate the server errors [the
         write syscalls immediately return success] until the file handle is
         closed. This leads to a situation where NFS clients send a
         continuous stream of WRITE RPCs which result in ENOSPC -- but
         since the client isn't seeing this, the stream of writes continues
         at maximum network speed.
      
         When some space does appear, multiple writers will all attempt to
         claim it for their current write. For NFS, we may see dozens to
         hundreds of threads that do this.
      
         The real-world scenario of this is database backup tooling (in
         particular, github.com/mdkent/percona-xtrabackup) which may write
         large files (>1TiB) to NFS for safe keeping. Some temporary files
         are written, rewound, and read back -- all before closing the file
         handle (the temp file is actually unlinked, to trigger automatic
         deletion on close/crash.) An application like this operating on an
         async NFS mount will not see an error code until TiB have been
         written/read.
      
         The lockup was observed when running this database backup on large
         filesystems (64 TiB in this case) with a high number of block
         groups and no free space. Fragmentation is generally not a factor
         in this filesystem (~thousands of large files, mostly contiguous
         except for the parts written while the filesystem is at capacity.)
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
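The idea behind "ext4: limit the number of retries after discarding preallocations blocks" can be sketched as a retry loop with a fixed budget. Everything here is a user-space stand-in: `struct fake_allocator`, `try_allocate()` and the limit `MAX_DISCARD_RETRIES` are assumptions for illustration, not the kernel's identifiers or its chosen limit.

```c
#include <errno.h>
#include <stdbool.h>

#define MAX_DISCARD_RETRIES 3   /* hypothetical budget for this sketch */

/* Hypothetical allocator: fails a configurable number of times. */
struct fake_allocator {
    int failures_left;   /* attempts that will still fail */
    int attempts;        /* attempts made so far */
};

static bool try_allocate(struct fake_allocator *a)
{
    a->attempts++;
    if (a->failures_left > 0) {
        a->failures_left--;
        return false;
    }
    return true;
}

/* Returns 0 on success, -ENOSPC once the retry budget is used up. */
static int allocate_with_bounded_retries(struct fake_allocator *a)
{
    int retries = 0;

    for (;;) {
        if (try_allocate(a))
            return 0;
        /*
         * A real allocator would discard preallocations here; other
         * threads may grab the freed blocks first, so retrying without
         * a bound can spin for a very long time.
         */
        if (++retries >= MAX_DISCARD_RETRIES)
            return -ENOSPC;
    }
}
```

With a bound in place, a thread that keeps losing the race gives up and reports the out-of-space error instead of looping indefinitely.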
    • ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0 · 29a5b8a1
      Luís Henriques authored
      When walking through an inode's extents, the ext4_ext_binsearch_idx() function
      assumes that the extent header has been previously validated.  However, there
      are no checks that verify that the number of entries (eh->eh_entries) is
      non-zero when depth is > 0.  This leads to problems because
      EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and result in this:
      
      [  135.245946] ------------[ cut here ]------------
      [  135.247579] kernel BUG at fs/ext4/extents.c:2258!
      [  135.249045] invalid opcode: 0000 [#1] PREEMPT SMP
      [  135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4
      [  135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
      [  135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0
      [  135.256475] Code:
      [  135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246
      [  135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023
      [  135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c
      [  135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c
      [  135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024
      [  135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000
      [  135.272394] FS:  00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
      [  135.274510] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0
      [  135.277952] Call Trace:
      [  135.278635]  <TASK>
      [  135.279247]  ? preempt_count_add+0x6d/0xa0
      [  135.280358]  ? percpu_counter_add_batch+0x55/0xb0
      [  135.281612]  ? _raw_read_unlock+0x18/0x30
      [  135.282704]  ext4_map_blocks+0x294/0x5a0
      [  135.283745]  ? xa_load+0x6f/0xa0
      [  135.284562]  ext4_mpage_readpages+0x3d6/0x770
      [  135.285646]  read_pages+0x67/0x1d0
      [  135.286492]  ? folio_add_lru+0x51/0x80
      [  135.287441]  page_cache_ra_unbounded+0x124/0x170
      [  135.288510]  filemap_get_pages+0x23d/0x5a0
      [  135.289457]  ? path_openat+0xa72/0xdd0
      [  135.290332]  filemap_read+0xbf/0x300
      [  135.291158]  ? _raw_spin_lock_irqsave+0x17/0x40
      [  135.292192]  new_sync_read+0x103/0x170
      [  135.293014]  vfs_read+0x15d/0x180
      [  135.293745]  ksys_read+0xa1/0xe0
      [  135.294461]  do_syscall_64+0x3c/0x80
      [  135.295284]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      This patch simply adds an extra check in __ext4_ext_check(), verifying that
      eh_entries is not 0 when eh_depth is > 0.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283
      Cc: Baokun Li <libaokun1@huawei.com>
      Cc: stable@kernel.org
      Signed-off-by: Luís Henriques <lhenriques@suse.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Baokun Li <libaokun1@huawei.com>
      Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
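The extra check described in "ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0" might look roughly like the following in a user-space validator. `struct ext_header` is a simplified, host-endian stand-in for the on-disk header, and `check_ext_header()` and its messages are hypothetical; only the final condition mirrors what the commit message states.

```c
#include <stddef.h>
#include <stdint.h>

#define EXT_MAGIC 0xF30A   /* ext4 extent header magic */

/* Simplified stand-in for the on-disk extent header (host endian). */
struct ext_header {
    uint16_t eh_magic;    /* magic number */
    uint16_t eh_entries;  /* number of valid entries */
    uint16_t eh_max;      /* capacity in entries */
    uint16_t eh_depth;    /* 0 for a leaf, > 0 for an index node */
};

/* Returns NULL if the header looks sane, else a description of the fault. */
static const char *check_ext_header(const struct ext_header *eh)
{
    if (eh->eh_magic != EXT_MAGIC)
        return "invalid magic";
    if (eh->eh_max == 0)
        return "invalid eh_max";
    if (eh->eh_entries > eh->eh_max)
        return "invalid eh_entries";
    /* The check the commit adds: an index node must have entries. */
    if (eh->eh_entries == 0 && eh->eh_depth > 0)
        return "eh_entries is 0 but eh_depth is > 0";
    return NULL;
}
```

An empty leaf (depth 0, no entries) still passes in this sketch; only an index node with nothing to search is rejected.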
    • ext4: use buckets for cr 1 block scan instead of rbtree · 83e80a6e
      Jan Kara authored
      Using rbtree for sorting groups by average fragment size is relatively
      expensive (needs rbtree update on every block freeing or allocation) and
      leads to wide spreading of allocations because selection of block group
      is very sensitive both to changes in free space and amount of blocks
      allocated. Furthermore selecting group with the best matching average
      fragment size is not necessary anyway, even more so because the
      variability of fragment sizes within a group is likely large so average
      is not telling much. We just need a group with large enough average
      fragment size so that we have high probability of finding large enough
      free extent and we don't want average fragment size to be too big so
      that we are likely to find free extent only somewhat larger than what we
      need.
      
      So instead of maintaining an rbtree of groups sorted by fragment size, keep
      bins (lists) of groups where the average fragment size is in the interval
      [2^i, 2^(i+1)). This structure requires fewer updates on block allocation
      / freeing, generally avoids chaotic spreading of allocations into block
      groups, and is still able to quickly (even faster than the rbtree)
      provide a block group which is likely to have a suitably sized free
      space extent.
      
      This patch reduces the number of block groups used when untarring an archive
      with medium sized files (size somewhat above 64k which is default
      mballoc limit for avoiding locality group preallocation) to about half
      and thus improves write speeds for eMMC flash significantly.
      
      Fixes: 196e402a ("ext4: improve cr 0 / cr 1 group scanning")
      CC: stable@kernel.org
      Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
      Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
      Link: https://lore.kernel.org/r/20220908092136.11770-5-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
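The binning in "ext4: use buckets for cr 1 block scan instead of rbtree" places a group in bucket i when its average fragment size lies in [2^i, 2^(i+1)). A minimal sketch of that mapping follows; `avg_fragment_bucket()` and its parameters are hypothetical, and the kernel's own helper may treat the smallest and largest sizes differently.

```c
/*
 * Map a group's average free-fragment size to a bucket index i such that
 * 2^i <= average < 2^(i+1), clamped to the last bucket.
 * Returns -1 for a group with no free fragments.
 */
static int avg_fragment_bucket(unsigned int free_blocks,
                               unsigned int fragments,
                               int num_buckets)
{
    unsigned int avg;
    int order = 0;

    if (free_blocks == 0 || fragments == 0)
        return -1;

    avg = free_blocks / fragments;

    /* floor(log2(avg)) by repeated halving */
    while (avg > 1) {
        avg >>= 1;
        order++;
    }

    if (order >= num_buckets)
        order = num_buckets - 1;
    return order;
}
```

Because the index only changes when the average crosses a power of two, most allocations and frees leave a group in the bucket it already occupies, which is where the saving over an rbtree update on every change comes from.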
    • ext4: use locality group preallocation for small closed files · a9f2a293
      Jan Kara authored
      Currently we don't use any preallocation when a file is already closed
      when allocating blocks (from writeback code when converting delayed
      allocation). However for small files, using locality group preallocation
      is actually desirable as that is not specific to a particular file.
      Rather it is a method to pack small files together to reduce
      fragmentation, and for that the fact that the file is closed is an even
      stronger hint that the file would benefit from packing. So change the logic
      to allow locality group preallocation in this case.
      
      Fixes: 196e402a ("ext4: improve cr 0 / cr 1 group scanning")
      CC: stable@kernel.org
      Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
      Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
      Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
      Link: https://lore.kernel.org/r/20220908092136.11770-4-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
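The decision described in "ext4: use locality group preallocation for small closed files" can be pictured as the small selector below. It is a deliberately reduced sketch: the enum, `choose_prealloc()` and the `stream_threshold` parameter are hypothetical, and other conditions the real allocator considers are left out.

```c
#include <stdbool.h>

enum prealloc_kind {
    PREALLOC_NONE,
    PREALLOC_PER_INODE,
    PREALLOC_LOCALITY_GROUP
};

/*
 * size_blocks:      file size in blocks after this allocation
 * stream_threshold: hypothetical cut-off between "small" and "large" files
 * open_for_write:   whether the file is still open for writing
 */
static enum prealloc_kind choose_prealloc(unsigned long size_blocks,
                                          unsigned long stream_threshold,
                                          bool open_for_write)
{
    /* Small files are packed together whether or not they are closed. */
    if (size_blocks <= stream_threshold)
        return PREALLOC_LOCALITY_GROUP;

    /* Large files: per-inode preallocation only helps while still open. */
    return open_for_write ? PREALLOC_PER_INODE : PREALLOC_NONE;
}
```

The behavioral change in this sketch is confined to the first branch: a small closed file now gets locality group preallocation instead of none.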
    • ext4: make directory inode spreading reflect flexbg size · 613c5a85
      Jan Kara authored
      Currently the Orlov inode allocator searches for free inodes for a
      directory only in flex block groups with at most inodes_per_group/16
      more directory inodes than average per flex block group. However, as
      flex block groups grow in size, this becomes unnecessarily strict.
      Scale the allowed difference from the average directory count per flex
      block group with the flex block group size, as we do with other metrics.
      Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
      Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
      Cc: stable@kernel.org
      Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220908092136.11770-3-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
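To make the scaling in "ext4: make directory inode spreading reflect flexbg size" concrete, the sketch below compares a fixed allowance of inodes_per_group/16 with one multiplied by the flex group size. Function and parameter names are hypothetical, and `nflexgroups` is assumed to be non-zero.

```c
/* Limit with a fixed allowance above the average directory count. */
static unsigned int max_dirs_fixed(unsigned int ndirs,
                                   unsigned int nflexgroups,
                                   unsigned int inodes_per_group)
{
    return ndirs / nflexgroups + inodes_per_group / 16;
}

/* Limit whose allowance grows with the number of groups per flex group. */
static unsigned int max_dirs_scaled(unsigned int ndirs,
                                    unsigned int nflexgroups,
                                    unsigned int inodes_per_group,
                                    unsigned int flex_size)
{
    return ndirs / nflexgroups + inodes_per_group * flex_size / 16;
}
```

For example, with 1600 directories over 100 flex groups and 8192 inodes per group, the fixed limit is 528 directories per flex group, while a flex size of 16 raises the scaled limit to 8208.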
    • ext4: avoid unnecessary spreading of allocations among groups · 1940265e
      Jan Kara authored
      mb_set_largest_free_order() updates lists containing groups with largest
      chunk of free space of given order. The way it updates them always
      moves the group to the tail of the list. Thus allocations
      looking for free space of given order effectively end up cycling through
      all groups (and due to initialization in last to first order). This
      spreads allocations among block groups which reduces performance for
      rotating disks or low-end flash media. Change
      mb_set_largest_free_order() to only update lists if the order of the
      largest free chunk in the group changed.
      
      Fixes: 196e402a ("ext4: improve cr 0 / cr 1 group scanning")
      CC: stable@kernel.org
      Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
      Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
      Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
      Link: https://lore.kernel.org/r/20220908092136.11770-2-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
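The change described in "ext4: avoid unnecessary spreading of allocations among groups" amounts to skipping the list update when nothing changed. In this sketch `struct group_info` and the `list_moves` counter are hypothetical; the counter stands in for removing the group from one list and appending it to the tail of another.

```c
/* Hypothetical per-group bookkeeping for this sketch. */
struct group_info {
    int largest_free_order;   /* order of the largest free chunk */
    unsigned int list_moves;  /* times the group was re-queued */
};

static void set_largest_free_order(struct group_info *grp, int new_order)
{
    /* Unchanged order: leave the group where it is in its list. */
    if (grp->largest_free_order == new_order)
        return;

    grp->largest_free_order = new_order;
    grp->list_moves++;   /* stands in for list removal + tail insertion */
}
```

A group whose largest free chunk keeps the same order across many small allocations therefore keeps its position, so the scan does not rotate through every group.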
    • ext4: make mballoc try target group first even with mb_optimize_scan · 4fca50d4
      Jan Kara authored
      One of the side-effects of mb_optimize_scan was that the optimized
      functions to select next group to try were called even before we tried
      the goal group. As a result we no longer allocate files close to
      corresponding inodes and we don't try to expand the currently
      allocated extent in the same group. This results in a reaim regression
      with the workfile.disk workload of up to 8% with many clients on my test
      machine:
      
                           baseline               mb_optimize_scan
      Hmean     disk-1       2114.16 (   0.00%)     2099.37 (  -0.70%)
      Hmean     disk-41     87794.43 (   0.00%)    83787.47 *  -4.56%*
      Hmean     disk-81    148170.73 (   0.00%)   135527.05 *  -8.53%*
      Hmean     disk-121   177506.11 (   0.00%)   166284.93 *  -6.32%*
      Hmean     disk-161   220951.51 (   0.00%)   207563.39 *  -6.06%*
      Hmean     disk-201   208722.74 (   0.00%)   203235.59 (  -2.63%)
      Hmean     disk-241   222051.60 (   0.00%)   217705.51 (  -1.96%)
      Hmean     disk-281   252244.17 (   0.00%)   241132.72 *  -4.41%*
      Hmean     disk-321   255844.84 (   0.00%)   245412.84 *  -4.08%*
      
      Also this is causing a huge regression (time increased by a factor of 5 or
      so) when untarring an archive with lots of small files on some eMMC storage
      cards.
      
      Fix the problem by making sure we try goal group first.
      
      Fixes: 196e402a ("ext4: improve cr 0 / cr 1 group scanning")
      CC: stable@kernel.org
      Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
      Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
      Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/
      Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220908092136.11770-1-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
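The ordering fix in "ext4: make mballoc try target group first even with mb_optimize_scan" can be illustrated with a selector that checks the goal group before any other strategy. `pick_group()` is a hypothetical helper, and its fallback (choose the group with the most free blocks) is only a placeholder for the optimized scan, not the kernel's actual policy.

```c
#include <stddef.h>

/*
 * Return the index of the group to allocate from, or -1 if no group has
 * at least `needed` free blocks. The goal group wins whenever it fits.
 */
static int pick_group(int goal, const unsigned int *free_blocks,
                      size_t ngroups, unsigned int needed)
{
    int best = -1;

    /* Try the goal group first to keep data near its inode. */
    if (goal >= 0 && (size_t)goal < ngroups && free_blocks[goal] >= needed)
        return goal;

    /* Placeholder for the optimized scan: take the emptiest fitting group. */
    for (size_t g = 0; g < ngroups; g++) {
        if (free_blocks[g] < needed)
            continue;
        if (best < 0 || free_blocks[g] > free_blocks[best])
            best = (int)g;
    }
    return best;
}
```

In this sketch a goal group with enough space is chosen even when another group has more free blocks, which preserves locality.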