- 15 Apr, 2024 40 commits
-
-
Darrick J. Wong authored
Check the owner field of dabtree node blocks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Check the owner field of xattr remote value blocks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Create a leaf block header checking function to validate the owner field of xattr leaf blocks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Reduce the indentation here so that we can add some things in the next patch without going over the column limits. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
When we're creating leaf, data, freespace, or dabtree blocks for directories and xattrs, use the explicit owner field (instead of the xfs_inode) to set the owner field. This will enable online repair to construct replacement data structures in a temporary file without having to change the owner fields prior to swapping the new and old structures. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Add an explicit owner field to xfs_da_args, which will make it easier for online fsck to set the owner field of the temporary directory and xattr structures that it builds to repair damaged metadata. Note: I hopefully found all the xfs_da_args definitions by looking for automatic stack variable declarations and xfs_da_args.dp assignments: git grep -E '(args.*dp =|struct xfs_da_args[[:space:]]*[a-z0-9][a-z0-9]*)' Note that callers of xfs_attr_{get,set,change} can set the owner to zero (or leave it unset) to have the default set to args->dp. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Repair the realtime summary data by constructing a new rtsummary file in the scrub temporary file, then atomically swapping the contents. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Create some new routines to exchange the contents of a temporary file created to stage a repair with another ondisk file. This will be used by the realtime summary repair function to commit atomically the new rtsummary data, which will be staged in the tempfile. The rest of XFS coordinates access to the realtime metadata inodes solely through the ILOCK. For repair to hold its exclusive access to the realtime summary file, it has to allocate a single large transaction and roll it repeatedly throughout the repair while holding the ILOCK. In turn, this means that for now there's only a partial file mapping exchange implementation for the temporary file because we can only work within an existing transaction. For now, the only tempswap functions needed here are to estimate the resource requirements of the exchange, reserve more space/quota to an existing transaction, and kick off the actual exchange. The rest will be added in a later patch in preparation for repairing xattrs and directories. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Create the routines we need to preallocate space in a temporary ondisk file and then copy the contents of an xfile into the tempfile. The upcoming rtsummary repair feature will construct the contents of a realtime summary file in memory, after which it will want to copy all that into the ondisk temporary file before atomically committing the new rtsummary contents. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
In preparation for supporting repair of indexed file-based metadata (such as realtime bitmaps, directories, and extended attribute data), add a function to reap the old blocks after a metadata repair finishes. IOWs, this is an elaborate bunmapi call that deals with crosslinked blocks by unmapping them without freeing them, and also scans for incore buffers to invalidate. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
In an upcoming patch, we will need to be able to look for xfs_buf objects caching file-based metadata blocks without needing to walk the (possibly corrupt) structures to find all the buffers. Repair already has most of the code needed to scan the buffer cache, so hoist these utility functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Teach the online repair code how to create temporary files or directories. These temporary files can be used to stage reconstructed information until we're ready to perform an atomic extent swap to commit the new metadata. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
We're about to start adding functionality that uses internal inodes that are private to XFS. What this means is that userspace should never be able to access any information about these files, and should not be able to open these files by handle. To prevent users from ever finding the file or mis-interactions with the security apparatus, set S_PRIVATE on the inode. Don't allow bulkstat, open-by-handle, or linking of S_PRIVATE files into the directory tree. This should keep private inodes actually private. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Add the XFS_SB_FEAT_INCOMPAT_EXCHRANGE feature to the set of features that we will permit when mounting a filesystem. This turns on support for the file range exchange feature. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Start reworking the atomic swapext design documentation to refer to its new file contents/mapping exchange name. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Per some very late review comments, capture the generation numbers of both inodes involved in a file content exchange operation so that we don't accidentally target files with have been reallocated. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
The generic exchange-range alignment checks use (fast) bitmasking operations to perform block alignment checks on the exchange parameters. Unfortunately, bitmasks require that the alignment size be a power of two. This isn't true for realtime devices with a non-power-of-two extent size, so we have to copy-pasta the generic checks using long division for this to work properly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Now that bmap items support the realtime device, we can add the necessary pieces to the file range exchange code to support exchanging mappings. All we really need to do here is adjust the blockcount upwards to the end of the rt extent and remove the inode checks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
The previous commit added a new file mapping exchange flag that enables us to perform post-exchange processing on file2 once we're done exchanging the extent mappings. Now add this ability for symlinks. This isn't used anywhere right now, but we need to have the basic ondisk flags in place so that a future online symlink repair feature can salvage the remote target in a temporary link and exchange the data fork mappings when ready. If one file is in extents format and the other is inline, we will have to promote both to extents format to perform the exchange. After the exchange, we can try to condense the fixed symlink down to inline format if possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
The previous commit added a new file mapping exchange flag that enables us to perform post-swap processing on file2 once we're done exchanging extent mappings. Now add this ability for directories. This isn't used anywhere right now, but we need to have the basic ondisk flags in place so that a future online directory repair feature can create salvaged dirents in a temporary directory and exchange the data fork mappings when ready. If one file is in extents format and the other is inline, we will have to promote both to extents format to perform the exchange. After the exchange, we can try to condense the fixed directory down to inline format if possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Add a new file mapping exchange flag that enables us to perform post-exchange processing on file2 once we're done exchanging the extent mappings. If we were swapping mappings between extended attribute forks, we want to be able to convert file2's attr fork from block to inline format. (This implies that all fork contents are exchanged.) This isn't used anywhere right now, but we need to have the basic ondisk flags in place so that a future online xattr repair feature can create salvaged attrs in a temporary file and exchange the attr fork mappings when ready. If one file is in extents format and the other is inline, we will have to promote both to extents format to perform the exchange. After the exchange, we can try to condense the fixed file's attr fork back down to inline format if possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Add an errortag so that we can test recovery of exchmaps log items. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
So far, we've constructed the front end of the file range exchange code that does all the checking; and the back end of the file mapping exchange code that actually does the work. Glue these two pieces together so that we can turn on the functionality. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Now that we've created the skeleton of a log intent item to track and restart file mapping exchange operations, add the upper level logic to commit intent items and turn them into concrete work recorded in the log. This builds on the existing bmap update intent items that have been around for a while now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Introduce a new intent log item to handle exchanging mappings between the forks of two files. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Create a incompat flag so that we only attempt to process file mapping exchange log items if the filesystem supports it, and a geometry flag to advertise support if it's present. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Introduce a new ioctl to handle exchanging ranges of bytes between files. The goal here is to perform the exchange atomically with respect to applications -- either they see the file contents before the exchange or they see that A-B is now B-A, even if the kernel crashes. My original goal with all this code was to make it so that online repair can build a replacement directory or xattr structure in a temporary file and commit the repair by atomically exchanging all the data blocks between the two files. However, I needed a way to test this mechanism thoroughly, so I've been evolving an ioctl interface since then. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Export these functions so that the next patch can use them to check the file ranges being passed to the XFS_IOC_EXCHANGE_RANGE operation. Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
This predicate doesn't modify the structure that's being passed in, so we can mark it const. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Create a helper function that can compute if a 64-bit number is an integer multiple of a 32-bit number, where the 32-bit number is not required to be an even power of two. This is needed for some new code for the realtime device, where we can set 37k allocation units and then have to remap them. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Replace the open-coded logic to decide if a file has a multi-fsb allocation unit to a helper to make the code easier to read. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Create a new helper function to calculate the fundamental allocation unit (i.e. the smallest unit of space we can allocate) of a file. Things are going to get hairy with range-exchange on the realtime device, so prepare for this now. Remove the static attribute from xfs_is_falloc_aligned since the next patch will need it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Move the two public symbols in xfs_file.c to xfs_file.h. We're about to add more public symbols in that source file, so let's finally create the header file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Similarly, move declarations of public symbols of xfs_iops.c from xfs_inode.h to xfs_iops.h. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
The lease breaking functions operate at the scope of the entire VFS inode, not subranges of a file. Move them to xfs_inode.c since they're already declared in xfs_inode.h. This cleanup moves us closer to having xfs_FOO.h declare only the symbols in xfs_FOO.c. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
While reviewing the online fsck patchset, someone spied the xfs_swapext_can_use_without_log_assistance function and wondered why we go through this inverted-bitmask dance to avoid setting the XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT feature. (The same principles apply to the logged extended attribute update feature bit in the since-merged LARP series.) The reason for this dance is that xfs_add_incompat_log_feature is an expensive operation -- it forces the log, pushes the AIL, and then if nobody's beaten us to it, sets the feature bit and issues a synchronous write of the primary superblock. That could be a one-time cost amortized over the life of the filesystem, but the log quiesce and cover operations call xfs_clear_incompat_log_features to remove feature bits opportunistically. On a moderately loaded filesystem this leads to us cycling those bits on and off over and over, which hurts performance. Why do we clear the log incompat bits? Back in ~2020 I think Dave and I had a conversation on IRC[2] about what the log incompat bits represent. IIRC in that conversation we decided that the log incompat bits protect unrecovered log items so that old kernels won't try to recover them and barf. Since a clean log has no protected log items, we could clear the bits at cover/quiesce time. As Dave Chinner pointed out in the thread, clearing log incompat bits at unmount time has positive effects for golden root disk image generator setups, since the generator could be running a newer kernel than what gets written to the golden image -- if there are log incompat fields set in the golden image that was generated by a newer kernel/OS image builder then the provisioning host cannot mount the filesystem even though the log is clean and recovery is unnecessary to mount the filesystem. Given that it's expensive to set log incompat bits, we really only want to do that once per bit per mount. Therefore, I propose that we only clear log incompat bits as part of writing a clean unmount record. Do this by adding an operational state flag to the xfs mount that guards whether or not the feature bit clearing can actually take place. This eliminates the l_incompat_users rwsem that we use to protect a log cleaning operation from clearing a feature bit that a frontend thread is trying to set -- this lock adds another way to fail w.r.t. locking. For the swapext series, I shard that into multiple locks just to work around the lockdep complaints, and that's fugly. Link: https://lore.kernel.org/linux-xfs/20240131230043.GA6180@frogsfrogsfrogs/Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
Darrick J. Wong authored
Dan Carpenter reports: "Commit 4bdfd7d1 ("xfs: repair free space btrees") from Dec 15, 2023 (linux-next), leads to the following Smatch static checker warning: fs/xfs/scrub/alloc_repair.c:781 xrep_abt_build_new_trees() warn: missing unwind goto?" That's a bug, so let's fix it. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 4bdfd7d1 ("xfs: repair free space btrees") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
xfs/399 found the following deadlock when fuzzing core.mode = ones: /proc/20506/task/20558/stack : [<0>] xfs_ilock+0xa0/0x240 [xfs] [<0>] xfs_ilock_data_map_shared+0x1b/0x20 [xfs] [<0>] xrep_dinode_findmode_walk_directory+0x69/0xe0 [xfs] [<0>] xrep_dinode_find_mode+0x103/0x2a0 [xfs] [<0>] xrep_dinode_mode+0x7c/0x120 [xfs] [<0>] xrep_dinode_core+0xed/0x2b0 [xfs] [<0>] xrep_dinode_problems+0x10/0x80 [xfs] [<0>] xrep_inode+0x6c/0xc0 [xfs] [<0>] xrep_attempt+0x64/0x1d0 [xfs] [<0>] xfs_scrub_metadata+0x365/0x840 [xfs] [<0>] xfs_scrubv_metadata+0x282/0x430 [xfs] [<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs] [<0>] xfs_file_ioctl+0xc68/0x1780 [xfs] /proc/20506/task/20559/stack : [<0>] xfs_buf_lock+0x3b/0x110 [xfs] [<0>] xfs_buf_find_lock+0x66/0x1c0 [xfs] [<0>] xfs_buf_get_map+0x208/0xc00 [xfs] [<0>] xfs_buf_read_map+0x5d/0x2c0 [xfs] [<0>] xfs_trans_read_buf_map+0x1b0/0x4c0 [xfs] [<0>] xfs_read_agi+0xbd/0x190 [xfs] [<0>] xfs_ialloc_read_agi+0x47/0x160 [xfs] [<0>] xfs_imap_lookup+0x69/0x1f0 [xfs] [<0>] xfs_imap+0x1fc/0x3d0 [xfs] [<0>] xfs_iget+0x357/0xd50 [xfs] [<0>] xchk_dir_actor+0x16e/0x330 [xfs] [<0>] xchk_dir_walk_block+0x164/0x1e0 [xfs] [<0>] xchk_dir_walk+0x13a/0x190 [xfs] [<0>] xchk_directory+0x1a2/0x2b0 [xfs] [<0>] xfs_scrub_metadata+0x2f4/0x840 [xfs] [<0>] xfs_scrubv_metadata+0x282/0x430 [xfs] [<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs] [<0>] xfs_file_ioctl+0xc68/0x1780 [xfs] Thread 20558 holds an AGI buffer and is trying to grab the ILOCK of the root directory. Thread 20559 holds the root directory ILOCK and is trying to grab the AGI of an inode that is one of the root directory's children. The AGI held by 20558 is the same buffer that 20559 is trying to acquire. In other words, this is an ABBA deadlock. In general, the lock order is ILOCK and then AGI -- rename does this while preparing for an operation involving whiteouts or renaming files out of existence; and unlink does this when moving an inode to the unlinked list. The only place where we do it in the opposite order is on the child during an icreate, but at that point the child is marked INEW and is not visible to other threads. Work around this deadlock by replacing the blocking ilock attempt with a nonblocking loop that aborts after 30 seconds. Relax for a jiffy after a failed lock attempt. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
While reviewing the next patch which fixes an ABBA deadlock between the AGI and a directory ILOCK, someone asked a question about why we're holding the AGI in the first place. The reason for that is to quiesce the inode structures for that AG while we do a repair. I then realized that the xrep_dinode_findmode invokes xchk_iscan_iter, which walks the inobts (and hence the AGIs) to find all the inodes. This itself is also an ABBA vector, since the damaged inode could be in AG 5, which we hold while we scan AG 0 for directories. 5 -> 0 is not allowed. To address this, modify the iscan to allow trylock of the AGI buffer using the flags argument to xfs_ialloc_read_agi that the previous patch added. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Allow callers to pass buffer lookup flags to xfs_read_agi and xfs_ialloc_read_agi. This will be used in the next patch to fix a deadlock in the online fsck inode scanner. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-