- 27 Apr, 2013 12 commits
-
-
Dave Chinner authored
The buffer type passed to log recvoery in the buffer log item overruns the blf_flags field. I had assumed that flags field was a 32 bit value, and it turns out it is a unisgned short. Therefore having 19 flags doesn't really work. Convert the buffer type field to numeric value, and use the top 5 bits of the flags field for it. We currently have 17 types of buffers, so using 5 bits gives us plenty of room for expansion in future.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
Add buffer types to the buffer log items so that log recovery can validate the buffers and calculate CRCs correctly after the buffers are recovered. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
There are two ways of doing this - the first is to add a CRC to the remote attribute entry in the attribute block. The second is to treat them similar to the remote symlink, where each fragment has it's own header and identifies fragment location in the attribute. The problem with the CRC in the remote attr entry is that we cannot identify the owner of the metadata from the metadata blocks themselves, or where the blocks fit into the remote attribute. The down side to this approach is that we never know when the attribute has been read from disk or not and so we have to verify it every time it is read, and we must calculate it during the create transaction and log it. We do not log CRCs for any other metadata, and so this creates a unique set of coherency problems that, in general, are best avoided. Adding an identifying header to each allocated block allows us to identify each fragment and where in the attribute it is located. It enables us to rebuild the remote attribute from just the raw blocks containing the attribute. It also provides us to do per-block CRCs verification at IO time rather than during the transaction context that creates it or every time it is read into a user buffer. Hence it avoids all the problems that an external, logged CRC has, and provides all the benefits of self identifying metadata. The only complexity is that we have to add a header per fragment, and we don't know how many fragments will be needed prior to allocations. If we take the symlink example, the header is 56 bytes and hence for a 4k block size filesystem, in the worst case 16 headers requires 1 extra block for the 64k attribute data. For 512 byte filesystems the worst case is an extra block for every 9 fragments (i.e. 16 extra blocks in the worse case). This will be very rare and so it's not really a major concern. Because allocation is done in two steps - the first finds a hole large enough in the attribute file, the second does the allocation - we only need to find a hole big enough for a worst case allocation. We only need to allocate enough extra blocks for number of headers required by the fragments, and we can calculate that as we go.... Hence it really only makes sense to use the same model as for symlinks - it doesn't add that much complexity, does not require an attribute tree format change, and does not require logging calculated CRC values. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
Adding CRC support to remote attributes adds a significant amount of remote attribute specific code. Split the existing remote attribute code out into it's own file so that all the relevant remote attribute code is in a single, easy to find place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
Because the header size for the CRC enabled directory blocks is larger, the offset of the first entry into a directory block is different to the dir2 format. The shortform directory stores the dirent's offset so that it doesn't change when moving from shortform to block form and back again, and hence it needs to take into account the different header sizes to maintain the correct offsets. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
This addition follows the same pattern as the dir2 block CRCs. Seeing as both LEAF1 and LEAFN types need to changed at the same time, this is a pretty large amount of change. leaf block headers need to be abstracted away from the on-disk structures (struct xfs_dir3_icleaf_hdr), as do the base leaf entry locations. This header abstract allows the in-core header and leaf entry location to be passed around instead of the leaf block itself. This saves a lot of converting individual variables from on-disk format to host format where they are used, so there's a good chance that the compiler will be able to produce much more optimal code as it's not having to byteswap variables all over the place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
This addition follows the same pattern as the dir2 block CRCs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
This addition follows the same pattern as the dir2 block CRCs, but with a few differences. The main difference is that the free block header is different between the v2 and v3 formats, so an "in-core" free block header has been added and _todisk/_from_disk functions used to abstract the differences in structure format from the code. This is similar to the on-disk superblock versus the in-core superblock setup. The in-core strucutre is populated when the buffer is read from disk, all the in memory checks and modifications are done on the in-core version of the structure which is written back to the buffer before the buffer is logged. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
Now that directory buffers are made from a single struct xfs_buf, we can add CRC calculation and checking callbacks. While there, add all the fields to the on disk structures for future functionality such as d_type support, uuids, block numbers, owner inode, etc. To distinguish between the different on disk formats, change the magic numbers for the new format directory blocks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
Add a header to the remote symlink block, containing location and owner information, as well as CRCs and LSN fields. This requires verifiers to be added to the remote symlink buffers for CRC enabled filesystems. This also fixes a bug reading multiple block symlinks, where the second block overwrites the first block when copying out the link name. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
- 21 Apr, 2013 8 commits
-
-
Dave Chinner authored
The symlink code is about to get more complicated when CRCs are added for remote symlink blocks. The symlink management code is mostly self contained, so move it to it's own files so that all the new code and the existing symlink code will not be intermingled with other unrelated code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Christoph Hellwig authored
Add a new inode version with a larger core. The primary objective is to allow for a crc of the inode, and location information (uuid and ino) to verify it was written in the right place. We also extend it by: a creation time (for Samba); a changecount (for NFSv4); a flush sequence (in LSN format for recovery); an additional inode flags field; and some additional padding. These additional fields are not implemented yet, but already laid out in the structure. [dchinner@redhat.com] Added LSN and flags field, some factoring and rework to capture all the necessary information in the crc calculation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Christoph Hellwig authored
Use the reserved space in struct xfs_dqblk to store a UUID and a crc for the quota blocks. [dchinner@redhat.com] Add a LSN field and update for current verifier infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
Same set of changes made to the AGF need to be made to the AGI. This patch has a similar history to the AGF, hence a similar sign-off chain. Signed-off-by: Dave Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dgc@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Christoph Hellwig authored
Add CRC checks, location information and a magic number to the AGFL. Previously the AGFL was just a block containing nothing but the free block pointers. The new AGFL has a real header with the usual boilerplate instead, so that we can verify it's not corrupted and written into the right place. [dchinner@redhat.com] Added LSN field, reworked significantly to fit into new verifier structure and growfs structure, enabled full verifier functionality now there is a header to verify and we can guarantee an initialised AGFL. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
The AGF already has some self identifying fields (e.g. the sequence number) so we only need to add the uuid to it to identify the filesystem it belongs to. The location is fixed based on the sequence number, so there's no need to add a block number, either. Hence the only additional fields are the CRC and LSN fields. These are unlogged, so place some space between the end of the logged fields and them so that future expansion of the AGF for logged fields can be placed adjacent to the existing logged fields and hence not complicate the field-derived range based logging we currently have. Based originally on a patch from myself, modified further by Christoph Hellwig and then modified again to fit into the verifier structure with additional fields by myself. The multiple signed-off-by tags indicate the age and history of this patch. Signed-off-by: Dave Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Christoph Hellwig authored
Add support for larger btree blocks that contains a CRC32C checksum, a filesystem uuid and block number for detecting filesystem consistency and out of place writes. [dchinner@redhat.com] Also include an owner field to allow reverse mappings to be implemented for improved repairability and a LSN field to so that log recovery can easily determine the last modification that made it to disk for each buffer. [dchinner@redhat.com] Add buffer log format flags to indicate the type of buffer to recovery so that we don't have to do blind magic number tests to determine what the buffer is. [dchinner@redhat.com] Modified to fit into the verifier structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
Currently xfs_corruption_error() dumps the first 16 bytes of the buffer that is passed to it when a corruption occurs. This is not large enough to see the entire state of the header of the block that was determined to be corrupt. increase the output to 64 bytes to capture the majority of all headers in all types of metadata blocks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
- 16 Apr, 2013 2 commits
-
-
Jeff Liu authored
xfs_log_commit_iclog() function has been removed by commits 93b8a585: xfs: remove the deprecated nodelaylog option Beginning from Linux 3.3, only delayed logging is supported so that we call xfs_log_commit_cil() at xfs_trans_commit() only, remove the useless comments so. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Jeff Liu authored
There is no more users of this Macro, so it's time to kill it dead. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
- 05 Apr, 2013 1 commit
-
-
Dave Chinner authored
Filesystems are occasionally being shut down with this error: xfs_trans_ail_delete_bulk: attempting to delete a log item that is not in the AIL. It was diagnosed to be related to the EFI/EFD commit order when the EFI and EFD are in different checkpoints and the EFD is committed before the EFI here: http://oss.sgi.com/archives/xfs/2013-01/msg00082.html The real problem is that a single bit cannot fully describe the states that the EFI/EFD processing can be in. These completion states are: EFI EFI in AIL EFD Result committed/unpinned Yes committed OK committed/pinned No committed Shutdown uncommitted No committed Shutdown Note that the "result" field is what should happen, not what does happen. The current logic is broken and handles the first two cases correctly by luck. That is, the code will free the EFI if the XFS_EFI_COMMITTED bit is *not* set, rather than if it is set. The inverted logic "works" because if both EFI and EFD are committed, then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED bit, and the second frees the EFI item. Hence as long as xfs_efi_item_committed() has been called, everything appears to be fine. It is the third case where the logic fails - where xfs_efd_item_committed() is called before xfs_efi_item_committed(), and that results in the EFI being freed before it has been committed. That is the bug that triggered the shutdown, and hence keeping track of whether the EFI has been committed or not is insufficient to correctly order the EFI/EFD operations w.r.t. the AIL. What we really want is this: the EFI is always placed into the AIL before the last reference goes away. The only way to guarantee that is that the EFI is not freed until after it has been unpinned *and* the EFD has been committed. That is, restructure the logic so that the only case that can occur is the first case. This can be done easily by replacing the XFS_EFI_COMMITTED with an EFI reference count. The EFI is initialised with it's own count, and that is not released until it is unpinned. However, there is a complication to this method - the high level EFI/EFD code in xfs_bmap_finish() does not hold direct references to the EFI structure, and runs a transaction commit between the EFI and EFD processing. Hence the EFI can be freed even before the EFD is created using such a method. Further, log recovery uses the AIL for tracking EFI/EFDs that need to be recovered, but it uses the AIL *differently* to the EFI transaction commit. Hence log recovery never pins or unpins EFIs, so we can't drop the EFI reference count indirectly to free the EFI. However, this doesn't prevent us from using a reference count here. There is a 1:1 relationship between EFIs and EFDs, so when we initialise the EFI we can take a reference count for the EFD as well. This solves the xfs_bmap_finish() issue - the EFI will never be freed until the EFD is processed. In terms of log recovery, during the committing of the EFD we can look for the XFS_EFI_RECOVERED bit being set and drop the EFI reference as well, thereby ensuring everything works correctly there as well. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
- 03 Apr, 2013 1 commit
-
-
Rich Johnston authored
Ratelimited printk will be useful in printing xfs messages which are otherwise not required to be printed always due to their high rate (to prevent kernel ring buffer from overflowing), while at the same time required to be printed. Signed-off-by: Raghavendra D Prabhu <rprabhu@wnohang.net> Reviewed-by: Rich Johnston <rjohnston@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
- 22 Mar, 2013 7 commits
-
-
Jan Kara authored
When a dirty page is truncated from a file but reclaim gets to it before truncate_inode_pages(), we hit WARN_ON(delalloc) in xfs_vm_releasepage(). This is because reclaim tries to write the page, xfs_vm_writepage() just bails out (leaving page clean) and thus reclaim thinks it can continue and calls xfs_vm_releasepage() on page with dirty buffers. Fix the issue by redirtying the page in xfs_vm_writepage(). This makes reclaim stop reclaiming the page and also logically it keeps page in a more consistent state where page with dirty buffers has PageDirty set. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Brian Foster authored
Add a tracepoint to provide some feedback on preallocation size calculation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Brian Foster authored
Introduce the need_throttle() and calc_throttle() functions to independently check whether throttling is required for a particular dquot and if so, calculate the associated throttling metrics based on the state of the quota. We use the same general algorithm to calculate the throttle shift as for global free space with the exception of using three stages rather than five. Update xfs_iomap_prealloc_size() to use the smallest available prealloc size based on each of the constraints and apply the maximum shift to obtain the throttled preallocation size. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Brian Foster authored
Enable tracking of high and low watermarks for preallocation throttling of files under quota restrictions. These values are calculated when the quota limit is read from disk or modified and cached for later use by the throttling algorithm. The high watermark specifies when preallocation is disabled, the low watermark specifies when throttling is enabled and the low free space data structure contains precalculated low free space limits to serve as input to determine the level of throttling required. Note that the low free space data structure is based on the existing global low free space data structure with the exception of using three stages (5%, 3% and 1%) rather than five to reduce the impact of xfs_dquot memory overhead. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Brian Foster authored
Modify xfs_qm_adjust_dqlimits() to take the xfs_dquot as a parameter instead of just the xfs_disk_dquot_t so we can update in-memory fields if necessary. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Brian Foster authored
The round down occurs towards the beginning of the function. Push it down after throttling has occurred. This is to support adding further transformations to 'alloc_blocks' that might not preserve power-of-two alignment (and thus could lead to rounding down multiple times). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Brian Foster authored
The majority of xfs_iomap_prealloc_size() executes within the check for lack of default I/O size. Reverse the logic to remove the extra indentation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
- 14 Mar, 2013 3 commits
-
-
Christoph Hellwig authored
Add a version argument to XFS_LITINO so that it can return different values depending on the inode version. This is required for the upcoming v3 inodes with a larger fixed layout dinode. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
Failed buffer readahead can leave the buffer in the cache marked with an error. Most callers that then issue a subsequent read on the buffer do not zero the b_error field out, and so we may incorectly detect an error during IO completion due to the stale error value left on the buffer. Avoid this problem by zeroing the error before IO submission. This ensures that the only IO errors that are detected those captured from are those captured from bio submission or completion. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Jeff Liu authored
Looks the old m_inode_shrink is obsoleted as we perform inodes reclaim per AG via m_reclaim_workqueue, this patch remove it from the xfs_mount structure if so. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
- 07 Mar, 2013 6 commits
-
-
Dave Chinner authored
xfs_bmap.c is a big file, and some of the related code is spread all throughout the file requiring function prototypes for static function and jumping all through the file to follow a single call path. Rearrange the code so that: a) related functionality is grouped together; and b) functions are grouped in call dependency order While the diffstat is large, there are no code changes in the patch; it is just moving the functionality around and removing the function prototypes at the top of the file. The resulting layout of the code is as follows (top of file to bottom): - miscellaneous helper functions - extent tree block counting routines - debug/sanity checking code - bmap free list manipulation functions - inode fork format manipulation functions - internal/external extent tree seach functions - extent tree manipulation functions used during allocation - functions used during extent read/allocate/removal operations (i.e. xfs_bmapi_write, xfs_bmapi_read, xfs_bunmapi and xfs_getbmap) This means that following logic paths through the bmapi code is much simpler - most of the code relevant to a specific operation is now clustered together rather than spread all over the file.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Akinobu Mita authored
Use more preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: <bpm@sgi.com> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: xfs@oss.sgi.com Signed-off-by: Ben Myers <bpm@sgi.com>
-
Dave Chinner authored
When we read a buffer, we might get an error from the underlying block device and not the real data. Hence if we get an IO error, we shouldn't run the verifier but instead just pass the IO error straight through. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Mark Tinguely authored
Fix the return type of xfs_iomap_eof_prealloc_initial_size() to xfs_fsblock_t to reflect the fact that the return value may be an unsigned 64 bits if XFS_BIG_BLKNOS is defined. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Brian Foster authored
The updated speculative preallocation algorithm for handling sparse files can becomes less effective in situations with a high number of concurrent, sequential writers. The number of writers and amount of available RAM affect the writeback bandwidth slicing algorithm, which in turn affects the block allocation pattern of XFS. For example, running 32 sequential writers on a system with 32GB RAM, preallocs become fixed at a value of around 128MB (instead of steadily increasing to the 8GB maximum as sequential writes proceed). Update the speculative prealloc heuristic to base the size of the next prealloc on double the size of the preceding extent. This preserves the original aggressive speculative preallocation behavior and continues to accomodate sparse files at a slight cost of increasing the size of preallocated data regions following holes of sparse files. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-
Brian Foster authored
If freesp == 0, we could end up in an infinite loop while squashing the preallocation. Break the loop when we've killed the prealloc entirely. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
-