1. 27 Apr, 2013 12 commits
    • xfs: buffer type overruns blf_flags field · 61fe135c
      Dave Chinner authored
      The buffer type passed to log recovery in the buffer log item
      overruns the blf_flags field. I had assumed that flags field was a
      32 bit value, and it turns out it is an unsigned short. Therefore
      having 19 flags doesn't really work.
      
      Convert the buffer type field to a numeric value, and use the top 5
      bits of the flags field for it. We currently have 17 types of
      buffers, so using 5 bits gives us plenty of room for expansion in
      future....
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      61fe135c
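
The packing described in this commit can be sketched as follows. This is a minimal illustration rather than the kernel code: the macro and function names (`EX_BLF_TYPE_BITS`, `ex_blf_set_type`, `ex_blf_get_type`) are hypothetical, and only the 16-bit flags width and the 5-bit type field come from the commit message.

```c
#include <stdint.h>

/* Hypothetical names; only the field widths come from the commit message. */
#define EX_BLF_TYPE_BITS   5
#define EX_BLF_TYPE_SHIFT  (16 - EX_BLF_TYPE_BITS)	/* top 5 bits of 16 */
#define EX_BLF_TYPE_MASK \
	((uint16_t)(((1u << EX_BLF_TYPE_BITS) - 1) << EX_BLF_TYPE_SHIFT))

/* Store a numeric buffer type (0..31) without disturbing the low flag bits. */
static inline uint16_t ex_blf_set_type(uint16_t flags, unsigned int type)
{
	flags &= (uint16_t)~EX_BLF_TYPE_MASK;
	flags |= (uint16_t)((type << EX_BLF_TYPE_SHIFT) & EX_BLF_TYPE_MASK);
	return flags;
}

/* Extract the numeric buffer type from the top bits of the flags field. */
static inline unsigned int ex_blf_get_type(uint16_t flags)
{
	return (unsigned int)(flags & EX_BLF_TYPE_MASK) >> EX_BLF_TYPE_SHIFT;
}
```

Five bits encode 32 distinct values, so 17 buffer types fit with room to spare, whereas one bit per type would need more bits than an unsigned short provides.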
    • xfs: add buffer types to directory and attribute buffers · d75afeb3
      Dave Chinner authored
      Add buffer types to the buffer log items so that log recovery can
      validate the buffers and calculate CRCs correctly after the buffers
      are recovered.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      d75afeb3
    • xfs: add CRC protection to remote attributes · d2e448d5
      Dave Chinner authored
      There are two ways of doing this - the first is to add a CRC to the
      remote attribute entry in the attribute block. The second is to
      treat them similarly to the remote symlink, where each fragment has
      its own header and identifies fragment location in the attribute.
      
      The problem with the CRC in the remote attr entry is that we cannot
      identify the owner of the metadata from the metadata blocks
      themselves, or where the blocks fit into the remote attribute. The
      down side to this approach is that we never know whether the attribute
      has been read from disk or not, and so we have to verify it every
      time it is read, and we must calculate it during the create
      transaction and log it. We do not log CRCs for any other metadata,
      and so this creates a unique set of coherency problems that, in
      general, are best avoided.
      
      Adding an identifying header to each allocated block allows us to
      identify each fragment and where in the attribute it is located. It
      enables us to rebuild the remote attribute from just the raw blocks
      containing the attribute. It also allows us to do per-block CRC
      verification at IO time rather than during the transaction context
      that creates it or every time it is read into a user buffer. Hence
      it avoids all the problems that an external, logged CRC has, and
      provides all the benefits of self identifying metadata.
      
      The only complexity is that we have to add a header per fragment,
      and we don't know how many fragments will be needed prior to
      allocations. If we take the symlink example, the header is 56 bytes
      and hence for a 4k block size filesystem, in the worst case 16
      headers require 1 extra block for the 64k attribute data. For 512
      byte filesystems the worst case is an extra block for every 9
      fragments (i.e. 16 extra blocks in the worst case). This will be
      very rare and so it's not really a major concern.
      
      Because allocation is done in two steps - the first finds a hole
      large enough in the attribute file, the second does the allocation -
      we only need to find a hole big enough for a worst case allocation.
      We only need to allocate enough extra blocks for the number of headers
      required by the fragments, and we can calculate that as we go....
      
      Hence it really only makes sense to use the same model as for
      symlinks - it doesn't add that much complexity, does not require an
      attribute tree format change, and does not require logging
      calculated CRC values.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      d2e448d5
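
The header overhead quoted in this commit message can be checked with a small calculation. The sketch below assumes the 56-byte header from the symlink example and a 64k maximum attribute size, both taken from the message; the function name `ex_rmt_blocks_needed` is hypothetical.

```c
#include <stddef.h>

/*
 * Number of filesystem blocks needed to hold attr_len bytes of remote
 * attribute data when every block starts with a hdr_size-byte header.
 * Hypothetical helper for illustration only.
 */
static size_t ex_rmt_blocks_needed(size_t attr_len, size_t blksize,
				   size_t hdr_size)
{
	size_t payload = blksize - hdr_size;	/* usable bytes per block */

	return (attr_len + payload - 1) / payload;	/* round up */
}
```

With 4096-byte blocks a 65536-byte attribute needs 17 blocks instead of 16, and with 512-byte blocks it needs 144 instead of 128, which matches the 1 and 16 extra blocks stated above.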
    • xfs: split remote attribute code out · 95920cd6
      Dave Chinner authored
      Adding CRC support to remote attributes adds a significant amount of
      remote attribute specific code. Split the existing remote attribute
      code out into its own file so that all the relevant remote
      attribute code is in a single, easy to find place.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      95920cd6
    • xfs: add CRCs to attr leaf blocks · 517c2220
      Dave Chinner authored
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      517c2220
    • xfs: add CRCs to dir2/da node blocks · f5ea1100
      Dave Chinner authored
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      f5ea1100
    • xfs: shortform directory offsets change for dir3 format · 6b2647a1
      Dave Chinner authored
      Because the header size for the CRC enabled directory blocks is
      larger, the offset of the first entry into a directory block is
      different to the dir2 format. The shortform directory stores the
      dirent's offset so that it doesn't change when moving from shortform
      to block form and back again, and hence it needs to take into
      account the different header sizes to maintain the correct offsets.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      6b2647a1
    • xfs: add CRC checking to dir2 leaf blocks · 24df33b4
      Dave Chinner authored
      This addition follows the same pattern as the dir2 block CRCs.
      Seeing as both LEAF1 and LEAFN types need to be changed at the same
      time, this is a pretty large amount of change. Leaf block headers
      need to be abstracted away from the on-disk structures (struct
      xfs_dir3_icleaf_hdr), as do the base leaf entry locations.
      
      This header abstraction allows the in-core header and leaf entry
      location to be passed around instead of the leaf block itself. This
      saves a lot of converting individual variables from on-disk format
      to host format where they are used, so there's a good chance that
      the compiler will be able to produce much more optimal code as it's
      not having to byteswap variables all over the place.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      24df33b4
    • xfs: add CRC checking to dir2 data blocks · 33363fee
      Dave Chinner authored
      This addition follows the same pattern as the dir2 block CRCs.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      33363fee
    • xfs: add CRC checking to dir2 free blocks · cbc8adf8
      Dave Chinner authored
      This addition follows the same pattern as the dir2 block CRCs, but
      with a few differences. The main difference is that the free block
      header is different between the v2 and v3 formats, so an "in-core"
      free block header has been added and _todisk/_from_disk functions
      used to abstract the differences in structure format from the code.
      This is similar to the on-disk superblock versus the in-core
      superblock setup. The in-core structure is populated when the buffer
      is read from disk; all the in-memory checks and modifications are
      done on the in-core version of the structure, which is written back
      to the buffer before the buffer is logged.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      cbc8adf8
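
The in-core header idea from this commit can be sketched roughly as below. Everything here is an assumption made for illustration: the magic values, the field layouts, and the `ex_` names are hypothetical and do not reproduce the real XFS on-disk structures. The only point taken from the commit message is that callers work on one host-endian structure while conversion functions hide the difference between the two on-disk layouts.

```c
#include <stdint.h>

/* Hypothetical magic values distinguishing the two on-disk layouts. */
#define EX_FREE_MAGIC_V2 0x00000002u
#define EX_FREE_MAGIC_V3 0x00000003u

/* In-core header: host-endian, identical for both on-disk formats. */
struct ex_icfree_hdr {
	uint32_t magic;
	uint32_t firstdb;
	uint32_t nvalid;
	uint32_t nused;
};

static uint32_t ex_get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void ex_put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Assumed layouts: the v3 header carries 4 extra bytes after the magic. */
static unsigned int ex_fields_offset(uint32_t magic)
{
	return magic == EX_FREE_MAGIC_V3 ? 8 : 4;
}

/* Populate the in-core header from a big-endian on-disk buffer. */
static void ex_free_hdr_from_disk(struct ex_icfree_hdr *to, const uint8_t *buf)
{
	unsigned int off;

	to->magic = ex_get_be32(buf);
	off = ex_fields_offset(to->magic);
	to->firstdb = ex_get_be32(buf + off);
	to->nvalid = ex_get_be32(buf + off + 4);
	to->nused = ex_get_be32(buf + off + 8);
}

/* Write the in-core header back into the buffer before it is logged. */
static void ex_free_hdr_to_disk(uint8_t *buf, const struct ex_icfree_hdr *from)
{
	unsigned int off = ex_fields_offset(from->magic);

	ex_put_be32(buf, from->magic);
	ex_put_be32(buf + off, from->firstdb);
	ex_put_be32(buf + off + 4, from->nvalid);
	ex_put_be32(buf + off + 8, from->nused);
}
```

Code that checks or modifies the header only ever touches `struct ex_icfree_hdr`, so it needs no knowledge of which format is on disk and performs no byte swapping of its own.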
    • xfs: add CRC checks to block format directory blocks · f5f3d9b0
      Dave Chinner authored
      Now that directory buffers are made from a single struct xfs_buf, we
      can add CRC calculation and checking callbacks. While there, add all
      the fields to the on disk structures for future functionality such
      as d_type support, uuids, block numbers, owner inode, etc.
      
      To distinguish between the different on disk formats, change the
      magic numbers for the new format directory blocks.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      f5f3d9b0
    • xfs: add CRC checks to remote symlinks · f948dd76
      Dave Chinner authored
      Add a header to the remote symlink block, containing location and
      owner information, as well as CRCs and LSN fields. This requires
      verifiers to be added to the remote symlink buffers for CRC enabled
      filesystems.
      
      This also fixes a bug reading multiple block symlinks, where the second
      block overwrites the first block when copying out the link name.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      f948dd76
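
The multi-block read bug mentioned in this commit is of a familiar kind. The commit message does not show the faulty code, so the sketch below only illustrates the general pattern, under the assumption that each block's payload must be copied to an advancing offset in the output; the function name and parameters are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy nfrags fragments into dst, one after another. Copying every
 * fragment to dst itself (rather than dst + offset) would make each
 * block overwrite the one before it.
 */
static size_t ex_copy_fragments(char *dst, size_t dst_len,
				const char *const *frags,
				const size_t *frag_lens, size_t nfrags)
{
	size_t offset = 0;

	for (size_t i = 0; i < nfrags && offset < dst_len; i++) {
		size_t n = frag_lens[i];

		if (n > dst_len - offset)	/* never overrun the output */
			n = dst_len - offset;
		memcpy(dst + offset, frags[i], n);
		offset += n;
	}
	return offset;	/* total bytes copied */
}
```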
  2. 21 Apr, 2013 8 commits
  3. 16 Apr, 2013 2 commits
  4. 05 Apr, 2013 1 commit
    • xfs: don't free EFIs before the EFDs are committed · 666d644c
      Dave Chinner authored
      Filesystems are occasionally being shut down with this error:
      
      xfs_trans_ail_delete_bulk: attempting to delete a log item that is
      not in the AIL.
      
      It was diagnosed to be related to the EFI/EFD commit order when the
      EFI and EFD are in different checkpoints and the EFD is committed
      before the EFI here:
      
      http://oss.sgi.com/archives/xfs/2013-01/msg00082.html
      
      The real problem is that a single bit cannot fully describe the
      states that the EFI/EFD processing can be in. These completion
      states are:
      
      EFI			EFI in AIL	EFD		Result
      committed/unpinned	Yes		committed	OK
      committed/pinned	No		committed	Shutdown
      uncommitted		No		committed	Shutdown
      
      
      Note that the "result" field is what should happen, not what does
      happen. The current logic is broken and handles the first two cases
      correctly by luck.  That is, the code will free the EFI if the
      XFS_EFI_COMMITTED bit is *not* set, rather than if it is set. The
      inverted logic "works" because if both EFI and EFD are committed,
      then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
      bit, and the second frees the EFI item. Hence as long as
      xfs_efi_item_committed() has been called, everything appears to be
      fine.
      
      It is the third case where the logic fails - where
      xfs_efd_item_committed() is called before xfs_efi_item_committed(),
      and that results in the EFI being freed before it has been
      committed. That is the bug that triggered the shutdown, and hence
      keeping track of whether the EFI has been committed or not is
      insufficient to correctly order the EFI/EFD operations w.r.t. the
      AIL.
      
      What we really want is this: the EFI is always placed into the
      AIL before the last reference goes away. The only way to guarantee
      that is that the EFI is not freed until after it has been unpinned
      *and* the EFD has been committed. That is, restructure the logic so
      that the only case that can occur is the first case.
      
      This can be done easily by replacing the XFS_EFI_COMMITTED with an
      EFI reference count. The EFI is initialised with its own count, and
      that is not released until it is unpinned. However, there is a
      complication to this method - the high level EFI/EFD code in
      xfs_bmap_finish() does not hold direct references to the EFI
      structure, and runs a transaction commit between the EFI and EFD
      processing. Hence the EFI can be freed even before the EFD is
      created using such a method.
      
      Further, log recovery uses the AIL for tracking EFI/EFDs that need
      to be recovered, but it uses the AIL *differently* to the EFI
      transaction commit. Hence log recovery never pins or unpins EFIs, so
      we can't drop the EFI reference count indirectly to free the EFI.
      
      However, this doesn't prevent us from using a reference count here.
      There is a 1:1 relationship between EFIs and EFDs, so when we
      initialise the EFI we can take a reference count for the EFD as
      well. This solves the xfs_bmap_finish() issue - the EFI will never
      be freed until the EFD is processed. In terms of log recovery,
      during the committing of the EFD we can look for the
      XFS_EFI_RECOVERED bit being set and drop the EFI reference as well,
      thereby ensuring everything works correctly there as well.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      666d644c
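
The reference counting scheme described in this commit can be modelled in a few lines. This is a simplified user-space sketch using C11 atomics, not the kernel implementation: the `ex_` names are hypothetical, the `freed` flag stands in for actually freeing the item, and the log recovery path (where the EFD commit drops the EFI's own reference as well) is not modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct ex_efi {
	atomic_int refcount;
	bool freed;		/* stands in for freeing the log item */
};

/* Take both references up front: one for the EFI, one for its EFD. */
static void ex_efi_init(struct ex_efi *efi)
{
	atomic_init(&efi->refcount, 2);
	efi->freed = false;
}

/* Drop one reference; only the final drop frees the item. */
static void ex_efi_release(struct ex_efi *efi)
{
	if (atomic_fetch_sub(&efi->refcount, 1) == 1)
		efi->freed = true;
}

/* The EFI's own reference goes away when it is unpinned. */
static void ex_efi_unpin(struct ex_efi *efi)
{
	ex_efi_release(efi);
}

/* The EFD's reference goes away when the EFD is committed. */
static void ex_efd_committed(struct ex_efi *efi)
{
	ex_efi_release(efi);
}
```

Whichever of the two events happens first, the EFI survives until the other has also happened, so the ordering that triggered the shutdown can no longer free the EFI early.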
  5. 03 Apr, 2013 1 commit
  6. 22 Mar, 2013 7 commits
  7. 14 Mar, 2013 3 commits
  8. 07 Mar, 2013 6 commits
    • xfs: rearrange some code in xfs_bmap for better locality · 9e5987a7
      Dave Chinner authored
      xfs_bmap.c is a big file, and some of the related code is spread all
      throughout the file, requiring function prototypes for static
      functions and jumping all through the file to follow a single call
      path. Rearrange the code so that:
      
      	a) related functionality is grouped together; and
      	b) functions are grouped in call dependency order
      
      While the diffstat is large, there are no code changes in the patch;
      it is just moving the functionality around and removing the function
      prototypes at the top of the file. The resulting layout of the code
      is as follows (top of file to bottom):
      
      	- miscellaneous helper functions
      	- extent tree block counting routines
      	- debug/sanity checking code
      	- bmap free list manipulation functions
      	- inode fork format manipulation functions
      	- internal/external extent tree search functions
      	- extent tree manipulation functions used during allocation
      	- functions used during extent read/allocate/removal
      	  operations (i.e. xfs_bmapi_write, xfs_bmapi_read,
      	  xfs_bunmapi and xfs_getbmap)
      
      This means that following logic paths through the bmapi code is much
      simpler - most of the code relevant to a specific operation is now
      clustered together rather than spread all over the file....
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      9e5987a7
    • xfs: rename random32() to prandom_u32() · ecb3403d
      Akinobu Mita authored
      Use the preferred function name, which implies use of a pseudo-random
      number generator.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Acked-by: <bpm@sgi.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: xfs@oss.sgi.com
      Signed-off-by: Ben Myers <bpm@sgi.com>
      ecb3403d
    • xfs: don't verify buffers after IO errors · d5929de8
      Dave Chinner authored
      When we read a buffer, we might get an error from the underlying
      block device and not the real data. Hence if we get an IO error, we
      shouldn't run the verifier but instead just pass the IO error
      straight through.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      d5929de8
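
The rule in this commit amounts to a guard in the read completion path. The following is a minimal sketch with hypothetical types and names (`struct ex_buf`, `ex_read_done`, `ex_count_verify`); it is not the XFS buffer code, only an illustration of skipping the verifier when the IO itself failed.

```c
#include <stddef.h>

struct ex_buf {
	int error;	/* 0 on success, negative errno on IO failure */
	void (*verify_read)(struct ex_buf *bp);
	int verified;	/* counts verifier runs, for illustration */
};

/* Example verifier: a real one would check magic numbers, CRCs, etc. */
static void ex_count_verify(struct ex_buf *bp)
{
	bp->verified++;
}

/* Read completion: pass IO errors straight through, unverified. */
static void ex_read_done(struct ex_buf *bp)
{
	if (bp->error)
		return;		/* contents are not the real data */
	if (bp->verify_read)
		bp->verify_read(bp);
}
```

Running the verifier on a failed read would most likely report corruption for data that never arrived, masking the underlying IO error.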
    • xfs: fix xfs_iomap_eof_prealloc_initial_size type · e8108ced
      Mark Tinguely authored
      Fix the return type of xfs_iomap_eof_prealloc_initial_size() to
      xfs_fsblock_t to reflect the fact that the return value may be an
      unsigned 64 bits if XFS_BIG_BLKNOS is defined.
      Signed-off-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      e8108ced
    • xfs: increase prealloc size to double that of the previous extent · e114b5fc
      Brian Foster authored
      The updated speculative preallocation algorithm for handling sparse
      files can become less effective in situations with a high number of
      concurrent, sequential writers. The number of writers and amount of
      available RAM affect the writeback bandwidth slicing algorithm,
      which in turn affects the block allocation pattern of XFS. For
      example, running 32 sequential writers on a system with 32GB RAM,
      preallocs become fixed at a value of around 128MB (instead of
      steadily increasing to the 8GB maximum as sequential writes
      proceed).
      
      Update the speculative prealloc heuristic to base the size of the
      next prealloc on double the size of the preceding extent. This
      preserves the original aggressive speculative preallocation
      behavior and continues to accommodate sparse files at a slight cost
      of increasing the size of preallocated data regions following holes
      of sparse files.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      e114b5fc
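
The doubling heuristic from this commit can be sketched as a pure function. This is an assumption-laden simplification: the name `ex_next_prealloc` and the byte-based interface are hypothetical, and the case where there is no preceding extent is not modelled. Only the "double the previous extent, up to a maximum" rule comes from the commit message.

```c
#include <stdint.h>

/*
 * Size of the next speculative preallocation: twice the size of the
 * preceding extent, clamped to max_bytes. The comparison against
 * max_bytes / 2 avoids overflow when doubling.
 */
static uint64_t ex_next_prealloc(uint64_t prev_extent_bytes,
				 uint64_t max_bytes)
{
	if (prev_extent_bytes >= max_bytes / 2)
		return max_bytes;
	return prev_extent_bytes * 2;
}
```

Because each step depends only on the size of the extent just written, a sequential writer keeps growing towards the 8GB maximum instead of stalling at a fixed size determined by writeback slicing.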
    • xfs: fix potential infinite loop in xfs_iomap_prealloc_size() · e78c420b
      Brian Foster authored
      If freesp == 0, we could end up in an infinite loop while squashing
      the preallocation. Break the loop when we've killed the prealloc
      entirely.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      e78c420b
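
The loop hazard fixed by this commit can be illustrated as follows. The function name and the shift amount are assumptions chosen for the sketch; the commit message only states that the squashing loop could spin forever when freesp is 0 and that the fix breaks out once the preallocation reaches zero.

```c
#include <stdint.h>

/*
 * Shrink a preallocation until it fits below the free space. Without
 * the "alloc_blocks &&" test, freesp == 0 keeps the condition
 * "alloc_blocks >= freesp" true even after alloc_blocks reaches 0,
 * so the loop would never terminate.
 */
static uint64_t ex_squash_prealloc(uint64_t alloc_blocks, uint64_t freesp)
{
	while (alloc_blocks && alloc_blocks >= freesp)
		alloc_blocks >>= 4;	/* shift amount is illustrative */
	return alloc_blocks;
}
```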