1. 24 May, 2010 8 commits
    • xfs: Delayed logging design documentation · a9a745da
      Dave Chinner authored
      Document the design of the delayed logging implementation. This
      includes assumptions made, dead ends followed, the reasoning behind
      the structuring of the code, the layout of various structures, how
      things fit together, traps and pitfalls avoided, etc. This is all
      too much to document in the code itself, so do it in a separate
      file.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: Improve scalability of busy extent tracking · ed3b4d6c
      Dave Chinner authored
      When we free a metadata extent, we record it in the per-AG busy
      extent array so that it is not re-used before the freeing
      transaction hits the disk. This array is fixed size, so when it
      overflows we make further allocation transactions synchronous
      because we cannot track more freed extents until those transactions
      hit the disk and are completed. Under heavy mixed allocation and
      freeing workloads with large log buffers, we can overflow this array
      quite easily.
      
      Further, the array is sparsely populated, which means that inserts
      need to search for a free slot, and array searches often have to
      search many more slots than are actually used to check all the
      busy extents. Quite inefficient, really.
      
      To enable this aspect of extent freeing to scale better, we need
      a structure that can grow dynamically. While in other areas of
      XFS we have used radix trees, the extents being freed are at random
      locations on disk so are better suited to being indexed by an rbtree.
      
      So, use a per-AG rbtree indexed by block number to track busy
      extents.  This incurs a memory allocation when marking an extent
      busy, but should not occur too often in low memory situations. This
      should scale to an arbitrary number of extents so should not be a
      limitation for features such as in-memory aggregation of
      transactions.
      
      However, there are still situations where we can't avoid allocating
      busy extents (such as allocation from the AGFL). To minimise the
      overhead of such occurrences, we need to avoid doing a synchronous
      log force while holding the AGF locked to ensure that the previous
      transactions are safely on disk before we use the extent. We can do
      this by marking the transaction doing the allocation as synchronous
      rather than issuing a log force.
      
      Because of the locking involved and the ordering of transactions,
      the synchronous transaction provides the same guarantees as a
      synchronous log force because it ensures that all the prior
      transactions are already on disk when the synchronous transaction
      hits the disk. i.e. it preserves the free->allocate order of the
      extent correctly in recovery.
      
      By doing this, we avoid holding the AGF locked while log writes are
      in progress, hence reducing the length of time the lock is held and
      therefore we increase the rate at which we can allocate and free
      from the allocation group, thereby increasing overall throughput.
      
      The only problem with this approach is that when a metadata buffer is
      marked stale (e.g. a directory block is removed), the buffer remains
      pinned and locked until the log goes to disk. The issue here is that
      if that stale buffer is reallocated in a subsequent transaction, the
      attempt to lock that buffer in the transaction will hang waiting for
      the log to go to disk to unlock and unpin the buffer. Hence if
      someone tries to lock a pinned, stale, locked buffer we need to
      push on the log to get it unlocked ASAP. Effectively we are trading
      off a guaranteed log force for a much less common trigger for a log
      force to occur.
      
      Ideally we should not reallocate busy extents. That is a much more
      complex fix to the problem as it involves direct intervention in the
      allocation btree searches in many places. This is left to a future
      set of modifications.
      
      Finally, now that we track busy extents in allocated memory, we
      don't need the descriptors in the transaction structure to point to
      them. We can replace the complex busy chunk infrastructure with a
      simple linked list of busy extents. This allows us to remove a large
      chunk of code, making the overall change a net reduction in code
      size.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
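The busy extent tracking described in the commit above can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel code: POSIX `tsearch()`/`tfind()`/`tdelete()` stand in for the kernel rbtree, a single global tree stands in for one tree per AG, and every name here (`struct busy_extent`, `struct trans`, `busy_insert`, `busy_search`, `busy_clear`, `alloc_extent`) is hypothetical. Extents are ordered by block range, a lookup returns any overlapping busy extent, and an allocation that lands on a busy extent marks its transaction synchronous instead of forcing the log.

```c
#include <assert.h>
#include <search.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one freed extent whose transaction is not on disk. */
struct busy_extent {
    uint64_t bno;   /* first block of the extent */
    uint64_t len;   /* length in blocks */
    uint32_t tid;   /* ID of the transaction that freed it */
};

/* Hypothetical transaction handle; only the sync flag matters here. */
struct trans {
    uint32_t tid;
    bool     sync;
};

static void *busy_root;   /* stand-in for a single AG's tree */

/* Order extents by block range; overlapping ranges compare as equal. */
static int busy_cmp(const void *a, const void *b)
{
    const struct busy_extent *x = a, *y = b;

    if (x->bno + x->len <= y->bno)
        return -1;
    if (y->bno + y->len <= x->bno)
        return 1;
    return 0;
}

/* Return a busy extent overlapping [bno, bno + len), or NULL if none. */
static struct busy_extent *busy_search(uint64_t bno, uint64_t len)
{
    struct busy_extent key = { .bno = bno, .len = len };
    void *slot = tfind(&key, &busy_root, busy_cmp);

    return slot ? *(struct busy_extent **)slot : NULL;
}

/* Record a freed extent: 0 on success, -1 on overlap or allocation failure. */
static int busy_insert(uint64_t bno, uint64_t len, uint32_t tid)
{
    struct busy_extent *e;

    if (busy_search(bno, len))
        return -1;
    e = malloc(sizeof(*e));
    if (!e)
        return -1;
    *e = (struct busy_extent){ .bno = bno, .len = len, .tid = tid };
    if (!tsearch(e, &busy_root, busy_cmp)) {
        free(e);
        return -1;
    }
    return 0;
}

/* Drop a busy extent once the freeing transaction has reached the disk. */
static void busy_clear(uint64_t bno, uint64_t len)
{
    struct busy_extent *e = busy_search(bno, len);

    if (e) {
        tdelete(e, &busy_root, busy_cmp);
        free(e);
    }
}

/* Allocation path: reusing a busy extent makes the transaction synchronous. */
static void alloc_extent(struct trans *tp, uint64_t bno, uint64_t len)
{
    if (busy_search(bno, len))
        tp->sync = true;
}
```

A caller would run `busy_insert()` when freeing an extent, `alloc_extent()` when allocating, and `busy_clear()` from the log completion path. glibc implements the `tsearch()` family with a red-black tree, but POSIX does not require that, so treat the balancing behaviour as an assumption of this sketch.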
    • xfs: make the log ticket ID available outside the log infrastructure · 955833cf
      Dave Chinner authored
      The ticket ID is needed to uniquely identify transactions when doing busy
      extent matching. Delayed logging changes the lifecycle of busy extents with
      respect to the transaction structure lifecycle. Hence we can no longer use
      the transaction structure as a means of determining the owner of the busy
      extent as it may be freed and reused while the busy extent is still active.
      
      This commit provides the infrastructure to access the xlog_tid_t held in the
      ticket from a transaction handle. This avoids the need for callers to peek
      into the transaction and log structures to find this out.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
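The accessor idea in the commit above can be sketched as follows. The types and function names (`tid_t`, `struct log_ticket`, `struct trans`, `trans_get_tid`, `busy_owned_by`) are hypothetical stand-ins, not the XFS definitions. The point is that busy extent ownership is decided by comparing ticket IDs obtained through one accessor, never by comparing transaction pointers, which can be freed and reused.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tid_t;                 /* stand-in for xlog_tid_t */

struct log_ticket { tid_t t_tid; };     /* hypothetical log ticket */
struct trans { struct log_ticket *t_ticket; };

/* Hypothetical busy extent record that remembers its owner by ID. */
struct busy_extent {
    uint64_t bno;
    uint64_t len;
    tid_t    tid;
};

/* The only place that knows how a transaction reaches its ticket ID. */
static tid_t trans_get_tid(const struct trans *tp)
{
    return tp->t_ticket->t_tid;
}

/* Ownership test that stays valid after the transaction struct is recycled. */
static int busy_owned_by(const struct busy_extent *e, const struct trans *tp)
{
    return e->tid == trans_get_tid(tp);
}
```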
    • xfs: clean up log ticket overrun debug output · 169a7b07
      Dave Chinner authored
      Push the error message output when a ticket overrun is detected
      into the ticket printing functions. Also remove the debug version
      of the code as the production version will still panic just as
      effectively on a debug kernel via the panic mask being set.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: Clean up XFS_BLI_* flag namespace · c1155410
      Dave Chinner authored
      Clean up the buffer log format (XFS_BLI_*) flags because they have a
      polluted namespace. The XFS_BLI_ prefix is used for both in-memory
      and on-disk flag fields, but the two sets have overlapping values
      for different flags. Rename the buffer log format flags to use the
      XFS_BLF_* prefix to avoid confusing them with the in-memory
      XFS_BLI_* prefixed flags.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
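A small sketch shows why one prefix for two flag fields is a hazard. The flag names, values, and struct layouts below are illustrative assumptions and are not copied from the XFS headers. Two flag sets share numeric values, so only the prefix tells a reader which field a flag belongs to.

```c
#include <assert.h>
#include <stdint.h>

/* In-memory log item state flags (illustrative names and values). */
#define BLI_HOLD       0x01u
#define BLI_DIRTY      0x02u

/* On-disk buffer log format flags (illustrative names and values). */
#define BLF_INODE_BUF  0x01u   /* same value as BLI_HOLD */
#define BLF_CANCEL     0x02u   /* same value as BLI_DIRTY */

/* Hypothetical on-disk format header that is written to the log. */
struct buf_log_format {
    uint16_t blf_flags;
};

/* Hypothetical in-memory log item wrapping the on-disk format. */
struct buf_log_item {
    unsigned int          bli_flags;    /* never written to disk */
    struct buf_log_format bli_format;
};

/* Each test pairs a prefix with the field of the same name. */
static int item_is_dirty(const struct buf_log_item *bip)
{
    return (bip->bli_flags & BLI_DIRTY) != 0;
}

static int item_is_cancelled(const struct buf_log_item *bip)
{
    return (bip->bli_format.blf_flags & BLF_CANCEL) != 0;
}
```

With a shared prefix, testing the on-disk cancel flag against `bli_flags` would compile and would silently match the dirty flag instead. Distinct prefixes make that mistake visible at the call site.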
    • xfs: modify buffer item reference counting · 64fc35de
      Dave Chinner authored
      The buffer log item reference counts used to take references for every
      transaction, similar to the pin counting. This is symmetric (like the
      pin/unpin) with respect to transaction completion, but with delayed logging
      becomes asymmetric as the pinning becomes asymmetric w.r.t. transaction
      completion.
      
      To make both cases the same, allow the buffer pinning to take a reference to
      the buffer log item and always drop the reference the transaction has on it
      when being unlocked. This is balanced correctly because the unpin operation
      always drops a reference to the log item. Hence reference counting becomes
      symmetric w.r.t. item pinning as well as w.r.t. active transactions and as a
      result the reference counting model remains consistent between normal and
      delayed logging.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
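The reference counting model in the commit above can be sketched as a tiny state machine. All names (`struct buf_log_item`, `trans_join`, `item_pin`, `trans_unlock`, `item_unpin`, `item_put`) are hypothetical, and setting a `freed` flag stands in for releasing the log item. A transaction takes a reference when it joins the item and always drops it on unlock. A pin takes its own reference and the matching unpin drops it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical buffer log item with just the counters of interest. */
struct buf_log_item {
    int  refcount;   /* references held by transactions and by pins */
    int  pincount;   /* outstanding pins */
    bool freed;      /* stand-in for the item having been released */
};

/* Drop one reference; the last one releases the item. */
static void item_put(struct buf_log_item *bip)
{
    assert(bip->refcount > 0);
    if (--bip->refcount == 0)
        bip->freed = true;
}

/* A transaction that joins the item holds one reference. */
static void trans_join(struct buf_log_item *bip)
{
    bip->refcount++;
}

/* Pinning takes a reference of its own. */
static void item_pin(struct buf_log_item *bip)
{
    bip->pincount++;
    bip->refcount++;
}

/* Unlock always drops the reference held by the transaction. */
static void trans_unlock(struct buf_log_item *bip)
{
    item_put(bip);
}

/* Unpin always drops the reference taken by the pin. */
static void item_unpin(struct buf_log_item *bip)
{
    assert(bip->pincount > 0);
    bip->pincount--;
    item_put(bip);
}
```

Under this model the item survives any number of join/unlock cycles while it is pinned once, which is the delayed logging case, and is released only by the final unpin.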
    • xfs: allow log ticket allocation to take allocation flags · 3383ca57
      Dave Chinner authored
      Delayed logging currently requires ticket allocation to succeed, so
      we need to be able to sleep on allocation. It also should not allow
      memory allocation to recurse into the filesystem. Hence we need to
      pass allocation flags directing the type of allocation the caller
      requires.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: Don't reuse the same transaction ID for duplicated transactions. · 524ee36f
      Dave Chinner authored
      The transaction ID is written into the log as the unique identifier
      for transactions during recovery. When duplicating a transaction, we
      reuse the log ticket, which means it has the same transaction ID as
      the previous transaction.
      
      Rather than regenerating a random transaction ID for the duplicated
      transaction, just add one to the current ID so that duplicated
      transactions can be easily spotted in the log and during recovery
      when diagnosing problems.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
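The ID scheme in the commit above can be sketched with a shared ticket whose ID is bumped on duplication. The names (`tid_t`, `struct log_ticket`, `struct trans`, `trans_commit`, `trans_dup`, `log_tids`) are hypothetical, the in-memory array stands in for the on-disk log, and the ordering is simplified so that the original transaction commits before it is duplicated.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tid_t;                 /* stand-in for the log's ID type */

struct log_ticket { tid_t t_tid; };     /* hypothetical log ticket */
struct trans { struct log_ticket *t_ticket; };

#define LOG_MAX 8
static tid_t log_tids[LOG_MAX];         /* transaction IDs in commit order */
static int   log_count;

/* Commit records the ticket's current ID as the transaction identifier. */
static void trans_commit(const struct trans *tp)
{
    assert(log_count < LOG_MAX);
    log_tids[log_count++] = tp->t_ticket->t_tid;
}

/*
 * Duplicate a transaction: the new one reuses the same ticket, and the
 * ticket ID is incremented so the two are distinguishable in the log.
 */
static struct trans trans_dup(const struct trans *tp)
{
    struct trans ntp = { .t_ticket = tp->t_ticket };

    ntp.t_ticket->t_tid++;
    return ntp;
}
```

A chain of duplicated transactions then shows up in the log as consecutive IDs, which is what makes it easy to spot when reading log output.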
  2. 19 May, 2010 32 commits