Commits · ed3b4d6cdc81e8feefdbfa3c584614be301b6d39 · nexedi / linux

24 May, 2010 7 commits

xfs: Improve scalability of busy extent tracking · ed3b4d6c

Dave Chinner authored May 21, 2010

When we free a metadata extent, we record it in the per-AG busy
extent array so that it is not re-used before the freeing
transaction hits the disk. This array is fixed size, so when it
overflows we make further allocation transactions synchronous
because we cannot track more freed extents until those transactions
hit the disk and are completed. Under heavy mixed allocation and
freeing workloads with large log buffers, we can overflow this array
quite easily.

Further, the array is sparsely populated, which means that inserts
need to search for a free slot, and array searches often have to
search many more slots that are actually used to check all the
busy extents. Quite inefficient, really.

To enable this aspect of extent freeing to scale better, we need
a structure that can grow dynamically. While in other areas of
XFS we have used radix trees, the extents being freed are at random
locations on disk so are better suited to being indexed by an rbtree.

So, use a per-AG rbtree indexed by block number to track busy
extents.  This incures a memory allocation when marking an extent
busy, but should not occur too often in low memory situations. This
should scale to an arbitrary number of extents so should not be a
limitation for features such as in-memory aggregation of
transactions.

However, there are still situations where we can't avoid allocating
busy extents (such as allocation from the AGFL). To minimise the
overhead of such occurences, we need to avoid doing a synchronous
log force while holding the AGF locked to ensure that the previous
transactions are safely on disk before we use the extent. We can do
this by marking the transaction doing the allocation as synchronous
rather issuing a log force.

Because of the locking involved and the ordering of transactions,
the synchronous transaction provides the same guarantees as a
synchronous log force because it ensures that all the prior
transactions are already on disk when the synchronous transaction
hits the disk. i.e. it preserves the free->allocate order of the
extent correctly in recovery.

By doing this, we avoid holding the AGF locked while log writes are
in progress, hence reducing the length of time the lock is held and
therefore we increase the rate at which we can allocate and free
from the allocation group, thereby increasing overall throughput.

The only problem with this approach is that when a metadata buffer is
marked stale (e.g. a directory block is removed), then buffer remains
pinned and locked until the log goes to disk. The issue here is that
if that stale buffer is reallocated in a subsequent transaction, the
attempt to lock that buffer in the transaction will hang waiting
the log to go to disk to unlock and unpin the buffer. Hence if
someone tries to lock a pinned, stale, locked buffer we need to
push on the log to get it unlocked ASAP. Effectively we are trading
off a guaranteed log force for a much less common trigger for log
force to occur.

Ideally we should not reallocate busy extents. That is a much more
complex fix to the problem as it involves direct intervention in the
allocation btree searches in many places. This is left to a future
set of modifications.

Finally, now that we track busy extents in allocated memory, we
don't need the descriptors in the transaction structure to point to
them. We can replace the complex busy chunk infrastructure with a
simple linked list of busy extents. This allows us to remove a large
chunk of code, making the overall change a net reduction in code
size.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

ed3b4d6c

xfs: make the log ticket ID available outside the log infrastructure · 955833cf

Dave Chinner authored May 14, 2010

The ticket ID is needed to uniquely identify transactions when doing busy
extent matching. Delayed logging changes the lifecycle of busy extents with
respect to the transaction structure lifecycle. Hence we can no longer use
the transaction structure as a means of determining the owner of the busy
extent as it may be freed and reused while the busy extent is still active.

This commit provides the infrastructure to access the xlog_tid_t held in the
ticket from a transaction handle. This avoids the need for callers to peek
into the transaction and log structures to find this out.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

955833cf

xfs: clean up log ticket overrun debug output · 169a7b07

Dave Chinner authored May 07, 2010

Push the error message output when a ticket overrun is detected
into the ticket printing functions. Also remove the debug version
of the code as the production version will still panic just as
effectively on a debug kernel via the panic mask being set.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

169a7b07

xfs: Clean up XFS_BLI_* flag namespace · c1155410

Dave Chinner authored May 07, 2010

Clean up the buffer log format (XFS_BLI_*) flags because they have a
polluted namespace. They XFS_BLI_ prefix is used for both in-memory
and on-disk flag feilds, but have overlapping values for different
flags. Rename the buffer log format flags to use the XFS_BLF_*
prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed
flags.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

c1155410

xfs: modify buffer item reference counting · 64fc35de

Dave Chinner authored May 07, 2010

The buffer log item reference counts used to take referenceѕ for every
transaction, similar to the pin counting. This is symmetric (like the
pin/unpin) with respect to transaction completion, but with dleayed logging
becomes assymetric as the pinning becomes assymetric w.r.t. transaction
completion.

To make both cases the same, allow the buffer pinning to take a reference to
the buffer log item and always drop the reference the transaction has on it
when being unlocked. This is balanced correctly because the unpin operation
always drops a reference to the log item. Hence reference counting becomes
symmetric w.r.t. item pinning as well as w.r.t active transactions and as a
result the reference counting model remain consistent between normal and
delayed logging.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

64fc35de

xfs: allow log ticket allocation to take allocation flags · 3383ca57

Dave Chinner authored May 07, 2010

Delayed logging currently requires ticket allocation to succeed, so
we need to be able to sleep on allocation. It also should not allow
memory allocation to recurse into the filesystem. hence we need to
pass allocation flags directing the type of allocation the caller
requires.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

3383ca57

xfs: Don't reuse the same transaction ID for duplicated transactions. · 524ee36f

Dave Chinner authored May 07, 2010

The transaction ID is written into the log as the unique identifier
for transactions during recover. When duplicating a transaction, we
reuse the log ticket, which means it has the same transaction ID as
the previous transaction.

Rather than regenerating a random transaction ID for the duplicated
transaction, just add one to the current ID so that duplicated
transaction can be easily spotted in the log and during recovery
during problem diagnosis.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

524ee36f

19 May, 2010 33 commits

xfs: mark xfs_iomap_write_ helpers static · b4ed4626

Christoph Hellwig authored Apr 28, 2010

And also drop a useless argument to xfs_iomap_write_direct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

b4ed4626

xfs: clean up end index calculation in xfs_page_state_convert · bd1556a1
Christoph Hellwig authored Apr 28, 2010
```
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
```
bd1556a1
xfs: clean up mapping size calculation in __xfs_get_blocks · 2b8f12b7
Christoph Hellwig authored Apr 28, 2010
```
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
```
2b8f12b7

xfs: clean up xfs_iomap_valid · 558e6891

Christoph Hellwig authored Apr 28, 2010

Rename all iomap_valid identifiers to imap_valid to fit the new
world order, and clean up xfs_iomap_valid to convert the passed in
offset to blocks instead of the imap values to bytes.  Use the
simpler inode->i_blkbits instead of the XFS macros for this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

558e6891

xfs: move I/O type flags into xfs_aops.c · 34a52c6c

Christoph Hellwig authored Apr 28, 2010

The IOMAP_ flags are now only used inside xfs_aops.c for extent
probing and I/O completion tracking, so more them here, and rename
them to IO_* as there's no mapping involved at all.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

34a52c6c

xfs: kill struct xfs_iomap · 207d0416

Christoph Hellwig authored Apr 28, 2010

Now that struct xfs_iomap contains exactly the same units as struct
xfs_bmbt_irec we can just use the latter directly in the aops code.
Replace the missing IOMAP_NEW flag with a new boolean output
parameter to xfs_iomap.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

207d0416

xfs: report iomap_bn in block base · e513182d

Christoph Hellwig authored Apr 28, 2010

Report the iomap_bn field of struct xfs_iomap in terms of filesystem
blocks instead of in terms of bytes.  Shift the byte conversions
into the caller, and replace the IOMAP_DELAY and IOMAP_HOLE flag
checks with checks for HOLESTARTBLOCK and DELAYSTARTBLOCK.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

e513182d

xfs: report iomap_offset and iomap_bsize in block base · 8699bb0a

Christoph Hellwig authored Apr 28, 2010

Report the iomap_offset and iomap_bsize fields of struct xfs_iomap
in terms of fsblocks instead of in terms of disk blocks.  Shift the
byte conversions into the callers temporarily, but they will
disappear or get cleaned up later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

8699bb0a

xfs: remove iomap_delta · 9563b3d8

Christoph Hellwig authored Apr 28, 2010

The iomap_delta field in struct xfs_iomap just contains the
difference between the offset passed to xfs_iomap and the
iomap_offset.  Just calculate it in the only caller that cares.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

9563b3d8

xfs: remove iomap_target · 046f1685

Christoph Hellwig authored Apr 28, 2010

Instead of using the iomap_target field in struct xfs_iomap
and the IOMAP_REALTIME flag just use the already existing
xfs_find_bdev_for_inode helper.  There's some fallout as we
need to pass the inode in a few more places, which we also
use to sanitize some calling conventions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

046f1685

xfs: limit xfs_imap_to_bmap to a single mapping · 826bf0ad

Christoph Hellwig authored Apr 28, 2010

We only call xfs_iomap for single mappings anyway, so remove all
code dealing with multiple mappings from xfs_imap_to_bmap and add
asserts that we never get results that we do not expect.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

826bf0ad

xfs: simplify buffer to transaction matching · 4a5224d7

Christoph Hellwig authored Apr 18, 2010

We currenly have a routine xfs_trans_buf_item_match_all which checks
if any log item in a transaction contains a given buffer, and a
second one that only does this check for the first, embedded chunk
of log items.  We only use the second routine if we know we only
have that log item chunk, so get rid of the limited routine and
always use the more complete one.

Also rename the old xfs_trans_buf_item_match_all to
xfs_trans_buf_item_match and update various surrounding comments,
and move the remaining xfs_trans_buf_item_match on top of the file
to avoid a forward prototype.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

4a5224d7

xfs: Make fiemap work in query mode. · 2d1ff3c7

Tao Ma authored Apr 29, 2010

According to Documentation/filesystems/fiemap.txt, If fm_extent_count
is zero, then the fm_extents[] array is ignored (no extents will be
returned), and the fm_mapped_extents count will hold the number of
extents needed.

But as the commit 97db39a1 has changed
bmv_count to the caller's input buffer, this number query function can't
work any more. As this commit is written to change bmv_count from
MAXEXTNUM because of ENOMEM.

This patch just try to  set bm.bmv_count to something sane.
Thanks to Dave Chinner <david@fromorbit.com> for the suggestion.

Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tao Ma <tao.ma@oracle.com>

2d1ff3c7

xfs: kill off l_sectbb_mask · 48389ef1

Alex Elder authored Apr 20, 2010

There remains only one user of the l_sectbb_mask field in the log
structure.  Just kill it off and compute the mask where needed from
the power-of-2 sector size.

(Only update from last post is to accomodate the changes in the
previous patch in the series.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

48389ef1

xfs: record log sector size rather than log2(that) · 69ce58f0

Alex Elder authored Apr 20, 2010

Change struct log so it keeps track of the size (in basic blocks) of
a log sector in l_sectBBsize rather than the log-base-2 of that
value (previously, l_sectbb_log).  The name was chosen for
consistency with the other fields in the structure that represent
a number of basic blocks.

(Updated so that a variable used in computing and verifying a log's
sector size is named "log2_size".  Also added the "BB" to the
structure field name, based on feedback from Eric Sandeen.  Also
dropped some superfluous parentheses.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>

69ce58f0

xfs: remove dead XFS_LOUD_RECOVERY code · 1414a604

Christoph Hellwig authored Apr 20, 2010

This can't be enabled through the build system and has been dead for
ages.  Note that the CRC patches add back log checksumming, but the
code is quite different from the version removed here anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>

1414a604

xfs: removed unused XFS_QMOPT_ flags · 8112e9dc

Christoph Hellwig authored Apr 20, 2010

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>

8112e9dc

xfs: remove a few macro indirections in the quota code · 191f8488
Christoph Hellwig authored Apr 20, 2010
```
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
```
191f8488

xfs: access quotainfo structure directly · 8a7b8a89

Christoph Hellwig authored Apr 20, 2010

Access fields in m_quotainfo directly instead of hiding them behind the
XFS_QI_* macros.  Add local variables for the quotainfo pointer in places
where we have lots of them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>

8a7b8a89

xfs: wait for direct I/O to complete in fsync and write_inode · 37bc5743

Christoph Hellwig authored Apr 20, 2010

We need to wait for all pending direct I/O requests before taking care of
metadata in fsync and write_inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>

37bc5743

xfs: xfs_trace.c: duplicated include · fce1cad6

Andrea Gelmini authored Mar 25, 2010

fs/xfs/linux-2.6/xfs_trace.c: xfs_attr_sf.h is included more than once.
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Alex Elder <aelder@sgi.com>

fce1cad6

xfs: minor odds and ends in xfs_log_recover.c · 3f943d85

Alex Elder authored Apr 15, 2010

Odds and ends in "xfs_log_recover.c".  This patch just contains some
minor things that didn't seem to warrant their own individual
patches:
- In xlog_bread_noalign(), drop an assertion that a pointer is
  non-null (the crash will tell us it was a bad pointer).
- Add a more descriptive header comment for xlog_find_verify_cycle().
- Make a few additions to the comments in xlog_find_head().  Also
  rearrange some expressions in a few spots to produce the same
  result, but in a way that seems more clear what's being computed.

(Updated in response to Dave's review comments.  Note I did not
split this patch like I said I would.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

3f943d85

xfs: avoid repeated pointer dereferences · e3bb2e30

Alex Elder authored Apr 15, 2010

In xlog_find_cycle_start() use a local variable for some repeated
operations rather than constantly accessing the memory location
whose address is passed in.

(This version drops an assertion that a pointer is non-null.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

e3bb2e30

xfs: change a few labels in xfs_log_recover.c · 9db127ed

Alex Elder authored Apr 15, 2010

Rename a label used in xlog_find_head() that I thought was poorly
chosen.  Also combine two adjacent labels xlog_find_tail() into a
single label, and give it a more generic name.

(Now using Dave's suggested "validate_head" name for first label.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

9db127ed

xfs: enforce synchronous writes in xfs_bwrite · 8c38366f

Christoph Hellwig authored Mar 12, 2010

xfs_bwrite is used with the intention of synchronously writing out
buffers, but currently it does not actually clear the async flag if
that's left from previous writes but instead implements async
behaviour if it finds it.  Remove the code handling asynchronous
writes as we've got rid of those entirely outside of the log and
delwri buffers, and make sure that we clear the async and read flags
before writing the buffer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

8c38366f

xfs: remove periodic superblock writeback · df308bcf

Christoph Hellwig authored Mar 12, 2010

All modifications to the superblock are done transactional through
xfs_trans_log_buf, so there is no reason to initiate periodic
asynchronous writeback.  This only removes the superblock from the
delwri list and will lead to sub-optimal I/O scheduling.

Cut down xfs_sync_fsdata now that it's only used for synchronous
superblock writes and move the log coverage checks into the two
callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

df308bcf

xfs: make the log ticket transaction id random · f9837107

Dave Chinner authored Apr 14, 2010

The transaction ID that is written to the log for a transaction is
currently set by taking the lower 32 bits of the memory address of
the ticket structure.  This is not guaranteed to be unique as
tickets comes from a slab and slots can be reallocated immediately
after being freed. As a result, there is no guarantee of uniqueness
in the ticket ID value.

Fix this by assigning a random number to the ticket ID field so that
it is extremely unlikely that duplicates will occur and remove the
possibility of transactions being mixed up during recovery due to
duplicate IDs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

f9837107

xfs: nothing special about 1-block log sector · 36adecff