1. 23 Dec, 2010 1 commit
    • xfs: don't truncate prealloc from frequently accessed inodes · 6e857567
      Dave Chinner authored
      A long standing problem for streaming writes through the NFS server
      has been that the NFS server opens and closes file descriptors on an
      inode for every write. The result of this behaviour is that the
      ->release() function is called on every close and that results in
      XFS truncating speculative preallocation beyond the EOF.  This has
      an adverse effect on file layout when multiple files are being
      written at the same time - they interleave their extents and can
      result in severe fragmentation.
      
      To avoid this problem, keep track of ->release calls made on a dirty
      inode. For most cases, an inode is only going to be opened once for
      writing and then closed again during its lifetime in cache. Hence
      if there are multiple ->release calls when the inode is dirty, there
      is a good chance that the inode is being accessed by the NFS server.
      Hence set a flag the first time ->release is called while there are
      delalloc blocks still outstanding on the inode.
      
      If this flag is set when ->release is next called, then do not
      truncate away the speculative preallocation - leave it there so that
      subsequent writes do not need to reallocate the delalloc space. This
      will prevent interleaving of extents of different inodes written
      concurrently to the same AG.
      
      If we get this wrong, it is not a big deal as we truncate
      speculative allocation beyond EOF anyway in xfs_inactive() when the
      inode is thrown out of the cache.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      6e857567
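The heuristic in this commit message can be modelled in user space. The following is a minimal sketch, not the patch itself: `toy_inode`, `toy_release()` and the `dirty_release` field are hypothetical names chosen for illustration, and the real kernel structures and locking are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of an inode; not the kernel structure. */
struct toy_inode {
    long delalloc_blocks;  /* delayed-allocation blocks still outstanding */
    long eof_prealloc;     /* speculative blocks held beyond EOF */
    bool dirty_release;    /* set once ->release has been seen while dirty */
};

/*
 * Model of ->release(). Returns true if the speculative preallocation
 * beyond EOF was truncated, false if it was left in place.
 */
static bool toy_release(struct toy_inode *ip)
{
    assert(ip != NULL);

    /* Flag already set: this inode is being opened and closed repeatedly. */
    if (ip->dirty_release)
        return false;

    /* First release: truncate as before. */
    ip->eof_prealloc = 0;

    /* Remember that a release happened with delalloc blocks outstanding. */
    if (ip->delalloc_blocks > 0)
        ip->dirty_release = true;

    return true;
}
```

In this model the first close of a dirty inode still truncates, and only the second and later closes keep the preallocation, which matches the "set a flag the first time" wording above.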
  2. 04 Jan, 2011 1 commit
    • xfs: dynamic speculative EOF preallocation · 055388a3
      Dave Chinner authored
      Currently the size of the speculative preallocation during delayed
      allocation is fixed by either the allocsize mount option or a
      default size. We are seeing a lot of cases where we need to
      recommend using the allocsize mount option to prevent fragmentation
      when buffered writes land in the same AG.
      
      Rather than using a fixed preallocation size by default (up to 64k),
      make it dynamic by basing it on the current inode size. That way the
      EOF preallocation will increase as the file size increases.  Hence
      for streaming writes we are much more likely to get large
      preallocations exactly when we need them to reduce fragmentation.
      
      For default settings, the size of the initial extents is determined
      by the number of parallel writers and the amount of memory in the
      machine. For 4GB RAM and 4 concurrent 32GB file writes:
      
      EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
         0: [0..1048575]:         1048672..2097247      0 (1048672..2097247)      1048576
         1: [1048576..2097151]:   5242976..6291551      0 (5242976..6291551)      1048576
         2: [2097152..4194303]:   12583008..14680159    0 (12583008..14680159)    2097152
         3: [4194304..8388607]:   25165920..29360223    0 (25165920..29360223)    4194304
         4: [8388608..16777215]:  58720352..67108959    0 (58720352..67108959)    8388608
         5: [16777216..33554423]: 117440584..134217791  0 (117440584..134217791) 16777208
         6: [33554424..50331511]: 184549056..201326143  0 (184549056..201326143) 16777088
         7: [50331512..67108599]: 251657408..268434495  0 (251657408..268434495) 16777088
      
      and for 16 concurrent 16GB file writes:
      
       EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
         0: [0..262143]:          2490472..2752615      0 (2490472..2752615)       262144
         1: [262144..524287]:     6291560..6553703      0 (6291560..6553703)       262144
         2: [524288..1048575]:    13631592..14155879    0 (13631592..14155879)     524288
         3: [1048576..2097151]:   30408808..31457383    0 (30408808..31457383)    1048576
         4: [2097152..4194303]:   52428904..54526055    0 (52428904..54526055)    2097152
         5: [4194304..8388607]:   104857704..109052007  0 (104857704..109052007)  4194304
         6: [8388608..16777215]:  209715304..218103911  0 (209715304..218103911)  8388608
         7: [16777216..33554423]: 452984848..469762055  0 (452984848..469762055) 16777208
      
      Because it is hard to take back speculative preallocation, cases
      where there are large slow growing log files on a nearly full
      filesystem may cause premature ENOSPC. Hence as the filesystem nears
      full, the maximum dynamic prealloc size is reduced according to this
      table (based on 4k block size):
      
      freespace       max prealloc size
        >5%             full extent (8GB)
        4-5%             2GB (8GB >> 2)
        3-4%             1GB (8GB >> 3)
        2-3%           512MB (8GB >> 4)
        1-2%           256MB (8GB >> 5)
        <1%            128MB (8GB >> 6)
      
      This should reduce the amount of space held in speculative
      preallocation for such cases.
      
      The allocsize mount option turns off the dynamic behaviour and fixes
      the prealloc size to whatever the mount option specifies. i.e. the
      behaviour is unchanged.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      055388a3
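The free-space throttling table in this commit message can be written as a small lookup. This is only a sketch of the table: the names `toy_prealloc_shift()`, `toy_max_prealloc()` and `TOY_MAX_PREALLOC` are hypothetical, and how exact boundary values (for example exactly 5% free) are classified is an assumption, since the table does not say.

```c
#include <assert.h>
#include <stdint.h>

/* Full-extent maximum from the table (8GB, for a 4k block size). */
#define TOY_MAX_PREALLOC (8ULL << 30)

/*
 * Right shift applied to the maximum prealloc size for a given
 * percentage of free space. Boundary handling is an assumption.
 */
static unsigned toy_prealloc_shift(double free_pct)
{
    assert(free_pct >= 0.0 && free_pct <= 100.0);

    if (free_pct > 5.0)
        return 0;   /* >5%:  full extent */
    if (free_pct >= 4.0)
        return 2;   /* 4-5%: 2GB   */
    if (free_pct >= 3.0)
        return 3;   /* 3-4%: 1GB   */
    if (free_pct >= 2.0)
        return 4;   /* 2-3%: 512MB */
    if (free_pct >= 1.0)
        return 5;   /* 1-2%: 256MB */
    return 6;       /* <1%:  128MB */
}

/* Maximum dynamic preallocation size in bytes for this free-space level. */
static uint64_t toy_max_prealloc(double free_pct)
{
    return TOY_MAX_PREALLOC >> toy_prealloc_shift(free_pct);
}
```

For example, `toy_max_prealloc(2.5)` gives 512MB in this model, the 2-3% row of the table.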
  3. 23 Dec, 2010 2 commits
    • xfs: use KM_NOFS for allocations during attribute list operations · 622d8149
      Dave Chinner authored
      When listing attributes, we are doing memory allocations under the
      inode ilock using only KM_SLEEP. This allows memory allocation to
      recurse back into the filesystem and do writeback, which may take the
      ilock we already hold on the current inode. This will deadlock.
      Hence use KM_NOFS for such allocations outside of transaction
      context to ensure that reclaim recursion does not occur.
      Reported-by: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      622d8149
    • xfs: provide a inode iolock lockdep class · dcfcf205
      Dave Chinner authored
      The XFS iolock needs to be re-initialised to a new lock class before
      it enters reclaim to prevent lockdep false positives. Unfortunately,
      this is not sufficient protection as inodes in the XFS_IRECLAIMABLE
      state can be recycled and not re-initialised before being reused.
      
      We need to re-initialise the lock state when transferring out of
      XFS_IRECLAIMABLE state to XFS_INEW, but we need to keep the same
      class as if the inode was just allocated. Hence we need a specific
      lockdep class variable for the iolock so that both initialisations
      use the same class.
      
      While there, add a specific class for inodes in the reclaim state so
      that it is easy to tell from lockdep reports what state the inode
      was in that generated the report.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      dcfcf205
  4. 16 Dec, 2010 17 commits
  5. 09 Dec, 2010 1 commit
    • xfs: log timestamp changes to the source inode in rename · 05340d4a
      Christoph Hellwig authored
      Now that we don't mark VFS inodes dirty anymore for internal
      timestamp changes, but rely on the transaction subsystem to push
      them out, we need to explicitly log the source inode in rename after
      updating its timestamps to make sure the changes actually get
      forced out by sync/fsync or an AIL push.
      
      We already account for the fourth inode in the log reservation, as a
      rename of directories needs to update the nlink field, so just
      adding the xfs_trans_log_inode call is enough.
      
      This fixes the xfsqa 065 regression introduced by:
      
      	"xfs: don't use vfs writeback for pure metadata modifications"
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      05340d4a
  6. 01 Dec, 2010 5 commits
    • xfs: only run xfs_error_test if error injection is active · c76febef
      Dave Chinner authored
      Recent tests writing lots of small files showed the flusher thread
      being CPU bound and taking a long time to do allocations on a debug
      kernel. perf showed this as the prime reason:
      
                   samples  pcnt function                    DSO
                   _______ _____ ___________________________ _________________
      
                 224648.00 36.8% xfs_error_test              [kernel.kallsyms]
                  86045.00 14.1% xfs_btree_check_sblock      [kernel.kallsyms]
                  39778.00  6.5% prandom32                   [kernel.kallsyms]
                  37436.00  6.1% xfs_btree_increment         [kernel.kallsyms]
                  29278.00  4.8% xfs_btree_get_rec           [kernel.kallsyms]
                  27717.00  4.5% random32                    [kernel.kallsyms]
      
      Walking and checking btree blocks during allocation requires a call
      to xfs_error_test() for each block (a cache hit, so no I/O), which
      then does a random32() call as the first operation.  IOWs, ~50% of the
      CPU is being consumed just testing whether we need to inject an
      error, even though error injection is not active.
      
      Kill this overhead when error injection is not active by adding a
      global counter of active error traps and only calling into
      xfs_error_test when fault injection is active.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      c76febef
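The "global counter of active error traps" idea can be sketched in plain C. This is an illustration under assumptions rather than the kernel code: `toy_active_traps`, `toy_error_test_slow()` and `TOY_TEST_ERROR()` are hypothetical names, and `rand()` stands in for the kernel's random number call.

```c
#include <assert.h>
#include <stdlib.h>

static int toy_active_traps;          /* number of error traps currently armed */
static unsigned long toy_slow_calls;  /* how many times the slow path ran */

/* Slow path: draws a random number on every call, like the original check. */
static int toy_error_test_slow(unsigned long randfactor)
{
    assert(randfactor > 0);
    toy_slow_calls++;
    return (unsigned long)rand() % randfactor == 0;
}

/*
 * Cheap gate in front of the slow path: when no trap is armed, the
 * short-circuit && means the random number is never drawn.
 */
#define TOY_TEST_ERROR(expr, randfactor) \
    ((expr) || (toy_active_traps > 0 && toy_error_test_slow(randfactor)))
```

With `toy_active_traps` at zero, the cost per check in this model is one integer comparison, however many blocks are walked.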
    • xfs: avoid moving stale inodes in the AIL · de25c181
      Dave Chinner authored
      When an inode has been marked stale because the cluster is being
      freed, we don't want to (re-)insert this inode into the AIL. There
      is a race condition where the cluster buffer may be unpinned before
      the inode is inserted into the AIL during transaction committed
      processing. If the buffer is unpinned before the inode item has been
      committed and inserted, then it is possible for the buffer to be
      released and hence process the stale inode callbacks before the inode
      is inserted into the AIL.
      
      In this case, we then insert a clean, stale inode into the AIL which
      will never get removed by an IO completion. It will, however, get
      reclaimed and that triggers an assert in xfs_inode_free()
      complaining about freeing an inode still in the AIL.
      
      This race can be avoided by not moving stale inodes forward in the AIL
      during transaction commit completion processing. This closes the
      race condition by ensuring we never insert clean stale inodes into
      the AIL. It is safe to do this because a dirty stale inode, by
      definition, must already be in the AIL.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      de25c181
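The rule "do not move stale inodes forward in the AIL at commit completion" can be shown with a toy model. The names `toy_item` and `toy_committed()` are hypothetical, and the AIL is reduced to a flag and a sequence number; the sketch only illustrates the decision, not the real list handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical log item: just enough state to show the decision. */
struct toy_item {
    bool stale;   /* cluster is being freed */
    bool in_ail;  /* item is currently tracked in the AIL */
    long lsn;     /* position in the AIL, if in_ail */
};

/* Commit-completion processing for one item. */
static void toy_committed(struct toy_item *it, long commit_lsn)
{
    assert(it != NULL);

    /*
     * Never insert or move a stale item. A dirty stale item is already
     * in the AIL; a clean one must stay out of it.
     */
    if (it->stale)
        return;

    it->in_ail = true;
    it->lsn = commit_lsn;
}
```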
    • xfs: delayed alloc blocks beyond EOF are valid after writeback · 309c8480
      Dave Chinner authored
      There is an assumption in parts of XFS that flushing a dirty
      file will make all the delayed allocation blocks disappear from an
      inode. That is, that after calling xfs_flush_pages(),
      ip->i_delayed_blks will be zero.
      
      This is an invalid assumption as we may have speculative
      preallocation beyond EOF, and those blocks are recorded in
      ip->i_delayed_blks. A flush of the dirty pages of an inode will not
      change the state of these blocks beyond EOF, so a non-zero
      delalloc block count after a flush is valid.
      
      The bmap code has an invalid ASSERT() that needs to be removed, and
      the swapext code has a bug in that while it swaps the data forks
      around, it fails to swap the i_delayed_blks counter associated with
      the fork and hence can get the block accounting wrong.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      309c8480
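The swapext accounting bug can be illustrated with a small model: a counter that describes a fork has to travel with that fork. The names `swap_fork`, `swap_inode` and `toy_swap_forks()` are hypothetical and the structures are heavily simplified.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a data fork. */
struct swap_fork {
    int nextents;
};

/* Hypothetical inode: the counter belongs to data_fork. */
struct swap_inode {
    struct swap_fork data_fork;
    long delayed_blks;  /* delalloc blocks accounted to data_fork */
};

/* Swap the data forks of two inodes together with their accounting. */
static void toy_swap_forks(struct swap_inode *a, struct swap_inode *b)
{
    assert(a != NULL && b != NULL);

    struct swap_fork tmp_fork = a->data_fork;
    a->data_fork = b->data_fork;
    b->data_fork = tmp_fork;

    /* Omitting this swap leaves each counter describing the other fork. */
    long tmp_blks = a->delayed_blks;
    a->delayed_blks = b->delayed_blks;
    b->delayed_blks = tmp_blks;
}
```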
    • xfs: push stale, pinned buffers on trylock failures · 90810b9e
      Dave Chinner authored
      As reported by Nick Piggin, XFS is suffering from long pauses under
      highly concurrent workloads when hosted on ramdisks. The problem is
      that an inode buffer is stuck in the pinned state in memory and as a
      result either the inode buffer or one of the inodes within the
      buffer is stopping the tail of the log from being moved forward.
      
      The system remains in this state until a periodic log force issued
      by xfssyncd causes the buffer to be unpinned. The main problem is
      that these are stale buffers, and are hence held locked until the
      transaction/checkpoint that marked them stale has been committed to
      disk. When the filesystem gets into this state, only the xfssyncd
      can cause the async transactions to be committed to disk and hence
      unpin the inode buffer.
      
      This problem was encountered when scaling the busy extent list, but
      only the blocking lock interface was fixed to solve the problem.
      Extend the same fix to the buffer trylock operations - if we fail to
      lock a pinned, stale buffer, then force the log immediately so that
      when the next attempt to lock it comes around, it will have been
      unpinned.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      90810b9e
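The trylock behaviour described above can be sketched as a single-threaded toy. All names here (`toy_buf`, `toy_log_force()`, `toy_buf_trylock()`) are hypothetical, and the log force is modelled as completing synchronously, which is a simplification of the real asynchronous commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer state. */
struct toy_buf {
    bool locked;  /* held locked, e.g. by the transaction that staled it */
    bool pinned;  /* pinned in memory until the log commits */
    bool stale;   /* marked stale by a not-yet-committed transaction */
};

static unsigned toy_log_forces;  /* how many log forces were issued */

/* Model of a log force: the commit completes, unpinning and unlocking. */
static void toy_log_force(struct toy_buf *bp)
{
    toy_log_forces++;
    bp->pinned = false;
    bp->locked = false;
}

/* Trylock that kicks the log when it loses to a pinned, stale buffer. */
static bool toy_buf_trylock(struct toy_buf *bp)
{
    assert(bp != NULL);

    if (!bp->locked) {
        bp->locked = true;
        return true;
    }

    /* Failed: make sure the next attempt can succeed. */
    if (bp->pinned && bp->stale)
        toy_log_force(bp);
    return false;
}
```

The first attempt still fails in this model; the point is that the failure triggers the force instead of waiting for a periodic one.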
    • xfs: fix failed write truncation handling. · c726de44
      Dave Chinner authored
      Since the move to the new truncate sequence we call xfs_setattr to
      truncate down excessively instantiated blocks.  As shown by the testcase
      in kernel.org BZ #22452 that doesn't work too well.  Due to confusion
      between the internal inode size and the VFS inode i_size, it zeroes data
      that it shouldn't.
      
      But full blown truncate seems like overkill here.  We only instantiate
      delayed allocations in the write path, and given that we never released
      the iolock we can't have converted them to real allocations yet either.
      
      The only nasty case is pre-existing preallocation which we need to skip.
      We already do this for page discard during writeback, so make the delayed
      allocation block punching a generic function and call it from the failed
      write path as well as xfs_aops_discard_page. The callers are
      responsible for ensuring that partial blocks are not truncated away,
      and that they hold the ilock.
      
      Based on a fix originally from Christoph Hellwig. This version used
      filesystem blocks as the range unit.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      c726de44
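A rough model of "punch out only the delayed allocation blocks in a range, skipping pre-existing preallocation" is sketched below. The block map is a flat array and the names `toy_blk` and `toy_punch_delalloc()` are hypothetical; the range is given in whole filesystem blocks, as the message describes, and the caller is assumed to hold whatever lock protects the map.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-block state in a flat block map. */
enum toy_blk {
    TOY_HOLE,      /* nothing allocated */
    TOY_DELALLOC,  /* delayed allocation, not yet on disk */
    TOY_PREALLOC,  /* pre-existing preallocation, must be skipped */
    TOY_REAL       /* real allocated block */
};

/*
 * Punch delayed-allocation blocks in [start, start + len).
 * Returns the number of blocks punched. The caller passes whole blocks
 * only, so partial blocks are never truncated away.
 */
static size_t toy_punch_delalloc(enum toy_blk *map, size_t nblocks,
                                 size_t start, size_t len)
{
    assert(map != NULL);
    assert(start <= nblocks && len <= nblocks - start);

    size_t punched = 0;
    for (size_t i = start; i < start + len; i++) {
        if (map[i] != TOY_DELALLOC)
            continue;  /* leave preallocated and real blocks alone */
        map[i] = TOY_HOLE;
        punched++;
    }
    return punched;
}
```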
  7. 30 Nov, 2010 2 commits
  8. 29 Nov, 2010 11 commits