1. 14 May, 2012 40 commits
    • Jeff Liu's avatar
      xfs: introduce SEEK_DATA/SEEK_HOLE support · 3fe3e6b1
      Jeff Liu authored
      This patch adds lseek(2) SEEK_DATA/SEEK_HOLE functionality to xfs.
      Signed-off-by: default avatarJie Liu <jeff.liu@oracle.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      3fe3e6b1
    • Ben Myers's avatar
      xfs: make xfs_extent_busy_trim not static · e700a06c
      Ben Myers authored
      Commit e459df5, 'xfs: move busy extent handling to it's own file'
      moved some code from xfs_alloc.c into xfs_extent_busy.c for
      convenience in userspace code merges.  One of the functions moved is
      xfs_extent_busy_trim (formerly xfs_alloc_busy_trim) which is defined
      STATIC.  Unfortunately this function is still used in xfs_alloc.c, and
      this results in an undefined symbol in xfs.ko.
      
      Make xfs_extent_busy_trim not static and add its prototype to
      xfs_extent_busy.h.
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      e700a06c
    • Dave Chinner's avatar
      xfs: make XBF_MAPPED the default behaviour · 611c9946
      Dave Chinner authored
      Rather than specifying XBF_MAPPED for almost all buffers, introduce
      XBF_UNMAPPED for the couple of users that use unmapped buffers.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      611c9946
    • Dave Chinner's avatar
      xfs: flush outstanding buffers on log mount failure · d4f3512b
      Dave Chinner authored
      When we fail to mount the log in xfs_mountfs(), we tear down all the
      infrastructure we have already allocated. However, the process of
      mounting the log may have progressed to the point of reading,
      caching and modifying buffers in memory. Hence before we can free
      all the infrastructure, we have to flush and remove all the buffers
      from memory.
      
      Problem first reported by Eric Sandeen, later a different incarnation
      was reported by Ben Myers.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      d4f3512b
    • Dave Chinner's avatar
      xfs: Properly exclude IO type flags from buffer flags · 12bcb3f7
      Dave Chinner authored
      Recent event tracing during a debugging session showed that flags
      that define the IO type for a buffer are leaking into the flags on
      the buffer incorrectly. Fix the flag exclusion mask in
      xfs_buf_alloc() to avoid problems that may be caused by such
      leakage.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      12bcb3f7
    • Dave Chinner's avatar
      xfs: clean up xfs_bit.h includes · ad1e95c5
      Dave Chinner authored
      With the removal of xfs_rw.h and other changes over time, xfs_bit.h
      is being included in many files that don't actually need it. Clean
      up the includes as necessary.
      
      Also move the only-used-once xfs_ialloc_find_free() static inline
      function out of a header file that is widely included to reduce
      the number of needless dependencies on xfs_bit.h.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      ad1e95c5
    • Dave Chinner's avatar
      xfs: move xfs_do_force_shutdown() and kill xfs_rw.c · 2af51f3a
      Dave Chinner authored
      xfs_do_force_shutdown now is the only thing in xfs_rw.c. There is no
      need to keep it in it's own file anymore, so move it to xfs_fsops.c
      next to xfs_fs_goingdown() and kill xfs_rw.c.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      2af51f3a
    • Dave Chinner's avatar
      xfs: move xfs_get_extsz_hint() and kill xfs_rw.h · 2a0ec1d9
      Dave Chinner authored
      The only thing left in xfs_rw.h is a function prototype for an inode
      function.  Move that to xfs_inode.h, and kill xfs_rw.h.
      
      Also move the function implementing the prototype from xfs_rw.c to
      xfs_inode.c so we only have one function left in xfs_rw.c
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      2a0ec1d9
    • Dave Chinner's avatar
      xfs: move xfs_fsb_to_db to xfs_bmap.h · fd50092c
      Dave Chinner authored
      This is the only remaining useful function in xfs_rw.h, so move it
      to a header file responsible for block mapping functions that the
      callers already include. Soon we can get rid of xfs_rw.h.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      fd50092c
    • Dave Chinner's avatar
      xfs: clean up busy extent naming · 4ecbfe63
      Dave Chinner authored
      Now that the busy extent tracking has been moved out of the
      allocation files, clean up the namespace it uses to
      "xfs_extent_busy" rather than a mix of "xfs_busy" and
      "xfs_alloc_busy".
      
      Signed-off-by: Dave Chinner<dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      4ecbfe63
    • Dave Chinner's avatar
      xfs: move busy extent handling to it's own file · efc27b52
      Dave Chinner authored
      To make it easier to handle userspace code merges, move all the busy
      extent handling out of the allocation code and into it's own file.
      The userspace code does not need the busy extent code, so this
      simplifies the merging of the kernel code into the userspace
      xfsprogs library.
      
      Because the busy extent code has been almost completely rewritten
      over the past couple of years, also update the copyright on this new
      file to include the authors that made all those changes.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      efc27b52
    • Dave Chinner's avatar
      xfs: move xfsagino_t to xfs_types.h · 60a34607
      Dave Chinner authored
      Untangle the header file includes a bit by moving the definition of
      xfs_agino_t to xfs_types.h. This removes the dependency that xfs_ag.h has on
      xfs_inum.h, meaning we don't need to include xfs_inum.h everywhere we include
      xfs_ag.h.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      60a34607
    • Dave Chinner's avatar
      xfs: use iolock on XFS_IOC_ALLOCSP calls · bc4010ec
      Dave Chinner authored
      fsstress has a particular effective way of stopping debug XFS
      kernels. We keep seeing assert failures due finding delayed
      allocation extents where there should be none. This shows up when
      extracting extent maps and we are holding all the locks we should be
      to prevent races, so this really makes no sense to see these errors.
      
      After checking that fsstress does not use mmap, it occurred to me
      that fsstress uses something that no sane application uses - the
      XFS_IOC_ALLOCSP ioctl interfaces for preallocation. These interfaces
      do allocation of blocks beyond EOF without using preallocation, and
      then call setattr to extend and zero the allocated blocks.
      
      THe problem here is this is a buffered write, and hence the
      allocation is a delayed allocation. Unlike the buffered IO path, the
      allocation and zeroing are not serialised using the IOLOCK. Hence
      the ALLOCSP operation can race with operations holding the iolock to
      prevent buffered IO operations from occurring.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      bc4010ec
    • Dave Chinner's avatar
      xfs: kill XBF_DONTBLOCK · aa5c158e
      Dave Chinner authored
      Just about all callers of xfs_buf_read() and xfs_buf_get() use XBF_DONTBLOCK.
      This is used to make memory allocation use GFP_NOFS rather than GFP_KERNEL to
      avoid recursion through memory reclaim back into the filesystem.
      
      All the blocking get calls in growfs occur inside a transaction, even though
      they are no part of the transaction, so all allocation will be GFP_NOFS due to
      the task flag PF_TRANS being set. The blocking read calls occur during log
      recovery, so they will probably be unaffected by converting to GFP_NOFS
      allocations.
      
      Hence make XBF_DONTBLOCK behaviour always occur for buffers and kill the flag.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      aa5c158e
    • Dave Chinner's avatar
      xfs: kill xfs_read_buf() · 7ca790a5
      Dave Chinner authored
      xfs_read_buf() is effectively the same as xfs_trans_read_buf() when called
      outside a transaction context. The error handling is slightly different in that
      xfs_read_buf stales the errored buffer it gets back, but there is probably good
      reason for xfs_trans_read_buf() for doing this.
      
      Hence update xfs_trans_read_buf() to the same error handling as xfs_read_buf(),
      and convert all the callers of xfs_read_buf() to use the former function. We can
      then remove xfs_read_buf().
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      7ca790a5
    • Dave Chinner's avatar
      xfs: kill XBF_LOCK · a8acad70
      Dave Chinner authored
      Buffers are always returned locked from the lookup routines. Hence
      we don't need to tell the lookup routines to return locked buffers,
      on to try and lock them. Remove XBF_LOCK from all the callers and
      from internal buffer cache usage.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      a8acad70
    • Dave Chinner's avatar
      xfs: kill xfs_buf_btoc · 795cac72
      Dave Chinner authored
      xfs_buf_btoc and friends are simple macros that do basic block
      to page index conversion and vice versa. These aren't widely used,
      and we use open coded masking and shifting everywhere else. Hence
      remove the macros and open code the work they do.
      
      Also, use of PAGE_CACHE_{SIZE|SHIFT|MASK} for these macros is now
      incorrect - we are using pages directly and not the page cache, so
      use PAGE_{SIZE|MASK|SHIFT} instead.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      795cac72
    • Dave Chinner's avatar
      xfs: use blocks for storing the desired IO size · aa0e8833
      Dave Chinner authored
      Now that we pass block counts everywhere, and index buffers by block
      number and length in units of blocks, convert the desired IO size
      into block counts rather than bytes. Convert the code to use block
      counts, and those that need byte counts get converted at the time of
      use.
      
      Rename the b_desired_count variable to something closer to it's
      purpose - b_io_length - as it is only used to specify the length of
      an IO for a subset of the buffer.  The only time this is used is for
      log IO - both writing iclogs and during log recovery. In all other
      cases, the b_io_length matches b_length, and hence a lot of code
      confuses the two. e.g. the buf item code uses the io count
      exclusively when it should be using the buffer length. Fix these
      apprpriately as they are found.
      
      Also, remove the XFS_BUF_{SET_}COUNT() macros that are just wrappers
      around the desired IO length. They only serve to make the code
      shouty loud, don't actually add any real value, and are often used
      incorrectly.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      aa0e8833
    • Dave Chinner's avatar
      xfs: use blocks for counting length of buffers · 4e94b71b
      Dave Chinner authored
      Now that we pass block counts everywhere, and index buffers by block
      number, track the length of the buffer in units of blocks rather
      than bytes. Convert the code to use block counts, and those that
      need byte counts get converted at the time of use.
      
      Also, remove the XFS_BUF_{SET_}SIZE() macros that are just wrappers
      around the buffer length. They only serve to make the code shouty
      loud and don't actually add any real value.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      4e94b71b
    • Dave Chinner's avatar
      xfs: kill b_file_offset · de1cbee4
      Dave Chinner authored
      Seeing as we pass block numbers around everywhere in the buffer
      cache now, it makes no sense to index everything by byte offset.
      Replace all the byte offset indexing with block number based
      indexing, and replace all uses of the byte offset with direct
      conversion from the block index.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      de1cbee4
    • Dave Chinner's avatar
      xfs: clean up buffer get/read call API · e70b73f8
      Dave Chinner authored
      The xfs_buf_get/read API is not consistent in the units it uses, and
      does not use appropriate or consistent units/types for the
      variables.
      
      Convert the API to use disk addresses and block counts for all
      buffer get and read calls. Use consistent naming for all the
      functions and their declarations, and convert the internal functions
      to use disk addresses and block counts to avoid need to convert them
      from one type to another and back again.
      
      Fix all the callers to use disk addresses and block counts. In many
      cases, this removes an additional conversion from the function call
      as the callers already have a block count.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      e70b73f8
    • Dave Chinner's avatar
      xfs: use kmem_zone_zalloc for buffers · bf813cdd
      Dave Chinner authored
      To replace the alloc/memset pair.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      bf813cdd
    • Dave Chinner's avatar
      xfs: fix incorrect b_offset initialisation · ead360c5
      Dave Chinner authored
      Because we no longer use the page cache for buffering, there is no
      direct block number to page offset relationship anymore.
      xfs_buf_get_pages is still setting up b_offset as if there was some
      relationship, and that is leading to incorrectly setting up
      *uncached* buffers that don't overwrite b_offset once they've had
      pages allocated.
      
      For cached buffers, the first block of the buffer is always at offset
      zero into the allocated memory. This is true for sub-page sized
      buffers, as well as for multiple-page buffers.
      
      For uncached buffers, b_offset is only non-zero when we are
      associating specific memory to the buffers, and that is set
      correctly by the code setting up the buffer.
      
      Hence remove the setting of b_offset in xfs_buf_get_pages, because
      it is now always the wrong thing to do.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      ead360c5
    • Dave Chinner's avatar
      xfs: check for buffer errors before waiting · 0e95f19a
      Dave Chinner authored
      If we call xfs_buf_iowait() on a buffer that failed dispatch due to
      an IO error, it will wait forever for an Io that does not exist.
      This is hndled in xfs_buf_read, but there is other code that calls
      xfs_buf_iowait directly that doesn't.
      
      Rather than make the call sites have to handle checking for dispatch
      errors and then checking for completion errors, make
      xfs_buf_iowait() check for dispatch errors on the buffer before
      waiting. This means we handle both dispatch and completion errors
      with one set of error handling at the caller sites.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      0e95f19a
    • Dave Chinner's avatar
      xfs: fix buffer lookup race on allocation failure · fe2429b0
      Dave Chinner authored
      When memory allocation fails to add the page array or tht epages to
      a buffer during xfs_buf_get(), the buffer is left in the cache in a
      partially initialised state. There is enough state left for the next
      lookup on that buffer to find the buffer, and for the buffer to then
      be used without finishing the initialisation.  As a result, when an
      attempt to do IO on the buffer occurs, it fails with EIO because
      there are no pages attached to the buffer.
      
      We cannot remove the buffer from the cache immediately and free it,
      because there may already be a racing lookup that is blocked on the
      buffer lock. Hence the moment we unlock the buffer to then free it,
      the other user is woken and we have a use-after-free situation.
      
      To avoid this race condition altogether, allocate the pages for the
      buffer before we insert it into the cache.  This then means that we
      don't have an allocation  failure case to deal after the buffer is
      already present in the cache, and hence avoid the problem
      altogether.  In most cases we won't have racing inserts for the same
      buffer, and so won't increase the memory pressure allocation before
      insertion may entail.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      fe2429b0
    • Dave Chinner's avatar
      xfs: Use preallocation for inodes with extsz hints · aff3a9ed
      Dave Chinner authored
      xfstest 229 exposes a problem with buffered IO, delayed allocation
      and extent size hints. That is when we do delayed allocation during
      buffered IO, we reserve space for the extent size hint alignment and
      allocate the physical space to align the extent, but we do not zero
      the regions of the extent that aren't written by the write(2)
      syscall. The result is that we expose stale data in unwritten
      regions of the extent size hints.
      
      There are two ways to fix this. The first is to detect that we are
      doing unaligned writes, check if there is already a mapping or data
      over the extent size hint range, and if not zero the page cache
      first before then doing the real write. This can be very expensive
      for large extent size hints, especially if the subsequent writes
      fill then entire extent size before the data is written to disk.
      
      The second, and simpler way, is simply to turn off delayed
      allocation when the extent size hint is set and use preallocation
      instead. This results in unwritten extents being laid down on disk
      and so only the written portions will be converted. This matches the
      behaviour for direct IO, and will also work for the real time
      device. The disadvantage of this approach is that for small extent
      size hints we can get file fragmentation, but in general extent size
      hints are fairly large (e.g. stripe width sized) so this isn't a big
      deal.
      
      Implement the second approach as it is simple and effective.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      aff3a9ed
    • Dave Chinner's avatar
      xfs: limit specualtive delalloc to maxioffset · 3ed9116e
      Dave Chinner authored
      Speculative delayed allocation beyond EOF near the maximum supported
      file offset can result in creating delalloc extents beyond
      mp->m_maxioffset (8EB). These can never be trimmed during
      xfs_free_eof_blocks() because they are beyond mp->m_maxioffset, and
      that results in assert failures in xfs_fs_destroy_inode() due to
      delalloc blocks still being present. xfstests 071 exposes this
      problem.
      
      Limit speculative delalloc to mp->m_maxioffset to avoid this
      problem.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      3ed9116e
    • Dave Chinner's avatar
      xfs: don't assert on delalloc regions beyond EOF · 58e20770
      Dave Chinner authored
      When we are doing speculative delayed allocation beyond EOF,
      conversion of the region allocated beyond EOF is dependent on the
      largest free space extent available. If the largest free extent is
      smaller than the delalloc range, then after allocation we leave
      a delalloc extent that starts beyond EOF. This extent cannot *ever*
      be converted by flushing data, and so will remain there until either
      the EOF moves into the extent or it is truncated away.
      
      Hence if xfs_getbmap() runs on such an inode and is asked to return
      extents beyond EOF, it will assert fail on this extent even though
      there is nothing xfs_getbmap() can do to convert it to a real
      extent. Hence we should simply report these delalloc extents rather
      than assert that there should be none.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      58e20770
    • Dave Chinner's avatar
      xfs: prevent needless mount warning causing test failures · 81158e0c
      Dave Chinner authored
      Often mounting small filesystem with small logs will emit a warning
      such as:
      
      XFS (vdb): Invalid block length (0x2000) for buffer
      
      during log recovery. This causes tests to randomly fail because this
      output causes the clean filesystem checks on test completion to
      think the filesystem is inconsistent.
      
      The cause of the error is simply that log recovery is asking for a
      buffer size that is larger than the log when zeroing the tail. This
      is because the buffer size is rounded up, and if the right head and
      tail conditions exist then the buffer size can be larger than the log.
      Limit the variable size xlog_get_bp() callers to requesting buffers
      smaller than the log.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      81158e0c
    • Dave Chinner's avatar
      xfs: punch new delalloc blocks out of failed writes inside EOF. · d3bc815a
      Dave Chinner authored
      When a partial write inside EOF fails, it can leave delayed
      allocation blocks lying around because they don't get punched back
      out. This leads to assert failures like:
      
      XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 847
      
      when evicting inodes from the cache. This can be trivially triggered
      by xfstests 083, which takes between 5 and 15 executions on a 512
      byte block size filesystem to trip over this. Debugging shows a
      failed write due to ENOSPC calling xfs_vm_write_failed such as:
      
      [ 5012.329024] ino 0xa0026: vwf to 0x17000, sze 0x1c85ae
      
      and no action is taken on it. This leaves behind a delayed
      allocation extent that has no page covering it and no data in it:
      
      [ 5015.867162] ino 0xa0026: blks: 0x83 delay blocks 0x1, size 0x2538c0
      [ 5015.868293] ext 0: off 0x4a, fsb 0x50306, len 0x1
      [ 5015.869095] ext 1: off 0x4b, fsb 0x7899, len 0x6b
      [ 5015.869900] ext 2: off 0xb6, fsb 0xffffffffe0008, len 0x1
                                          ^^^^^^^^^^^^^^^
      [ 5015.871027] ext 3: off 0x36e, fsb 0x7a27, len 0xd
      [ 5015.872206] ext 4: off 0x4cf, fsb 0x7a1d, len 0xa
      
      So the delayed allocation extent is one block long at offset
      0x16c00. Tracing shows that a bigger write:
      
      xfs_file_buffered_write: size 0x1c85ae offset 0x959d count 0x1ca3f ioflags
      
      allocates the block, and then fails with ENOSPC trying to allocate
      the last block on the page, leading to a failed write with stale
      delalloc blocks on it.
      
      Because we've had an ENOSPC when trying to allocate 0x16e00, it
      means that we are never goinge to call ->write_end on the page and
      so the allocated new buffer will not get marked dirty or have the
      buffer_new state cleared. In other works, what the above write is
      supposed to end up with is this mapping for the page:
      
          +------+------+------+------+------+------+------+------+
            UMA    UMA    UMA    UMA    UMA    UMA    UND    FAIL
      
      where:  U = uptodate
              M = mapped
              N = new
              A = allocated
              D = delalloc
              FAIL = block we ENOSPC'd on.
      
      and the key point being the buffer_new() state for the newly
      allocated delayed allocation block. Except it doesn't - we're not
      marking buffers new correctly.
      
      That buffer_new() problem goes back to the xfs_iomap removal days,
      where xfs_iomap() used to return a "new" status for any map with
      newly allocated blocks, so that __xfs_get_blocks() could call
      set_buffer_new() on it. We still have the "new" variable and the
      check for it in the set_buffer_new() logic - except we never set it
      now!
      
      Hence that newly allocated delalloc block doesn't have the new flag
      set on it, so when the write fails we cannot tell which blocks we
      are supposed to punch out. WHy do we need the buffer_new flag? Well,
      that's because we can have this case:
      
          +------+------+------+------+------+------+------+------+
            UMD    UMD    UMD    UMD    UMD    UMD    UND    FAIL
      
      where all the UMD buffers contain valid data from a previously
      successful write() system call. We only want to punch the UND buffer
      because that's the only one that we added in this write and it was
      only this write that failed.
      
      That implies that even the old buffer_new() logic was wrong -
      because it would result in all those UMD buffers on the page having
      set_buffer_new() called on them even though they aren't new. Hence
      we shoul donly be calling set_buffer_new() for delalloc buffers that
      were allocated (i.e. were a hole before xfs_iomap_write_delay() was
      called).
      
      So, fix this set_buffer_new logic according to how we need it to
      work for handling failed writes correctly. Also, restore the new
      buffer logic handling for blocks allocated via
      xfs_iomap_write_direct(), because it should still set the buffer_new
      flag appropriately for newly allocated blocks, too.
      
      SO, now we have the buffer_new() being set appropriately in
      __xfs_get_blocks(), we can detect the exact delalloc ranges that
      we allocated in a failed write, and hence can now do a walk of the
      buffers on a page to find them.
      
      Except, it's not that easy. When block_write_begin() fails, it
      unlocks and releases the page that we just had an error on, so we
      can't use that page to handle errors anymore. We have to get access
      to the page while it is still locked to walk the buffers. Hence we
      have to open code block_write_begin() in xfs_vm_write_begin() to be
      able to insert xfs_vm_write_failed() is the right place.
      
      With that, we can pass the page and write range to
      xfs_vm_write_failed() and walk the buffers on the page, looking for
      delalloc buffers that are either new or beyond EOF and punch them
      out. Handling buffers beyond EOF ensures we still handle the
      existing case that xfs_vm_write_failed() handles.
      
      Of special note is the truncate_pagecache() handling - that only
      should be done for pages outside EOF - pages within EOF can still
      contain valid, dirty data so we must not punch them out of the
      cache.
      
      That just leaves the xfs_vm_write_end() failure handling.
      The only failure case here is that we didn't copy the entire range,
      and generic_write_end() handles that by zeroing the region of the
      page that wasn't copied, we don't have to punch out blocks within
      the file because they are guaranteed to contain zeros. Hence we only
      have to handle the existing "beyond EOF" case and don't need access
      to the buffers on the page. Hence it remains largely unchanged.
      
      Note that xfs_getbmap() can still trip over delalloc blocks beyond
      EOF that are left there by speculative delayed allocation. Hence
      this bug fix does not solve all known issues with bmap vs delalloc,
      but it does fix all the the known accidental occurances of the
      problem.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      d3bc815a
    • Dave Chinner's avatar
      xfs: page type check in writeback only checks last buffer · 6ffc4db5
      Dave Chinner authored
      xfs_is_delayed_page() checks to see if a page has buffers matching
      the given IO type passed in. It does so by walking the buffer heads
      on the page and checking if the state flags match the IO type.
      
      However, the "acceptable" variable that is calculated is overwritten
      every time a new buffer is checked. Hence if the first buffer on the
      page is of the right type, this state is lost if the second buffer
      is not of the correct type. This means that xfs_aops_discard_page()
      may not discard delalloc regions when it is supposed to, and
      xfs_convert_page() may not cluster IO as efficiently as possible.
      
      This problem only occurs on filesystems with a block size smaller
      than page size.
      
      Also, rename xfs_is_delayed_page() to xfs_check_page_type() to
      better describe what it is doing - it is not delalloc specific
      anymore.
      
      The problem was first noticed by Peter Watkins.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      6ffc4db5
    • Dave Chinner's avatar
      xfs: Do background CIL flushes via a workqueue · 4c2d542f
      Dave Chinner authored
      Doing background CIL flushes adds significant latency to whatever
      async transaction that triggers it. To avoid blocking async
      transactions on things like waiting for log buffer IO to complete,
      move the CIL push off into a workqueue.  By moving the push work
      into a workqueue, we remove all the latency that the commit adds
      from the foreground transaction commit path. This also means that
      single threaded workloads won't do the CIL push procssing, leaving
      them more CPU to do more async transactions.
      
      To do this, we need to keep track of the sequence number we have
      pushed work for. This avoids having many transaction commits
      attempting to schedule work for the same sequence, and ensures that
      we only ever have one push (background or forced) in progress at a
      time. It also means that we don't need to take the CIL lock in write
      mode to check for potential background push races, which reduces
      lock contention.
      
      To avoid potential issues with "smart" IO schedulers, don't use the
      workqueue for log force triggered flushes. Instead, do them directly
      so that the log IO is done directly by the process issuing the log
      force and so doesn't get stuck on IO elevator queue idling
      incorrectly delaying the log IO from the workqueue.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      4c2d542f
    • Dave Chinner's avatar
      xfs: pass shutdown method into xfs_trans_ail_delete_bulk · 04913fdd
      Dave Chinner authored
      xfs_trans_ail_delete_bulk() can be called from different contexts so
      if the item is not in the AIL we need different shutdown for each
      context.  Pass in the shutdown method needed so the correct action
      can be taken.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      04913fdd
    • Christoph Hellwig's avatar
      a8569171
    • Christoph Hellwig's avatar
      xfs: on-stack delayed write buffer lists · 43ff2122
      Christoph Hellwig authored
      Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
      and write back the buffers per-process instead of by waking up xfsbufd.
      
      This is now easily doable given that we have very few places left that write
      delwri buffers:
      
       - log recovery:
      	Only done at mount time, and already forcing out the buffers
      	synchronously using xfs_flush_buftarg
      
       - quotacheck:
      	Same story.
      
       - dquot reclaim:
      	Writes out dirty dquots on the LRU under memory pressure.  We might
      	want to look into doing more of this via xfsaild, but it's already
      	more optimal than the synchronous inode reclaim that writes each
      	buffer synchronously.
      
       - xfsaild:
      	This is the main beneficiary of the change.  By keeping a local list
      	of buffers to write we reduce latency of writing out buffers, and
      	more importably we can remove all the delwri list promotions which
      	were hitting the buffer cache hard under sustained metadata loads.
      
      The implementation is very straight forward - xfs_buf_delwri_queue now gets
      a new list_head pointer that it adds the delwri buffers to, and all callers
      need to eventually submit the list using xfs_buf_delwi_submit or
      xfs_buf_delwi_submit_nowait.  Buffers that already are on a delwri list are
      skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
      list.  The biggest change to pass down the buffer list was done to the AIL
      pushing. Now that we operate on buffers the trylock, push and pushbuf log
      item methods are merged into a single push routine, which tries to lock the
      item, and if possible add the buffer that needs writeback to the buffer list.
      This leads to much simpler code than the previous split but requires the
      individual IOP_PUSH instances to unlock and reacquire the AIL around calls
      to blocking routines.
      
      Given that xfsailds now also handle writing out buffers, the conditions for
      log forcing and the sleep times needed some small changes.  The most
      important one is that we consider an AIL busy as long we still have buffers
      to push, and the other one is that we do increment the pushed LSN for
      buffers that are under flushing at this moment, but still count them towards
      the stuck items for restart purposes.  Without this we could hammer on stuck
      items without ever forcing the log and not make progress under heavy random
      delete workloads on fast flash storage devices.
      
      [ Dave Chinner:
      	- rebase on previous patches.
      	- improved comments for XBF_DELWRI_Q handling
      	- fix XBF_ASYNC handling in queue submission (test 106 failure)
      	- rename delwri submit function buffer list parameters for clarity
      	- xfs_efd_item_push() should return XFS_ITEM_PINNED ]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      43ff2122
    • Christoph Hellwig's avatar
      xfs: do not add buffers to the delwri queue until pushed · 960c60af
      Christoph Hellwig authored
      Instead of adding buffers to the delwri list as soon as they are logged,
      even if they can't be written until commited because they are pinned
      defer adding them to the delwri list until xfsaild pushes them.  This
      makes the code more similar to other log items and prepares for writing
      buffers directly from xfsaild.
      
      The complication here is that we need to fail buffers that were added
      but not logged yet in xfs_buf_item_unpin, borrowing code from
      xfs_bioerror.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      960c60af
    • Christoph Hellwig's avatar
      xfs: do not write the buffer from xfs_qm_dqflush · fe7257fd
      Christoph Hellwig authored
      Instead of writing the buffer directly from inside xfs_qm_dqflush return it
      to the caller and let the caller decide what to do with the buffer.  Also
      remove the pincount check in xfs_qm_dqflush that all non-blocking callers
      already implement and the now unused flags parameter and the XFS_DQ_IS_DIRTY
      check that all callers already perform.
      
      [ Dave Chinner: fixed build error cause by missing '{'. ]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      fe7257fd
    • Christoph Hellwig's avatar
      xfs: do not write the buffer from xfs_iflush · 4c46819a
      Christoph Hellwig authored
      Instead of writing the buffer directly from inside xfs_iflush return it to
      the caller and let the caller decide what to do with the buffer.  Also
      remove the pincount check in xfs_iflush that all non-blocking callers already
      implement and the now unused flags parameter.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      4c46819a
    • Christoph Hellwig's avatar
      xfs: don't flush inodes from background inode reclaim · 8a48088f
      Christoph Hellwig authored
      We already flush dirty inodes throug the AIL regularly, there is no reason
      to have second thread compete with it and disturb the I/O pattern.  We still
      do write inodes when doing a synchronous reclaim from the shrinker or during
      unmount for now.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      8a48088f
    • Christoph Hellwig's avatar
      xfs: implement freezing by emptying the AIL · 211e4d43
      Christoph Hellwig authored
      Now that we write back all metadata either synchronously or through
      the AIL we can simply implement metadata freezing in terms of
      emptying the AIL.
      
      The implementation for this is fairly simply and straight-forward:
      A new routine is added that asks the xfsaild to push the AIL to the
      end and waits for it to complete and send a wakeup. The routine will
      then loop if the AIL is not actually empty, and continue to do so
      until the AIL is compeltely empty.
      
      We keep an inode reclaim pass in the freeze process to avoid having
      memory pressure have to reclaim inodes that require dirtying the
      filesystem to be reclaimed after the freeze has completed. This
      means we can also treat unmount in the exact same way as freeze.
      
      As an upside we can now remove the radix tree based inode writeback
      and xfs_unmountfs_writesb.
      
      [ Dave Chinner:
      	- Cleaned up commit message.
      	- Added inode reclaim passes back into freeze.
      	- Cleaned up wakeup mechanism to avoid the use of a new
      	  sleep counter variable. ]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      211e4d43