1. 01 Oct, 2014 12 commits
    • Dave Chinner's avatar
      75e58ce4
    • Christoph Hellwig's avatar
      xfs: simplify xfs_zero_remaining_bytes · 8c156125
      Christoph Hellwig authored
      xfs_zero_remaining_bytes() open codes a log of buffer manupulations
      to do a read forllowed by a write. It can simply be replaced by an
      uncached read followed by a xfs_bwrite() call.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      8c156125
    • Dave Chinner's avatar
      xfs: check xfs_buf_read_uncached returns correctly · ba372674
      Dave Chinner authored
      xfs_buf_read_uncached() has two failure modes. If can either return
      NULL or bp->b_error != 0 depending on the type of failure, and not
      all callers check for both. Fix it so that xfs_buf_read_uncached()
      always returns the error status, and the buffer is returned as a
      function parameter. The buffer will only be returned on success.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      ba372674
    • Dave Chinner's avatar
      xfs: introduce xfs_buf_submit[_wait] · 595bff75
      Dave Chinner authored
      There is a lot of cookie-cutter code that looks like:
      
      	if (shutdown)
      		handle buffer error
      	xfs_buf_iorequest(bp)
      	error = xfs_buf_iowait(bp)
      	if (error)
      		handle buffer error
      
      spread through XFS. There's significant complexity now in
      xfs_buf_iorequest() to specifically handle this sort of synchronous
      IO pattern, but there's all sorts of nasty surprises in different
      error handling code dependent on who owns the buffer references and
      the locks.
      
      Pull this pattern into a single helper, where we can hide all the
      synchronous IO warts and hence make the error handling for all the
      callers much saner. This removes the need for a special extra
      reference to protect IO completion processing, as we can now hold a
      single reference across dispatch and waiting, simplifying the sync
      IO smeantics and error handling.
      
      In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
      make it explicitly handle on asynchronous IO. This forces all users
      to be switched specifically to one interface or the other and
      removes any ambiguity between how the interfaces are to be used. It
      also means that xfs_buf_iowait() goes away.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      595bff75
    • Dave Chinner's avatar
      xfs: kill xfs_bioerror_relse · 8b131973
      Dave Chinner authored
      There is only one caller now - xfs_trans_read_buf_map() - and it has
      very well defined call semantics - read, synchronous, and b_iodone
      is NULL. Hence it's pretty clear what error handling is necessary
      for this case. The bigger problem of untangling
      xfs_trans_read_buf_map error handling is left to a future patch.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      8b131973
    • Dave Chinner's avatar
      xfs: xfs_bioerror can die. · 27187754
      Dave Chinner authored
      Internal buffer write error handling is a mess due to the unnatural
      split between xfs_bioerror and xfs_bioerror_relse().
      
      xfs_bwrite() only does sync IO and determines the handler to
      call based on b_iodone, so for this caller the only difference
      between xfs_bioerror() and xfs_bioerror_release() is the XBF_DONE
      flag. We don't care what the XBF_DONE flag state is because we stale
      the buffer in both paths - the next buffer lookup will clear
      XBF_DONE because XBF_STALE is set. Hence we can use common
      error handling for xfs_bwrite().
      
      __xfs_buf_delwri_submit() is a similar - it's only ever called
      on writes - all sync or async - and again there's no reason to
      handle them any differently at all.
      
      Clean up the nasty error handling and remove xfs_bioerror().
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      27187754
    • Dave Chinner's avatar
      xfs: kill xfs_bdstrat_cb · 8dac3921
      Dave Chinner authored
      Only has two callers, and is just a shutdown check and error handler
      around xfs_buf_iorequest. However, the error handling is a mess of
      read and write semantics, and both internal callers only call it for
      writes. Hence kill the wrapper, and follow up with a patch to
      sanitise the error handling.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      8dac3921
    • Dave Chinner's avatar
      xfs: rework xfs_buf_bio_endio error handling · 61be9c52
      Dave Chinner authored
      Currently the report of a bio error from completion
      immediately marks the buffer with an error. The issue is that this
      is racy w.r.t. synchronous IO - the submitter can see b_error being
      set before the IO is complete, and hence we cannot differentiate
      between submission failures and completion failures.
      
      Add an internal b_io_error field protected by the b_lock to catch IO
      completion errors, and only propagate that to the buffer during
      final IO completion handling. Hence we can tell in xfs_buf_iorequest
      if we've had a submission failure bey checking bp->b_error before
      dropping our b_io_remaining reference - that reference will prevent
      b_io_error values from being propagated to b_error in the event that
      completion races with submission.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      61be9c52
    • Dave Chinner's avatar
      xfs: xfs_buf_ioend and xfs_buf_iodone_work duplicate functionality · e8aaba9a
      Dave Chinner authored
      We do some work in xfs_buf_ioend, and some work in
      xfs_buf_iodone_work, but much of that functionality is the same.
      This work can all be done in a single function, leaving
      xfs_buf_iodone just a wrapper to determine if we should execute it
      by workqueue or directly. hence rename xfs_buf_iodone_work to
      xfs_buf_ioend(), and add a new xfs_buf_ioend_async() for places that
      need async processing.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      e8aaba9a
    • Dave Chinner's avatar
      xfs: synchronous buffer IO needs a reference · e11bb805
      Dave Chinner authored
      When synchronous IO runs IO completion work, it does so without an
      IO reference or a hold reference on the buffer. The IO "hold
      reference" is owned by the submitter, and released when the
      submission is complete. The IO reference is released when both the
      submitter and the bio end_io processing is run, and so if the io
      completion work is run from IO completion context, it is run without
      an IO reference.
      
      Hence we can get the situation where the submitter can submit the
      IO, see an error on the buffer and unlock and free the buffer while
      there is still IO in progress. This leads to use-after-free and
      memory corruption.
      
      Fix this by taking a "sync IO hold" reference that is owned by the
      IO and not released until after the buffer completion calls are run
      to wake up synchronous waiters. This means that the buffer will not
      be freed in any circumstance until all IO processing is completed.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      e11bb805
    • Dave Chinner's avatar
      xfs: Don't use xfs_buf_iowait in the delwri buffer code · cf53e99d
      Dave Chinner authored
      For the special case of delwri buffer submission and waiting, we
      don't need to issue IO synchronously at all. The second pass to call
      xfs_buf_iowait() can be replaced with  blocking on xfs_buf_lock() -
      the buffer will be unlocked when the async IO is complete.
      
      This formalises a sane the method of waiting for async IO - take an
      extra reference, submit the IO, call xfs_buf_lock() when you want to
      wait for IO completion. i.e.:
      
      	bp = xfs_buf_find();
      	xfs_buf_hold(bp);
      	bp->b_flags |= XBF_ASYNC;
      	xfs_buf_iosubmit(bp);
      	xfs_buf_lock(bp)
      	error = bp->b_error;
      	....
      	xfs_buf_relse(bp);
      
      While this is somewhat racy for gathering IO errors, none of the
      code that calls xfs_buf_delwri_submit() will race against other
      users of the buffers being submitted. Even if they do, we don't
      really care if the error is detected by the delwri code or the user
      we raced against. Either way, the error will be detected and
      handled.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      cf53e99d
    • Dave Chinner's avatar
      xfs: force the log before shutting down · a870fe6d
      Dave Chinner authored
      When we have marked the filesystem for shutdown, we want to prevent
      any further buffer IO from being submitted. However, we currently
      force the log after marking the filesystem as shut down, hence
      allowing IO to the log *after* we have marked both the filesystem
      and the log as in an error state.
      
      Clean this up by forcing the log before we mark the filesytem with
      an error. This replaces the pure CIL flush that we currently have
      which works around this same issue (i.e the CIL can't be flushed
      once the shutdown flags are set) and hence enables us to clean up
      the logic substantially.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      a870fe6d
  2. 29 Sep, 2014 6 commits
  3. 28 Sep, 2014 5 commits
  4. 23 Sep, 2014 15 commits
    • Dave Chinner's avatar
      33044dc4
    • Dave Chinner's avatar
      xfs: flush entire last page of old EOF on truncate up · 2ebff7bb
      Dave Chinner authored
      On a sub-page sized filesystem, truncating a mapped region down
      leaves us in a world of hurt. We truncate the pagecache, zeroing the
      newly unused tail, then punch blocks out from under the page. If we
      then truncate the file back up immediately, we expose that unmapped
      hole to a dirty page mapped into the user application, and that's
      where it all goes wrong.
      
      In truncating the page cache, we avoid unmapping the tail page of
      the cache because it still contains valid data. The problem is that
      it also contains a hole after the truncate, but nobody told the mm
      subsystem that. Therefore, if the page is dirty before the truncate,
      we'll never get a .page_mkwrite callout after we extend the file and
      the application writes data into the hole on the page.  Hence when
      we come to writing that region of the page, it has no blocks and no
      delayed allocation reservation and hence we toss the data away.
      
      This patch adds code to the truncate up case to solve it, by
      ensuring the partial page at the old EOF is always cleaned after we
      do any zeroing and move the EOF upwards. We can't actually serialise
      the page writeback and truncate against page faults (yes, that
      problem AGAIN) so this is really just a best effort and assumes it
      is extremely unlikely that someone is concurrently writing to the
      page at the EOF while extending the file.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      2ebff7bb
    • Dave Chinner's avatar
      xfs: xfs_swap_extent_flush can be static · 7abbb8f9
      Dave Chinner authored
      Fix sparse warning introduced by commit 4ef897a2 ("xfs: flush both
      inodes in xfs_swap_extents").
      Signed-off-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      7abbb8f9
    • Dave Chinner's avatar
      xfs: xfs_buf_write_fail_rl_state can be static · 02cc1876
      Dave Chinner authored
      Fix sparse warning introduced by commit ac8809f9 ("xfs: abort
      metadata writeback on permanent errors").
      Signed-off-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      02cc1876
    • Fengguang Wu's avatar
      xfs: xfs_rtget_summary can be static · ea95961d
      Fengguang Wu authored
      Fix sparse warning introduced by commit afabfd30 ("xfs: combine
      xfs_rtmodify_summary and xfs_rtget_summary").
      Signed-off-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      ea95961d
    • Fabian Frederick's avatar
      xfs: remove second xfs_quota.h inclusion in xfs_icache.c · e3cf1796
      Fabian Frederick authored
      xfs_quota.h was included twice.
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      e3cf1796
    • Eric Sandeen's avatar
      xfs: don't ASSERT on corrupt ftype · fb040131
      Eric Sandeen authored
      xfs_dir3_data_get_ftype() gets the file type off disk, but ASSERTs
      if it's invalid:
      
           ASSERT(type < XFS_DIR3_FT_MAX);
      
      We shouldn't ASSERT on bad values read from disk.  V3 dirs are
      CRC-protected, but V2 dirs + ftype are not.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      fb040131
    • Dave Chinner's avatar
      xfs: xlog_cil_force_lsn doesn't always wait correctly · 8af3dcd3
      Dave Chinner authored
      When running a tight mount/unmount loop on an older kernel, RedHat
      QE found that unmount would occasionally hang in
      xfs_buf_unpin_wait() on the superblock buffer. Tracing and other
      debug work by Eric Sandeen indicated that it was hanging on the
      writing of the superblock during unmount immediately after logging
      the superblock counters in a synchronous transaction. Further debug
      indicated that the synchronous transaction was not waiting for
      completion correctly, and we narrowed it down to
      xlog_cil_force_lsn() returning NULLCOMMITLSN and hence not pushing
      the transaction in the iclog buffer to disk correctly.
      
      While this unmount superblock write code is now very different in
      mainline kernels, the xlog_cil_force_lsn() code is identical, and it
      was bisected to the backport of commit f876e446 ("xfs: always do log
      forces via the workqueue"). This commit made the CIL push
      asynchronous for log forces and hence exposed a race condition that
      couldn't occur on a synchronous push.
      
      Essentially, the xlog_cil_force_lsn() relied implicitly on the fact
      that the sequence push would be complete by the time
      xlog_cil_push_now() returned, resulting in the context being pushed
      being in the committing list. When it was made asynchronous, it was
      recognised that there was a race condition in detecting whether an
      asynchronous push has started or not and code was added to handle
      it.
      
      Unfortunately, the fix was not quite right and left a race condition
      where it it would detect an empty CIL while a push was in progress
      before the context had been added to the committing list. This was
      incorrectly seen as a "nothing to do" condition and so would tell
      xfs_log_force_lsn() that there is nothing to wait for, and hence it
      would push the iclogbufs in memory.
      
      The fix is simple, but explaining the logic and the race condition
      is a lot more complex. The fix is to add the context to the
      committing list before we start emptying the CIL. This allows us to
      detect the difference between an empty "do nothing" push and a push
      that has not started by adding a discrete "emptying the CIL" state
      to avoid the transient, incorrect "empty" condition that the
      (unchanged) waiting code was seeing.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      8af3dcd3
    • Dave Chinner's avatar
      f6d31f4b
    • Brian Foster's avatar
      xfs: only writeback and truncate pages for the freed range · 8b5279e3
      Brian Foster authored
      xfs_free_file_space() only affects the range of the file for which space
      is being freed. It currently writes and truncates the page cache from
      the start offset of the free to EOF.
      
      Modify xfs_free_file_space() to write back and truncate page cache of
      just the range being freed.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      8b5279e3
    • Brian Foster's avatar
      xfs: writeback and inval. file range to be shifted by collapse · f71721d0
      Brian Foster authored
      The collapse range operation currently writes the entire file before
      starting the collapse to avoid changes in the in-core extent list due to
      writeback causing the extent count to change. Now that collapse range is
      fsb based rather than extent index based it can sustain changes in the
      extent list during the shift sequence without disruption.
      
      Modify xfs_collapse_file_space() to writeback and invalidate pages
      associated with the range of the file to be shifted.
      xfs_free_file_space() currently has similar behavior, but the space free
      need only affect the region of the file that is freed and this could
      change in the future.
      
      Also update the comments to reflect the current implementation. We
      retain the eofblocks trim permanently as a best option for dealing with
      delalloc extents. We don't shift delalloc extents because this scenario
      only occurs with post-eof preallocation (since data must be flushed such
      that the cache can be invalidated and data can be shifted). That means
      said space must also be initialized before being shifted into the
      accessible region of the file only to be immediately truncated off as
      the last part of the collapse. In other words, the eofblocks trim will
      happen anyways, we just run it first to ensure the file remains in a
      consistent state throughout the collapse.
      
      Finally, detect and fail explicitly in the event of a delalloc extent
      during the extent shift. The implementation does not support delalloc
      extents and the caller is expected to prevent this scenario in advance
      as is done by collapse.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      f71721d0
    • Brian Foster's avatar
      xfs: refactor single extent shift into xfs_bmse_shift_one() helper · a979bdfe
      Brian Foster authored
      xfs_bmap_shift_extents() has a variety of conditions and error checks
      that make the logic difficult to follow and indent heavy. Refactor the
      loop body of this function into a new xfs_bmse_shift_one() helper. This
      simplifies the error checks, eliminates index decrement on merge hack by
      pushing the index increment down into the helper, and makes the code
      more readable by reducing multiple levels of indentation.
      
      This is a code refactor only. The behavior of extent shift and collapse
      range is not modified.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      a979bdfe
    • Brian Foster's avatar
      xfs: refactor shift-by-merge into xfs_bmse_merge() helper · ddb19e31
      Brian Foster authored
      The extent shift mechanism in xfs_bmap_shift_extents() is complicated
      and handles several different, non-deterministic scenarios. These
      include extent shifts, extent merges and potential btree updates in
      either of the former scenarios.
      
      Refactor the code to be more linear and readable. The loop logic in
      xfs_bmap_shift_extents() and some initial error checking is adjusted
      slightly. The associated btree lookup and update/delete operations are
      condensed into single blocks of code. This reduces the number of
      btree-specific blocks and facilitates the separation of the merge
      operation into a new xfs_bmse_merge() and xfs_bmse_can_merge() helpers.
      
      This is a code refactor only. The behavior of extent shift and collapse
      range is not modified.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      ddb19e31
    • Brian Foster's avatar
      xfs: track collapse via file offset rather than extent index · 2c845f5a
      Brian Foster authored
      The collapse range implementation uses a transaction per extent shift.
      The progress of the overall operation is tracked via the current extent
      index of the in-core extent list. This is racy because the ilock must be
      dropped and reacquired for each transaction according to locking and log
      reservation rules. Therefore, writeback to prior regions of the file is
      possible and can change the extent count. This changes the extent to
      which the current index refers and causes the collapse to fail mid
      operation. To avoid this problem, the entire file is currently written
      back before the collapse operation starts.
      
      To eliminate the need to flush the entire file, use the file offset
      (fsb) to track the progress of the overall extent shift operation rather
      than the extent index. Modify xfs_bmap_shift_extents() to
      unconditionally convert the start_fsb parameter to an extent index and
      return the file offset of the extent where the shift left off, if
      further extents exist. The bulk of ths function can remain based on
      extent index as ilock is held by the caller. xfs_collapse_file_space()
      now uses the fsb output as the starting point for the subsequent shift.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      2c845f5a
    • Dave Chinner's avatar
      xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly · 0d085a52
      Dave Chinner authored
      XFS has been having trouble with stray delayed allocation extents
      beyond EOF for a long time. Recent changes to the collapse range
      code has triggered erroneous EBUSY errors on page invalidtion for
      block size smaller than page size filesystems. These
      have been caused by dirty buffers beyond EOF on a partial page which
      do not get written to disk during a sync.
      
      The issue is that write-ahead in xfs_cluster_write() finds such a
      partial page and handles it by leaving the page dirty but pushing it
      into a writeback state. This used to work just fine, as the
      write_cache_pages() code would then find the dirty partial page in
      the next mapping tree lookup as the dirty tag is still set.
      
      Unfortunately, when we moved to a mark and sweep approach to
      writeback to fix other writeback sync issues, we broken this. THe
      act of marking the page as under writeback now clears the TOWRITE
      tag in the radix tree, even though the page is still dirty. This
      causes the TOWRITE tag to be cleared, and hence the next lookup on
      the mapping tree does not find the dirty partial page and so doesn't
      try to write it again.
      
      This same writeback bug was found recently in ext4 and fixed in
      commit 1c8349a1 ("ext4: fix data integrity sync in ordered mode")
      without communication to the wider filesystem community. We can use
      exactly the same fix here so the TOWRITE flag is not cleared on
      partial page writes.
      
      cc: stable@vger.kernel.org # dependent on 1c8349a1Root-cause-found-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      0d085a52
  5. 09 Sep, 2014 2 commits