1. 14 May, 2012 22 commits
    • Dave Chinner's avatar
      xfs: prevent needless mount warning causing test failures · 81158e0c
      Dave Chinner authored
      Often mounting small filesystem with small logs will emit a warning
      such as:
      
      XFS (vdb): Invalid block length (0x2000) for buffer
      
      during log recovery. This causes tests to randomly fail because this
      output causes the clean filesystem checks on test completion to
      think the filesystem is inconsistent.
      
      The cause of the error is simply that log recovery is asking for a
      buffer size that is larger than the log when zeroing the tail. This
      is because the buffer size is rounded up, and if the right head and
      tail conditions exist then the buffer size can be larger than the log.
      Limit the variable size xlog_get_bp() callers to requesting buffers
      smaller than the log.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      81158e0c
    • Dave Chinner's avatar
      xfs: punch new delalloc blocks out of failed writes inside EOF. · d3bc815a
      Dave Chinner authored
      When a partial write inside EOF fails, it can leave delayed
      allocation blocks lying around because they don't get punched back
      out. This leads to assert failures like:
      
      XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 847
      
      when evicting inodes from the cache. This can be trivially triggered
      by xfstests 083, which takes between 5 and 15 executions on a 512
      byte block size filesystem to trip over this. Debugging shows a
      failed write due to ENOSPC calling xfs_vm_write_failed such as:
      
      [ 5012.329024] ino 0xa0026: vwf to 0x17000, sze 0x1c85ae
      
      and no action is taken on it. This leaves behind a delayed
      allocation extent that has no page covering it and no data in it:
      
      [ 5015.867162] ino 0xa0026: blks: 0x83 delay blocks 0x1, size 0x2538c0
      [ 5015.868293] ext 0: off 0x4a, fsb 0x50306, len 0x1
      [ 5015.869095] ext 1: off 0x4b, fsb 0x7899, len 0x6b
      [ 5015.869900] ext 2: off 0xb6, fsb 0xffffffffe0008, len 0x1
                                          ^^^^^^^^^^^^^^^
      [ 5015.871027] ext 3: off 0x36e, fsb 0x7a27, len 0xd
      [ 5015.872206] ext 4: off 0x4cf, fsb 0x7a1d, len 0xa
      
      So the delayed allocation extent is one block long at offset
      0x16c00. Tracing shows that a bigger write:
      
      xfs_file_buffered_write: size 0x1c85ae offset 0x959d count 0x1ca3f ioflags
      
      allocates the block, and then fails with ENOSPC trying to allocate
      the last block on the page, leading to a failed write with stale
      delalloc blocks on it.
      
      Because we've had an ENOSPC when trying to allocate 0x16e00, it
      means that we are never goinge to call ->write_end on the page and
      so the allocated new buffer will not get marked dirty or have the
      buffer_new state cleared. In other works, what the above write is
      supposed to end up with is this mapping for the page:
      
          +------+------+------+------+------+------+------+------+
            UMA    UMA    UMA    UMA    UMA    UMA    UND    FAIL
      
      where:  U = uptodate
              M = mapped
              N = new
              A = allocated
              D = delalloc
              FAIL = block we ENOSPC'd on.
      
      and the key point being the buffer_new() state for the newly
      allocated delayed allocation block. Except it doesn't - we're not
      marking buffers new correctly.
      
      That buffer_new() problem goes back to the xfs_iomap removal days,
      where xfs_iomap() used to return a "new" status for any map with
      newly allocated blocks, so that __xfs_get_blocks() could call
      set_buffer_new() on it. We still have the "new" variable and the
      check for it in the set_buffer_new() logic - except we never set it
      now!
      
      Hence that newly allocated delalloc block doesn't have the new flag
      set on it, so when the write fails we cannot tell which blocks we
      are supposed to punch out. WHy do we need the buffer_new flag? Well,
      that's because we can have this case:
      
          +------+------+------+------+------+------+------+------+
            UMD    UMD    UMD    UMD    UMD    UMD    UND    FAIL
      
      where all the UMD buffers contain valid data from a previously
      successful write() system call. We only want to punch the UND buffer
      because that's the only one that we added in this write and it was
      only this write that failed.
      
      That implies that even the old buffer_new() logic was wrong -
      because it would result in all those UMD buffers on the page having
      set_buffer_new() called on them even though they aren't new. Hence
      we shoul donly be calling set_buffer_new() for delalloc buffers that
      were allocated (i.e. were a hole before xfs_iomap_write_delay() was
      called).
      
      So, fix this set_buffer_new logic according to how we need it to
      work for handling failed writes correctly. Also, restore the new
      buffer logic handling for blocks allocated via
      xfs_iomap_write_direct(), because it should still set the buffer_new
      flag appropriately for newly allocated blocks, too.
      
      SO, now we have the buffer_new() being set appropriately in
      __xfs_get_blocks(), we can detect the exact delalloc ranges that
      we allocated in a failed write, and hence can now do a walk of the
      buffers on a page to find them.
      
      Except, it's not that easy. When block_write_begin() fails, it
      unlocks and releases the page that we just had an error on, so we
      can't use that page to handle errors anymore. We have to get access
      to the page while it is still locked to walk the buffers. Hence we
      have to open code block_write_begin() in xfs_vm_write_begin() to be
      able to insert xfs_vm_write_failed() is the right place.
      
      With that, we can pass the page and write range to
      xfs_vm_write_failed() and walk the buffers on the page, looking for
      delalloc buffers that are either new or beyond EOF and punch them
      out. Handling buffers beyond EOF ensures we still handle the
      existing case that xfs_vm_write_failed() handles.
      
      Of special note is the truncate_pagecache() handling - that only
      should be done for pages outside EOF - pages within EOF can still
      contain valid, dirty data so we must not punch them out of the
      cache.
      
      That just leaves the xfs_vm_write_end() failure handling.
      The only failure case here is that we didn't copy the entire range,
      and generic_write_end() handles that by zeroing the region of the
      page that wasn't copied, we don't have to punch out blocks within
      the file because they are guaranteed to contain zeros. Hence we only
      have to handle the existing "beyond EOF" case and don't need access
      to the buffers on the page. Hence it remains largely unchanged.
      
      Note that xfs_getbmap() can still trip over delalloc blocks beyond
      EOF that are left there by speculative delayed allocation. Hence
      this bug fix does not solve all known issues with bmap vs delalloc,
      but it does fix all the the known accidental occurances of the
      problem.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      d3bc815a
    • Dave Chinner's avatar
      xfs: page type check in writeback only checks last buffer · 6ffc4db5
      Dave Chinner authored
      xfs_is_delayed_page() checks to see if a page has buffers matching
      the given IO type passed in. It does so by walking the buffer heads
      on the page and checking if the state flags match the IO type.
      
      However, the "acceptable" variable that is calculated is overwritten
      every time a new buffer is checked. Hence if the first buffer on the
      page is of the right type, this state is lost if the second buffer
      is not of the correct type. This means that xfs_aops_discard_page()
      may not discard delalloc regions when it is supposed to, and
      xfs_convert_page() may not cluster IO as efficiently as possible.
      
      This problem only occurs on filesystems with a block size smaller
      than page size.
      
      Also, rename xfs_is_delayed_page() to xfs_check_page_type() to
      better describe what it is doing - it is not delalloc specific
      anymore.
      
      The problem was first noticed by Peter Watkins.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      6ffc4db5
    • Dave Chinner's avatar
      xfs: Do background CIL flushes via a workqueue · 4c2d542f
      Dave Chinner authored
      Doing background CIL flushes adds significant latency to whatever
      async transaction that triggers it. To avoid blocking async
      transactions on things like waiting for log buffer IO to complete,
      move the CIL push off into a workqueue.  By moving the push work
      into a workqueue, we remove all the latency that the commit adds
      from the foreground transaction commit path. This also means that
      single threaded workloads won't do the CIL push procssing, leaving
      them more CPU to do more async transactions.
      
      To do this, we need to keep track of the sequence number we have
      pushed work for. This avoids having many transaction commits
      attempting to schedule work for the same sequence, and ensures that
      we only ever have one push (background or forced) in progress at a
      time. It also means that we don't need to take the CIL lock in write
      mode to check for potential background push races, which reduces
      lock contention.
      
      To avoid potential issues with "smart" IO schedulers, don't use the
      workqueue for log force triggered flushes. Instead, do them directly
      so that the log IO is done directly by the process issuing the log
      force and so doesn't get stuck on IO elevator queue idling
      incorrectly delaying the log IO from the workqueue.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      4c2d542f
    • Dave Chinner's avatar
      xfs: pass shutdown method into xfs_trans_ail_delete_bulk · 04913fdd
      Dave Chinner authored
      xfs_trans_ail_delete_bulk() can be called from different contexts so
      if the item is not in the AIL we need different shutdown for each
      context.  Pass in the shutdown method needed so the correct action
      can be taken.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      04913fdd
    • Christoph Hellwig's avatar
      a8569171
    • Christoph Hellwig's avatar
      xfs: on-stack delayed write buffer lists · 43ff2122
      Christoph Hellwig authored
      Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
      and write back the buffers per-process instead of by waking up xfsbufd.
      
      This is now easily doable given that we have very few places left that write
      delwri buffers:
      
       - log recovery:
      	Only done at mount time, and already forcing out the buffers
      	synchronously using xfs_flush_buftarg
      
       - quotacheck:
      	Same story.
      
       - dquot reclaim:
      	Writes out dirty dquots on the LRU under memory pressure.  We might
      	want to look into doing more of this via xfsaild, but it's already
      	more optimal than the synchronous inode reclaim that writes each
      	buffer synchronously.
      
       - xfsaild:
      	This is the main beneficiary of the change.  By keeping a local list
      	of buffers to write we reduce latency of writing out buffers, and
      	more importably we can remove all the delwri list promotions which
      	were hitting the buffer cache hard under sustained metadata loads.
      
      The implementation is very straight forward - xfs_buf_delwri_queue now gets
      a new list_head pointer that it adds the delwri buffers to, and all callers
      need to eventually submit the list using xfs_buf_delwi_submit or
      xfs_buf_delwi_submit_nowait.  Buffers that already are on a delwri list are
      skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
      list.  The biggest change to pass down the buffer list was done to the AIL
      pushing. Now that we operate on buffers the trylock, push and pushbuf log
      item methods are merged into a single push routine, which tries to lock the
      item, and if possible add the buffer that needs writeback to the buffer list.
      This leads to much simpler code than the previous split but requires the
      individual IOP_PUSH instances to unlock and reacquire the AIL around calls
      to blocking routines.
      
      Given that xfsailds now also handle writing out buffers, the conditions for
      log forcing and the sleep times needed some small changes.  The most
      important one is that we consider an AIL busy as long we still have buffers
      to push, and the other one is that we do increment the pushed LSN for
      buffers that are under flushing at this moment, but still count them towards
      the stuck items for restart purposes.  Without this we could hammer on stuck
      items without ever forcing the log and not make progress under heavy random
      delete workloads on fast flash storage devices.
      
      [ Dave Chinner:
      	- rebase on previous patches.
      	- improved comments for XBF_DELWRI_Q handling
      	- fix XBF_ASYNC handling in queue submission (test 106 failure)
      	- rename delwri submit function buffer list parameters for clarity
      	- xfs_efd_item_push() should return XFS_ITEM_PINNED ]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      43ff2122
    • Christoph Hellwig's avatar
      xfs: do not add buffers to the delwri queue until pushed · 960c60af
      Christoph Hellwig authored
      Instead of adding buffers to the delwri list as soon as they are logged,
      even if they can't be written until commited because they are pinned
      defer adding them to the delwri list until xfsaild pushes them.  This
      makes the code more similar to other log items and prepares for writing
      buffers directly from xfsaild.
      
      The complication here is that we need to fail buffers that were added
      but not logged yet in xfs_buf_item_unpin, borrowing code from
      xfs_bioerror.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      960c60af
    • Christoph Hellwig's avatar
      xfs: do not write the buffer from xfs_qm_dqflush · fe7257fd
      Christoph Hellwig authored
      Instead of writing the buffer directly from inside xfs_qm_dqflush return it
      to the caller and let the caller decide what to do with the buffer.  Also
      remove the pincount check in xfs_qm_dqflush that all non-blocking callers
      already implement and the now unused flags parameter and the XFS_DQ_IS_DIRTY
      check that all callers already perform.
      
      [ Dave Chinner: fixed build error cause by missing '{'. ]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      fe7257fd
    • Christoph Hellwig's avatar
      xfs: do not write the buffer from xfs_iflush · 4c46819a
      Christoph Hellwig authored
      Instead of writing the buffer directly from inside xfs_iflush return it to
      the caller and let the caller decide what to do with the buffer.  Also
      remove the pincount check in xfs_iflush that all non-blocking callers already
      implement and the now unused flags parameter.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      4c46819a
    • Christoph Hellwig's avatar
      xfs: don't flush inodes from background inode reclaim · 8a48088f
      Christoph Hellwig authored
      We already flush dirty inodes throug the AIL regularly, there is no reason
      to have second thread compete with it and disturb the I/O pattern.  We still
      do write inodes when doing a synchronous reclaim from the shrinker or during
      unmount for now.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      8a48088f
    • Christoph Hellwig's avatar
      xfs: implement freezing by emptying the AIL · 211e4d43
      Christoph Hellwig authored
      Now that we write back all metadata either synchronously or through
      the AIL we can simply implement metadata freezing in terms of
      emptying the AIL.
      
      The implementation for this is fairly simply and straight-forward:
      A new routine is added that asks the xfsaild to push the AIL to the
      end and waits for it to complete and send a wakeup. The routine will
      then loop if the AIL is not actually empty, and continue to do so
      until the AIL is compeltely empty.
      
      We keep an inode reclaim pass in the freeze process to avoid having
      memory pressure have to reclaim inodes that require dirtying the
      filesystem to be reclaimed after the freeze has completed. This
      means we can also treat unmount in the exact same way as freeze.
      
      As an upside we can now remove the radix tree based inode writeback
      and xfs_unmountfs_writesb.
      
      [ Dave Chinner:
      	- Cleaned up commit message.
      	- Added inode reclaim passes back into freeze.
      	- Cleaned up wakeup mechanism to avoid the use of a new
      	  sleep counter variable. ]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      211e4d43
    • Christoph Hellwig's avatar
      xfs: allow assigning the tail lsn with the AIL lock held · 1c304625
      Christoph Hellwig authored
      Provide a variant of xlog_assign_tail_lsn that has the AIL lock already
      held.  By doing so we do an additional atomic_read + atomic_set under
      the lock, which comes down to two instructions.
      
      Switch xfs_trans_ail_update_bulk and xfs_trans_ail_delete_bulk to the
      new version to reduce the number of lock roundtrips, and prepare for
      a new addition that would require a third lock roundtrip in
      xfs_trans_ail_delete_bulk.  This addition is also the reason for
      slightly rearranging the conditionals and relying on xfs_log_space_wake
      for checking that the filesystem has been shut down internally.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      1c304625
    • Christoph Hellwig's avatar
      xfs: remove log item from AIL in xfs_iflush after a shutdown · 32ce90a4
      Christoph Hellwig authored
      If a filesystem has been forced shutdown we are never going to write inodes
      to disk, which means the inode items will stay in the AIL until we free
      the inode. Currently that is not a problem, but a pending change requires us
      to empty the AIL before shutting down the filesystem. In that case leaving
      the inode in the AIL is lethal. Make sure to remove the log item from the AIL
      to allow emptying the AIL on shutdown filesystems.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      32ce90a4
    • Christoph Hellwig's avatar
      xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown · dea96095
      Christoph Hellwig authored
      If a filesystem has been forced shutdown we are never going to write dquots
      to disk, which means the dquot items will stay in the AIL forever.
      Currently that is not a problem, but a pending chance requires us to
      empty the AIL before shutting down the filesystem, in which case this
      behaviour is lethal.  Make sure to remove the log item from the AIL
      to allow emptying the AIL on shutdown filesystems.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      dea96095
    • Shaohua Li's avatar
      xfs: using GFP_NOFS for blkdev_issue_flush · 7582df51
      Shaohua Li authored
      Issuing a block device flush request in transaction context using GFP_KERNEL
      directly can cause deadlocks due to memory reclaim recursion. Use GFP_NOFS to
      avoid recursion from reclaim context.
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      7582df51
    • Dave Chinner's avatar
      xfs: punch all delalloc blocks beyond EOF on write failure. · 01c84d2d
      Dave Chinner authored
      I've been seeing regular ASSERT failures in xfstests when running
      fsstress based tests over the past month. xfs_getbmap() has been
      failing this test:
      
      XFS: Assertion failed: ((iflags & BMV_IF_DELALLOC) != 0) ||
      (map[i].br_startblock != DELAYSTARTBLOCK), file: fs/xfs/xfs_bmap.c,
      line: 5650
      
      where it is encountering a delayed allocation extent after writing
      all the dirty data to disk and then walking the extent map
      atomically by holding the XFS_IOLOCK_SHARED to prevent new delayed
      allocation extents from being created.
      
      Test 083 on a 512 byte block size filesystem was used to reproduce
      the problem, because it only had a 5s run timeand would usually fail
      every 3-4 runs. This test is exercising ENOSPC behaviour by running
      fsstress on a nearly full filesystem. The following trace extract
      shows the final few events on the inode that tripped the assert:
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_setfilesize
       xfs_setfilesize:       isize 0x180000 disize 0x12d400 offset 0x17e200 count 7680
      
      file size updated to 0x180000 by IO completion
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_iomap_write_delay
       xfs_iext_insert:       state  idx 3 offset 3072 block 4503599627239432 count 1 flag 0 caller xfs_bmap_add_extent_hole_delay
       xfs_get_blocks_alloc:  size 0x180000 offset 0x180000 count 512 type  startoff 0xc00 startblock -1 blockcount 0x1
       xfs_ilock:             flags ILOCK_EXCL caller __xfs_get_blocks
      
      delalloc write, adding a single block at offset 0x180000
      
       xfs_delalloc_enospc:   isize 0x180000 disize 0x180000 offset 0x180200 count 512
      
      ENOSPC trying to allocate a dellalloc block at offset 0x180200
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_iomap_write_delay
       xfs_get_blocks_alloc:  size 0x180000 offset 0x180200 count 512 type  startoff 0xc00 startblock -1 blockcount 0x2
      
      And succeeding on retry after flushing dirty inodes.
      
       xfs_ilock:             flags ILOCK_EXCL caller __xfs_get_blocks
       xfs_delalloc_enospc:   isize 0x180000 disize 0x180000 offset 0x180400 count 512
      
      ENOSPC trying to allocate a dellalloc block at offset 0x180400
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_iomap_write_delay
       xfs_delalloc_enospc:   isize 0x180000 disize 0x180000 offset 0x180400 count 512
      
      And failing the retry, giving a real ENOSPC error.
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_vm_write_failed
                                                      ^^^^^^^^^^^^^^^^^^^
      The smoking gun - the write being failed and cleaning up delalloc
      blocks beyond EOF allocated by the failed write.
      
       xfs_getattr:
       xfs_ilock:             flags IOLOCK_SHARED caller xfs_getbmap
       xfs_ilock:             flags ILOCK_SHARED caller xfs_ilock_map_shared
      
      And that's where we died almost immediately afterwards.
      xfs_bmapi_read() found delalloc extent beyond current file in memory
      file size. Some debug I added to xfs_getbmap() showed the state just
      before the assert failure:
      
       ino 0x80e48: off 0xc00, fsb 0xffffffffffffffff, len 0x1, size 0x180000
       start_fsb 0x106, end_fsb 0x638
       ino flags 0x2 nex 0xd bmvcnt 0x555, len 0x3c58a6f23c0bf1, start 0xc00
       ext 0: off 0x1fc, fsb 0x24782, len 0x254
       ext 1: off 0x450, fsb 0x40851, len 0x30
       ext 2: off 0x480, fsb 0xd99, len 0x1b8
       ext 3: off 0x92f, fsb 0x4099a, len 0x3b
       ext 4: off 0x96d, fsb 0x41844, len 0x98
       ext 5: off 0xbf1, fsb 0x408ab, len 0xf
      
      which shows that we found a single delalloc block beyond EOF (first
      line of output) when we were returning the map for a length
      somewhere around 10^16 bytes long (second line), and the on-disk
      extents showed they didn't go past EOF (last lines).
      
      Further debug added to xfs_vm_write_failed() showed this happened
      when punching out delalloc blocks beyond the end of the file after
      the failed write:
      
      [  132.606693] ino 0x80e48: vwf to 0x181000, sze 0x180000
      [  132.609573] start_fsb 0xc01, end_fsb 0xc08
      
      It punched the range 0xc01 -> 0xc08, but the range we really need to
      punch is 0xc00 -> 0xc07 (8 blocks from 0xc00) as this testing was
      run on a 512 byte block size filesystem (8 blocks per page).
      the punch from is 0xc00. So end_fsb is correct, but start_fsb is
      wrong as we punch from start_fsb for (end_fsb - start_fsb) blocks.
      Hence we are not punching the delalloc block beyond EOF in the case.
      
      The fix is simple - it's a silly off-by-one mistake in calculating
      the range. It's especially silly because the macro used to calculate
      the start_fsb already takes into account the case where the inode
      size is an exact multiple of the filesystem block size...
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      01c84d2d
    • Dave Chinner's avatar
      xfs: use shared ilock mode for direct IO writes by default · 507630b2
      Dave Chinner authored
      For the direct IO write path, we only really need the ilock to be taken in
      exclusive mode during IO submission if we need to do extent allocation
      instead of all the time.
      
      Change the block mapping code to take the ilock in shared mode for the
      initial block mapping, and only retake it exclusively when we actually
      have to perform extent allocations.  We were already dropping the ilock
      for the transaction allocation, so this doesn't introduce new race windows.
      
      Based on an earlier patch from Dave Chinner.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      507630b2
    • Christoph Hellwig's avatar
      xfs: push the ilock into xfs_zero_eof · 193aec10
      Christoph Hellwig authored
      Instead of calling xfs_zero_eof with the ilock held only take it internally
      for the minimall required critical section around xfs_bmapi_read.  This
      also requires changing the calling convention for xfs_zero_last_block
      slightly.  The actual zeroing operation is still serialized by the iolock,
      which must be taken exclusively over the call to xfs_zero_eof.
      
      We could in fact use a shared lock for the xfs_bmapi_read calls as long as
      the extent list has been read in, but given that we already hold the iolock
      exclusively there is little reason to micro optimize this further.
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      193aec10
    • Christoph Hellwig's avatar
      xfs: reduce ilock hold times in xfs_setattr_size · f38996f5
      Christoph Hellwig authored
      We do not need the ilock for most checks done in the beginning of
      xfs_setattr_size.  Replace the long critical section before starting the
      transaction with a smaller one around xfs_zero_eof and an optional one
      inside xfs_qm_dqattach that isn't entered unless using quotas.  While
      this isn't a big optimization for xfs_setattr_size itself it will allow
      pushing the ilock into xfs_zero_eof itself later.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      f38996f5
    • Christoph Hellwig's avatar
      xfs: reduce ilock hold times in xfs_file_aio_write_checks · 467f7899
      Christoph Hellwig authored
      We do not need the ilock for generic_write_checks and the i_size_read,
      which are protected by i_mutex and/or iolock, so reduce the ilock
      critical section to just the call to xfs_zero_eof.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      467f7899
    • Christoph Hellwig's avatar
      xfs: avoid taking the ilock unnessecarily in xfs_qm_dqattach · b4d05e30
      Christoph Hellwig authored
      Check if we actually need to attach a dquot before taking the ilock in
      xfs_qm_dqattach.  This avoid superflous lock roundtrips for the common cases
      of quota support compiled in but not activated on a filesystem and an
      inode that already has the dquots attached.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      b4d05e30
  2. 17 Apr, 2012 1 commit
    • Dave Chinner's avatar
      xfs: Ensure inode reclaim can run during quotacheck · 8a00ebe4
      Dave Chinner authored
      Because the mount process can run a quotacheck and consume lots of
      inodes, we need to be able to run periodic inode reclaim during the
      mount process. This will prevent running the system out of memory
      during quota checks.
      
      This essentially reverts 2bcf6e97, but that is safe to do now that
      the quota sync code that was causing problems during long quotacheck
      executions is now gone.
      
      The reclaim work is currently protected from running during the
      unmount process by a check against MS_ACTIVE. Unfortunately, this
      also means that the reclaim work cannot run during mount.  The
      unmount process should stop the reclaim cleanly before freeing
      anything that the reclaim work depends on, so there is no need to
      have this guard in place.
      
      Also, the inode reclaim work is demand driven, so there is no need
      to start it immediately during mount. It will be started the moment
      an inode is queued for reclaim, so qutoacheck will trigger it just
      fine.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      8a00ebe4
  3. 16 Apr, 2012 2 commits
  4. 08 Apr, 2012 1 commit
  5. 07 Apr, 2012 13 commits
    • Linus Torvalds's avatar
      Merge tag 'regmap-3.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap · f4e52e7f
      Linus Torvalds authored
      Pull two more small regmap fixes from Mark Brown:
       - Now we have users for it that aren't running Android it turns out
         that regcache_sync_region() is much more useful to drivers if it's
         exported for use by modules.  Who knew?
       - Make sure we don't divide by zero when doing debugfs dumps of
         rbtrees, not visible up until now because everything was providing at
         least some cache on startup.
      
      * tag 'regmap-3.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
        regmap: prevent division by zero in rbtree_show
        regmap: Export regcache_sync_region()
      f4e52e7f
    • Linus Torvalds's avatar
      Merge branch 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm · a3fac080
      Linus Torvalds authored
      Pull a few KVM fixes from Avi Kivity:
       "A bunch of powerpc KVM fixes, a guest and a host RCU fix (unrelated),
        and a small build fix."
      
      * 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: Resolve RCU vs. async page fault problem
        KVM: VMX: vmx_set_cr0 expects kvm->srcu locked
        KVM: PMU: Fix integer constant is too large warning in kvm_pmu_set_msr()
        KVM: PPC: Book3S: PR: Fix preemption
        KVM: PPC: Save/Restore CR over vcpu_run
        KVM: PPC: Book3S HV: Save and restore CR in __kvmppc_vcore_entry
        KVM: PPC: Book3S HV: Fix kvm_alloc_linear in case where no linears exist
        KVM: PPC: Book3S: Compile fix for ppc32 in HIOR access code
      a3fac080
    • Linus Torvalds's avatar
      Merge tag 'sh-for-linus' of git://github.com/pmundt/linux-sh · 664481ed
      Linus Torvalds authored
      Pull SuperH fixes from Paul Mundt.
      
      * tag 'sh-for-linus' of git://github.com/pmundt/linux-sh:
        sh: fix clock-sh7757 for the latest sh_mobile_sdhi driver
        serial: sh-sci: use serial_port_in/out vs sci_in/out.
        sh: vsyscall: Fix up .eh_frame generation.
        sh: dma: Fix up device attribute mismatch from sysdev fallout.
        sh: dwarf unwinder depends on SHcompact.
        sh: fix up fallout from system.h disintegration.
      664481ed
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security · d6a624ee
      Linus Torvalds authored
      Pull security layer fixlet from James Morris.
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
        sysctl: fix write access to dmesg_restrict/kptr_restrict
      d6a624ee
    • Linus Torvalds's avatar
      Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux · f21fec96
      Linus Torvalds authored
      Pull ACPI & Power Management patches from Len Brown:
       "Two fixes for cpuidle merge-window changes, plus a URL fix in
        MAINTAINERS"
      
      * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
        MAINTAINERS: Update git url for ACPI
        cpuidle: Fix panic in CPU off-lining with no idle driver
        ACPI processor: Use safe_halt() rather than halt() in acpi_idle_play_dead()
      f21fec96
    • Linus Torvalds's avatar
      Merge branch '3.4-rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · a0421da4
      Linus Torvalds authored
      Pull target fixes from Nicholas Bellinger:
       "Pull two tcm_fc fabric related fixes for -rc2:
      
        Note that both have been CC'ed to stable, and patch #1 is the
        important one that addresses a memory corruption bug related to FC
        exchange timeouts + command abort.
      
        Thanks again to MDR for tracking down this issue!"
      
      * '3.4-rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
        tcm_fc: Do not free tpg structure during wq allocation failure
        tcm_fc: Add abort flag for gracefully handling exchange timeout
      a0421da4
    • Mark Rustad's avatar
      tcm_fc: Do not free tpg structure during wq allocation failure · 06383f10
      Mark Rustad authored
      Avoid freeing a registered tpg structure if an alloc_workqueue call
      fails.  This fixes a bug where the failure was leaking memory associated
      with se_portal_group setup during the original core_tpg_register() call.
      Signed-off-by: default avatarMark Rustad <mark.d.rustad@intel.com>
      Acked-by: default avatarKiran Patil <Kiran.patil@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      06383f10
    • Mark Rustad's avatar
      tcm_fc: Add abort flag for gracefully handling exchange timeout · e1c40382
      Mark Rustad authored
      Add abort flag and use it to terminate processing when an exchange
      is timed out or is reset. The abort flag is used in place of the
      transport_generic_free_cmd function call in the reset and timeout
      cases, because calling that function in that context would free
      memory that was in use. The aborted flag allows the lifetime to
      be managed in a more normal way, while truncating the processing.
      
      This change eliminates a source of memory corruption which
      manifested in a variety of ugly ways.
      
      (nab: Drop unused struct fc_exch *ep in ft_recv_seq)
      Signed-off-by: default avatarMark Rustad <mark.d.rustad@intel.com>
      Acked-by: default avatarKiran Patil <Kiran.patil@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      e1c40382
    • Len Brown's avatar
      eeaab2d8
    • Igor Murzov's avatar
      MAINTAINERS: Update git url for ACPI · aaef292a
      Igor Murzov authored
      Signed-off-by: default avatarIgor Murzov <e-mail@date.by>
      Signed-off-by: default avatarLen Brown <len.brown@intel.com>
      aaef292a
    • Linus Torvalds's avatar
      Merge branch 'stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile · 4157368e
      Linus Torvalds authored
      Pull arch/tile bug fixes from Chris Metcalf:
       "This includes Paul Gortmaker's change to fix the <asm/system.h>
        disintegration issues on tile, a fix to unbreak the tilepro ethernet
        driver, and a backlog of bugfix-only changes from internal Tilera
        development over the last few months.
      
        They have all been to LKML and on linux-next for the last few days.
        The EDAC change to MAINTAINERS is an oddity but discussion on the
        linux-edac list suggested I ask you to pull that change through my
        tree since they don't have a tree to pull edac changes from at the
        moment."
      
      * 'stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: (39 commits)
        drivers/net/ethernet/tile: fix netdev_alloc_skb() bombing
        MAINTAINERS: update EDAC information
        tilepro ethernet driver: fix a few minor issues
        tile-srom.c driver: minor code cleanup
        edac: say "TILEGx" not "TILEPro" for the tilegx edac driver
        arch/tile: avoid accidentally unmasking NMI-type interrupt accidentally
        arch/tile: remove bogus performance optimization
        arch/tile: return SIGBUS for addresses that are unaligned AND invalid
        arch/tile: fix finv_buffer_remote() for tilegx
        arch/tile: use atomic exchange in arch_write_unlock()
        arch/tile: stop mentioning the "kvm" subdirectory
        arch/tile: export the page_home() function.
        arch/tile: fix pointer cast in cacheflush.c
        arch/tile: fix single-stepping over swint1 instructions on tilegx
        arch/tile: implement panic_smp_self_stop()
        arch/tile: add "nop" after "nap" to help GX idle power draw
        arch/tile: use proper memparse() for "maxmem" options
        arch/tile: fix up locking in pgtable.c slightly
        arch/tile: don't leak kernel memory when we unload modules
        arch/tile: fix bug in delay_backoff()
        ...
      4157368e
    • Linus Torvalds's avatar
      Merge tag 'stable/for-linus-3.4-rc1-tag' of... · 9479f0f8
      Linus Torvalds authored
      Merge tag 'stable/for-linus-3.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
      
      Pull xen fixes from Konrad Rzeszutek Wilk:
       "Two fixes for regressions:
         * one is a workaround that will be removed in v3.5 with proper fix in
           the tip/x86 tree,
         * the other is to fix drivers to load on PV (a previous patch made
           them only load in PVonHVM mode).
      
        The rest are just minor fixes in the various drivers and some cleanup
        in the core code."
      
      * tag 'stable/for-linus-3.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
        xen/pcifront: avoid pci_frontend_enable_msix() falsely returning success
        xen/pciback: fix XEN_PCI_OP_enable_msix result
        xen/smp: Remove unnecessary call to smp_processor_id()
        xen/x86: Workaround 'x86/ioapic: Add register level checks to detect bogus io-apic entries'
        xen: only check xen_platform_pci_unplug if hvm
      9479f0f8
    • Linus Torvalds's avatar
      Merge tag 'mmc-fixes-for-3.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc · 1ddca057
      Linus Torvalds authored
      Pull MMC fixes from Chris Ball:
       - Disable use of MSI in sdhci-pci, which caused multiple chipsets to
         stop working in 3.4-rc1.  I'll wait to turn this on again until we
         have a chipset whitelist for it.
       - Fix a libertas SDIO powered-resume regression introduced in 3.3;
         thanks to Neil Brown and Rafael Wysocki for this fix.
       - Fix module reloading on omap_hsmmc.
       - Stop trusting the spec/card's specified maximum data timeout length,
         and use three seconds instead.  Previously we used 300ms.
      
      Also cleanups and fixes for s3c, atmel, sh_mmcif and omap_hsmmc.
      
      * tag 'mmc-fixes-for-3.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (28 commits)
        mmc: use really long write timeout to deal with crappy cards
        mmc: sdhci-dove: Fix compile error by including module.h
        mmc: Prevent 1.8V switch for SD hosts that don't support UHS modes.
        Revert "mmc: sdhci-pci: Add MSI support"
        Revert "mmc: sdhci-pci: add quirks for broken MSI on O2Micro controllers"
        mmc: core: fix power class selection
        mmc: omap_hsmmc: fix module re-insertion
        mmc: omap_hsmmc: convert to module_platform_driver
        mmc: omap_hsmmc: make it behave well as a module
        mmc: omap_hsmmc: trivial cleanups
        mmc: omap_hsmmc: context save after enabling runtime pm
        mmc: omap_hsmmc: use runtime put sync in probe error patch
        mmc: sdio: Use empty system suspend/resume callbacks at the bus level
        mmc: bus: print bus speed mode of UHS-I card
        mmc: sdhci-pci: add quirks for broken MSI on O2Micro controllers
        mmc: sh_mmcif: Simplify calculation of mmc->f_min
        mmc: sh_mmcif: mmc->f_max should be half of the bus clock
        mmc: sh_mmcif: double clock speed
        mmc: block: Remove use of mmc_blk_set_blksize
        mmc: atmel-mci: add support for odd clock dividers
        ...
      1ddca057
  6. 06 Apr, 2012 1 commit
    • Linus Torvalds's avatar
      Make the "word-at-a-time" helper functions more commonly usable · f68e556e
      Linus Torvalds authored
      I have a new optimized x86 "strncpy_from_user()" that will use these
      same helper functions for all the same reasons the name lookup code uses
      them.  This is preparation for that.
      
      This moves them into an architecture-specific header file.  It's
      architecture-specific for two reasons:
      
       - some of the functions are likely to want architecture-specific
         implementations.  Even if the current code happens to be "generic" in
         the sense that it should work on any little-endian machine, it's
         likely that the "multiply by a big constant and shift" implementation
         is less than optimal for an architecture that has a guaranteed fast
         bit count instruction, for example.
      
       - I expect that if architectures like sparc want to start playing
         around with this, we'll need to abstract out a few more details (in
         particular the actual unaligned accesses).  So we're likely to have
         more architecture-specific stuff if non-x86 architectures start using
         this.
      
         (and if it turns out that non-x86 architectures don't start using
         this, then having it in an architecture-specific header is still the
         right thing to do, of course)
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f68e556e