1. 08 Jul, 2011 8 commits
    • 
      xfs: improve sync behaviour in the face of aggressive dirtying · 33b8f7c2
      Christoph Hellwig authored
      The following script from Wu Fengguang shows very bad sync behaviour
      on XFS when data is being aggressively dirtied, with sync times up to
      almost 10 times as long as on ext4.
      
      A large part of the issue is that XFS writes data out itself two times
      in the ->sync_fs method, overriding the livelock protection in the core
      writeback code, and another issue is the lock-less xfs_ioend_wait call,
      which doesn't prevent new ioends from being queued up while waiting for
      the count to reach zero.
      
      This patch removes the XFS-internal sync calls and relies on the VFS
      to do its work just like all other filesystems do.  Note that the
      rather suboptimal i_iocount wait is simply removed here.  We already
      do it in ->write_inode, which keeps the current suboptimal behaviour.
      We'll eventually need to remove that as well, but that's material for
      a separate commit.
      
      ------------------------------ snip ------------------------------
      #!/bin/sh
      
      umount /dev/sda7
      mkfs.xfs -f /dev/sda7
      # mkfs.ext4 /dev/sda7
      # mkfs.btrfs /dev/sda7
      mount /dev/sda7 /fs
      
      echo $((50<<20)) > /proc/sys/vm/dirty_bytes
      
      pid=
      for i in `seq 10`
      do
      	dd if=/dev/zero of=/fs/zero-$i bs=1M count=1000 &
      	pid="$pid $!"
      done
      
      sleep 1
      
      tic=$(date +'%s')
      sync
      tac=$(date +'%s')
      
      echo
      echo sync time: $((tac-tic))
      egrep '(Dirty|Writeback|NFS_Unstable)' /proc/meminfo
      
      pidof dd > /dev/null && { kill -9 $pid; echo sync NOT livelocked; }
      ------------------------------ snip ------------------------------
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reported-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • 
      xfs: split xfs_itruncate_finish · 8f04c47a
      Christoph Hellwig authored
      Split the guts of xfs_itruncate_finish that loop over the existing extents
      and call xfs_bunmapi on them into a new helper, xfs_itruncate_extents.
      Make xfs_attr_inactive call it directly instead of xfs_itruncate_finish,
      which allows us to simplify the latter a lot, by only letting it deal with
      the data fork.  As a result xfs_itruncate_finish is renamed to
      xfs_itruncate_data to make its use case more obvious.
      
      Also remove the sync parameter from xfs_itruncate_data, which has been
      unnecessary since the introduction of the busy extent list in 2002, and
      completely dead code since 2003 when the XFS_BMAPI_ASYNC parameter was
      made a no-op.
      
      I can't actually see why xfs_attr_inactive needs to set the transaction
      sync, but let's keep this patch simple and without changes in behaviour.
      
      Also avoid passing a useless argument to xfs_isize_check, and make it
      private to xfs_inode.c.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • 
      xfs: kill xfs_itruncate_start · 857b9778
      Christoph Hellwig authored
      xfs_itruncate_start is a rather lengthy wrapper that evaluates to a call
      to xfs_ioend_wait and xfs_tosspages, and only has two callers.
      
      Instead of using the complicated checks left over from IRIX for when to
      truncate the pagecache, just call xfs_tosspages
      (aka truncate_inode_pages) directly, as we want to get rid of all data
      after i_size, and truncate_inode_pages handles incorrect alignments
      and too-large offsets just fine.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • 
      xfs: always log timestamp updates in xfs_setattr_size · 681b1200
      Christoph Hellwig authored
      Get rid of the special case where we use unlogged timestamp updates for
      a truncate to the current inode size, and just call xfs_setattr_nonsize
      for it so that it is treated like a utimes call.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • 
      xfs: split xfs_setattr · c4ed4243
      Christoph Hellwig authored
      Split up xfs_setattr into two functions, one for the complex truncate
      handling, and one for the trivial attribute updates.  Also move both
      new routines to xfs_iops.c as they are fairly Linux-specific.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • 
      xfs: work around bogus gcc warning in xfs_allocbt_init_cursor · dec58f1d
      Christoph Hellwig authored
      GCC 4.6 complains that an array subscript is above array bounds when
      using the btree index to index into the agf_levels array.  The only
      two indices passed in are 0 and 1, and we have an assert ensuring that.
      
      Replace the trick of using the array index directly with using constants
      in the already existing branch for assigning the XFS_BTREE_LASTREC_UPDATE
      flag.
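The shape of this workaround can be illustrated with a minimal, self-contained C sketch; the struct and function names below are simplified stand-ins, not the actual xfs_allocbt_init_cursor code:

```c
#include <assert.h>

struct fake_agf {
	unsigned levels[2];	/* stand-in for the agf_levels[] array */
};

/* Indexing with the runtime btree number is the form gcc 4.6 flagged
 * as possibly out of bounds, despite the assert. */
unsigned level_by_index(const struct fake_agf *agf, int btnum)
{
	assert(btnum == 0 || btnum == 1);
	return agf->levels[btnum];
}

/* The workaround: use constant indices in the branch that already
 * distinguishes the two btree types, so the bounds are provable. */
unsigned level_by_branch(const struct fake_agf *agf, int btnum)
{
	if (btnum == 0)
		return agf->levels[0];
	return agf->levels[1];
}
```

Since the compiler sees only constant subscripts, it can prove both accesses are in bounds and the bogus warning goes away, with no change in behaviour.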
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • 
      xfs: re-enable non-blocking behaviour in xfs_map_blocks · dbcdde3e
      Christoph Hellwig authored
      The non-blocking behaviour in xfs_vm_writepage is currently conditional on
      having both the WB_SYNC_NONE sync_mode and the nonblocking flag set.
      The latter used to be set by pdflush, kswapd and a few other places
      in older kernels, but has been fading out since the introduction
      of the per-bdi flusher threads.
      
      Enable the non-blocking behaviour for all WB_SYNC_NONE calls to get back
      the behaviour we want.
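The change in the condition can be sketched in a few self-contained lines of C; the struct and names are simplified stand-ins for the kernel's writeback_control, not the actual code:

```c
#include <assert.h>
#include <stdbool.h>

enum wb_sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };

struct fake_wbc {
	enum wb_sync_mode sync_mode;
	bool nonblocking;	/* legacy flag, no longer set by flusher threads */
};

/* Old test: both conditions required, so per-bdi flusher threads
 * (which leave ->nonblocking clear) never took the non-blocking path. */
bool old_is_nonblocking(const struct fake_wbc *wbc)
{
	return wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking;
}

/* New test: every WB_SYNC_NONE writeback is treated as non-blocking. */
bool new_is_nonblocking(const struct fake_wbc *wbc)
{
	return wbc->sync_mode == WB_SYNC_NONE;
}
```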
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • 
      xfs: PF_FSTRANS should never be set in ->writepage · 680a647b
      Christoph Hellwig authored
      Now that we reject direct reclaim in addition to always using GFP_NOFS
      allocation, there's no chance we'll ever end up in ->writepage with
      PF_FSTRANS set.  Add a WARN_ON if we hit this case, and stop checking
      if we'd actually need to start a transaction.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  2. 06 Jul, 2011 1 commit
    • 
      xfs: unpin stale inodes directly in IOP_COMMITTED · 1316d4da
      Dave Chinner authored
      When inodes are marked stale in a transaction, they are treated
      specially when the inode log item is being inserted into the AIL.
      It tries to avoid moving the log item forward in the AIL due to a
      race condition with writing the underlying buffer back to disk.
      This was "fixed" in commit de25c181 ("xfs: avoid moving stale inodes
      in the AIL").
      
      To avoid moving the item forward, we return a LSN smaller than the
      commit_lsn of the completing transaction, thereby trying to trick
      the commit code into not moving the inode forward at all. I'm not
      sure this ever worked as intended - it assumes the inode is already
      in the AIL, but I don't think the returned LSN would have been small
      enough to prevent moving the inode. It appears that the reason it
      worked is that the lower LSN of the inodes meant they were inserted
      into the AIL and flushed before the inode buffer (which was moved to
      the commit_lsn of the transaction).
      
      The big problem is that with delayed logging, the returning of the
      different LSN means insertion takes the slow, non-bulk path.  Worse
      yet is that insertion is to a position -before- the commit_lsn so it
      is doing an AIL traversal on every insertion, and has to walk over
      all the items that have already been inserted into the AIL. It's
      expensive.
      
      To compound the matter further, with delayed logging inodes are
      likely to go from clean to stale in a single checkpoint, which means
      they aren't even in the AIL at all when we come across them at AIL
      insertion time. Hence these were all getting inserted into the AIL
      when they simply do not need to be, as inodes marked XFS_ISTALE are
      never written back.
      
      Transactional/recovery integrity is maintained in this case by the
      other items in the unlink transaction that were modified (e.g. the
      AGI btree blocks) and committed in the same checkpoint.
      
      So to fix this, simply unpin the stale inodes directly in
      xfs_inode_item_committed() and return -1 to indicate that the AIL
      insertion code does not need to do any further processing of these
      inodes.
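The control flow of the fix can be sketched in self-contained C; the type, flag, and function names below are simplified stand-ins, not the actual xfs_inode_item_committed code:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t fake_lsn_t;	/* stand-in for xfs_lsn_t */
#define FAKE_ISTALE 0x1		/* stand-in for XFS_ISTALE */

/* ->iop_committed() sketch: for a stale inode, unpin it directly and
 * return -1 so the caller skips AIL insertion entirely; otherwise
 * return the commit LSN as usual. */
fake_lsn_t inode_item_committed(unsigned iflags, fake_lsn_t commit_lsn,
				int *unpinned)
{
	if (iflags & FAKE_ISTALE) {
		*unpinned = 1;		/* the real code would unpin here */
		return (fake_lsn_t)-1;	/* AIL: nothing further to do */
	}
	return commit_lsn;
}
```

The -1 sentinel sidesteps both the slow non-bulk insertion path and the backwards AIL traversal, since an inode that will never be written back has no business in the AIL at all.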
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
  3. 24 Jun, 2011 3 commits
    • 
      xfs: prevent bogus assert when trying to remove non-existent attribute · 4a338212
      Dave Chinner authored
      If the attribute fork on an inode is in btree format and has
      multiple levels (i.e node format rather than leaf format), then a
      lookup failure will trigger an assert failure in xfs_da_path_shift
      if the flag XFS_DA_OP_OKNOENT is not set. This flag is used to
      indicate to the directory btree code that not finding an entry is
      not a fatal error. In the case of doing a lookup for a directory
      name removal, this is valid as a user cannot insert an arbitrary
      name to remove from the directory btree.
      
      However, in the case of the attribute tree, a user has direct
      control over the attribute name and can ask for any random name to
      be removed without any validation. In this case, fsstress is asking
      for a non-existent user.selinux attribute to be removed, and that is
      causing xfs_da_path_shift() to fall off the bottom of the tree where
      it asserts that a lookup failure is allowed. Because the flag is not
      set, we die a horrible death on a debug-enabled kernel.
      
      Prevent this assert from firing on attribute removes by adding the
      op_flag XFS_DA_OP_OKNOENT to attribute removal operations.
      
      Discovered when testing on an SELinux-enabled system, where fsstress
      in test 070 tried to remove a non-existent user.selinux attribute.
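The role of the flag can be shown with a tiny self-contained C sketch; the constants and function below are illustrative stand-ins, not the actual xfs_da_path_shift code:

```c
#include <assert.h>

#define FAKE_OP_OKNOENT 0x1	/* stand-in for XFS_DA_OP_OKNOENT */
#define FAKE_ENOATTR 93		/* illustrative errno-style value */

/* Lookup-path sketch: without OKNOENT a miss trips the assert (fatal
 * on a debug kernel); with it, the miss is simply reported back to
 * the caller as an error. */
int attr_remove_lookup(int found, unsigned op_flags)
{
	if (!found) {
		assert(op_flags & FAKE_OP_OKNOENT);
		return -FAKE_ENOATTR;
	}
	return 0;
}
```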
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • 
      xfs: clear XFS_IDIRTY_RELEASE on truncate down · df4368a1
      Dave Chinner authored
      When an inode is truncated down, speculative preallocation is
      removed from the inode. This should also reset the state bits for
      controlling whether preallocation is subsequently removed when the
      file is next closed. The flag is not being cleared, so repeated
      operations on a file that first involve a truncate (e.g. multiple
      repeated dd invocations on a file) give different file layouts for
      the second and subsequent invocations.
      
      Fix this by clearing the XFS_IDIRTY_RELEASE state bit when the
      XFS_ITRUNCATED bit is detected in xfs_release() and hence ensure
      that speculative delalloc is removed on files that have been
      truncated down.
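The flag handling described above can be sketched in self-contained C; the bit names and function are simplified stand-ins, not the actual xfs_release code:

```c
#include <assert.h>

#define FAKE_IDIRTY_RELEASE 0x1	/* stand-in for XFS_IDIRTY_RELEASE */
#define FAKE_ITRUNCATED     0x2	/* stand-in for XFS_ITRUNCATED */

/* xfs_release()-style sketch: when a truncate-down is detected, clear
 * the dirty-release bit too, so speculative preallocation is trimmed
 * again on the next close instead of being left in place. */
unsigned release_clear_flags(unsigned iflags)
{
	if (iflags & FAKE_ITRUNCATED)
		iflags &= ~(FAKE_ITRUNCATED | FAKE_IDIRTY_RELEASE);
	return iflags;
}
```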
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • 
      xfs: reset inode per-lifetime state when recycling it · 778e24bb
      Dave Chinner authored
      XFS inodes have several per-lifetime state fields that determine the
      behaviour of the inode. These state fields are not all reset when an
      inode is reused from the reclaimable state.
      
      This can lead to unexpected behaviour of the new inode such as
      speculative preallocation not being truncated away in the expected
      manner for local files until the inode is subsequently truncated,
      freed or cycles out of the cache. It can also lead to an inode being
      considered to be a filestream inode or having been truncated when
      that is not the case.
      
      Rework the reinitialisation of the inode when it is recycled to
      ensure that it is pristine before it is reused. While there, also
      fix the resetting of state flags in the recycling error paths so the
      inode does not become unreclaimable.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
  4. 16 Jun, 2011 1 commit
    • 
      xfs: make log devices with write back caches work · a27a263b
      Christoph Hellwig authored
      There's no reason not to support cache flushing on external log devices.
      The only thing this really requires is flushing the data device first
      both in fsync and log commits.  A side effect is that we also have to
      remove the barrier write test during mount, which has been superfluous
      since the introduction of the FLUSH+FUA code anyway.  Also take the
      chance to flush the RT subvolume write cache before the fsync commit,
      which is required for correct semantics.
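The ordering requirement can be sketched in self-contained C; the device names and functions are illustrative stand-ins, not the actual XFS log code:

```c
#include <assert.h>
#include <string.h>

/* Records the order in which device caches get flushed. */
const char *flush_order[3];
int flush_count;

void flush_cache(const char *dev)
{
	flush_order[flush_count++] = dev;
}

/* Commit sketch for an external log device: the data device (and the
 * RT subvolume, if present) must be flushed before the log write
 * reaches stable storage, or fsync could expose stale file data. */
void external_log_commit(void)
{
	flush_cache("datadev");
	flush_cache("rtdev");
	flush_cache("logdev");	/* the log write itself uses FLUSH+FUA */
}
```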
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
  5. 14 Jun, 2011 1 commit
  6. 06 Jun, 2011 5 commits
  7. 04 Jun, 2011 17 commits
  8. 03 Jun, 2011 4 commits