1. 23 Sep, 2014 7 commits
    • f6d31f4b
    • xfs: only writeback and truncate pages for the freed range · 8b5279e3
      Brian Foster authored
      xfs_free_file_space() only affects the range of the file for which space
      is being freed. It currently writes and truncates the page cache from
      the start offset of the free to EOF.
      
      Modify xfs_free_file_space() to write back and truncate page cache of
      just the range being freed.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      8b5279e3
    • xfs: writeback and inval. file range to be shifted by collapse · f71721d0
      Brian Foster authored
      The collapse range operation currently writes the entire file before
      starting the collapse to avoid changes in the in-core extent list due to
      writeback causing the extent count to change. Now that collapse range is
      fsb based rather than extent index based it can sustain changes in the
      extent list during the shift sequence without disruption.
      
      Modify xfs_collapse_file_space() to writeback and invalidate pages
      associated with the range of the file to be shifted.
      xfs_free_file_space() currently has similar behavior, but the space free
      need only affect the region of the file that is freed and this could
      change in the future.
      
      Also update the comments to reflect the current implementation. We
      retain the eofblocks trim permanently as a best option for dealing with
      delalloc extents. We don't shift delalloc extents because this scenario
      only occurs with post-eof preallocation (since data must be flushed such
      that the cache can be invalidated and data can be shifted). That means
      said space must also be initialized before being shifted into the
      accessible region of the file only to be immediately truncated off as
      the last part of the collapse. In other words, the eofblocks trim will
      happen anyways, we just run it first to ensure the file remains in a
      consistent state throughout the collapse.
      
      Finally, detect and fail explicitly in the event of a delalloc extent
      during the extent shift. The implementation does not support delalloc
      extents and the caller is expected to prevent this scenario in advance
      as is done by collapse.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      f71721d0
    • xfs: refactor single extent shift into xfs_bmse_shift_one() helper · a979bdfe
      Brian Foster authored
      xfs_bmap_shift_extents() has a variety of conditions and error checks
      that make the logic difficult to follow and indent heavy. Refactor the
      loop body of this function into a new xfs_bmse_shift_one() helper. This
      simplifies the error checks, eliminates the index-decrement-on-merge hack by
      pushing the index increment down into the helper, and makes the code
      more readable by reducing multiple levels of indentation.
      
      This is a code refactor only. The behavior of extent shift and collapse
      range is not modified.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      a979bdfe
    • xfs: refactor shift-by-merge into xfs_bmse_merge() helper · ddb19e31
      Brian Foster authored
      The extent shift mechanism in xfs_bmap_shift_extents() is complicated
      and handles several different, non-deterministic scenarios. These
      include extent shifts, extent merges and potential btree updates in
      either of the former scenarios.
      
      Refactor the code to be more linear and readable. The loop logic in
      xfs_bmap_shift_extents() and some initial error checking is adjusted
      slightly. The associated btree lookup and update/delete operations are
      condensed into single blocks of code. This reduces the number of
      btree-specific blocks and facilitates the separation of the merge
      operation into new xfs_bmse_merge() and xfs_bmse_can_merge() helpers.
      
      This is a code refactor only. The behavior of extent shift and collapse
      range is not modified.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      ddb19e31
    • xfs: track collapse via file offset rather than extent index · 2c845f5a
      Brian Foster authored
      The collapse range implementation uses a transaction per extent shift.
      The progress of the overall operation is tracked via the current extent
      index of the in-core extent list. This is racy because the ilock must be
      dropped and reacquired for each transaction according to locking and log
      reservation rules. Therefore, writeback to prior regions of the file is
      possible and can change the extent count. This changes the extent to
      which the current index refers and causes the collapse to fail mid
      operation. To avoid this problem, the entire file is currently written
      back before the collapse operation starts.
      
      To eliminate the need to flush the entire file, use the file offset
      (fsb) to track the progress of the overall extent shift operation rather
      than the extent index. Modify xfs_bmap_shift_extents() to
      unconditionally convert the start_fsb parameter to an extent index and
      return the file offset of the extent where the shift left off, if
      further extents exist. The bulk of this function can remain based on
      extent index as ilock is held by the caller. xfs_collapse_file_space()
      now uses the fsb output as the starting point for the subsequent shift.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      2c845f5a
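      The idea of tracking progress by file offset rather than by extent
      index can be illustrated with a small user-space sketch. This is not
      the XFS code: struct toy_extent, toy_lookup() and toy_shift_one() are
      hypothetical names, and the extent list is a plain sorted array. Each
      call re-resolves the starting offset to an index, so the list may
      change between calls without confusing the shift.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified extent record; not the XFS in-core format. */
struct toy_extent {
	unsigned long startoff;   /* first file block covered by the extent */
	unsigned long blockcount; /* length of the extent in blocks */
};

/* Index of the first extent starting at or after fsb, or n if there is none. */
size_t toy_lookup(const struct toy_extent *ext, size_t n, unsigned long fsb)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ext[i].startoff >= fsb)
			return i;
	return n;
}

/*
 * Shift the extent found at *next_fsb left by `shift` blocks. The offset is
 * converted to an index on every call, so earlier entries may be split or
 * merged between calls. Returns 1 and stores the offset of the following
 * extent in *next_fsb if one exists, 0 when nothing is left to shift.
 */
int toy_shift_one(struct toy_extent *ext, size_t n, unsigned long shift,
		  unsigned long *next_fsb)
{
	size_t idx = toy_lookup(ext, n, *next_fsb);

	if (idx == n)
		return 0;
	assert(ext[idx].startoff >= shift);
	ext[idx].startoff -= shift;
	if (idx + 1 == n)
		return 0;
	*next_fsb = ext[idx + 1].startoff;
	return 1;
}
```

      A caller would loop on toy_shift_one(), passing the returned offset
      back in; in this model, dropping and retaking a lock between
      iterations is harmless because no index survives across calls.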
    • xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly · 0d085a52
      Dave Chinner authored
      XFS has been having trouble with stray delayed allocation extents
      beyond EOF for a long time. Recent changes to the collapse range
      code have triggered erroneous EBUSY errors on page invalidation for
      filesystems with a block size smaller than the page size. These
      have been caused by dirty buffers beyond EOF on a partial page which
      do not get written to disk during a sync.
      
      The issue is that write-ahead in xfs_cluster_write() finds such a
      partial page and handles it by leaving the page dirty but pushing it
      into a writeback state. This used to work just fine, as the
      write_cache_pages() code would then find the dirty partial page in
      the next mapping tree lookup as the dirty tag is still set.
      
      Unfortunately, when we moved to a mark and sweep approach to
      writeback to fix other writeback sync issues, we broke this. The
      act of marking the page as under writeback now clears the TOWRITE
      tag in the radix tree, even though the page is still dirty. With
      the TOWRITE tag cleared, the next lookup on the mapping tree does
      not find the dirty partial page and so doesn't try to write it again.
      
      This same writeback bug was found recently in ext4 and fixed in
      commit 1c8349a1 ("ext4: fix data integrity sync in ordered mode")
      without communication to the wider filesystem community. We can use
      exactly the same fix here so the TOWRITE flag is not cleared on
      partial page writes.
      
      cc: stable@vger.kernel.org # dependent on 1c8349a1
      Root-cause-found-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      0d085a52
  2. 09 Sep, 2014 12 commits
    • a4241aeb
    • xfs: remove rbpp check from xfs_rtmodify_summary_int · ab6978c2
      Eric Sandeen authored
      rbpp is always passed into xfs_rtmodify_summary
      and xfs_rtget_summary, so there is no need to
      test for it in xfs_rtmodify_summary_int.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      ab6978c2
    • xfs: combine xfs_rtmodify_summary and xfs_rtget_summary · afabfd30
      Eric Sandeen authored
      xfs_rtmodify_summary and xfs_rtget_summary are almost identical;
      fold them into xfs_rtmodify_summary_int(), with wrappers for each of
      the original calls.
      
      The _int function modifies if a delta is passed, and returns a
      summary pointer if *sum is passed.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      afabfd30
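      The "one internal function, two thin wrappers" shape described in
      this commit can be sketched in plain C. The names below
      (toy_summary_int() and its wrappers) and the flat array are
      assumptions made for illustration; the real functions operate on
      realtime summary buffers and take different arguments.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical combined helper: applies `delta` if it is non-zero and
 * reports the resulting value through `sum` if `sum` is non-NULL.
 * Returns 0 on success, -1 if idx is out of range.
 */
int toy_summary_int(long *table, size_t n, size_t idx, long delta, long *sum)
{
	if (idx >= n)
		return -1;
	if (delta)
		table[idx] += delta;
	if (sum)
		*sum = table[idx];
	return 0;
}

/* Wrapper for the "modify" call: pass a delta, request no summary. */
int toy_summary_modify(long *table, size_t n, size_t idx, long delta)
{
	return toy_summary_int(table, n, idx, delta, NULL);
}

/* Wrapper for the "get" call: pass no delta, request the summary. */
int toy_summary_get(long *table, size_t n, size_t idx, long *sum)
{
	return toy_summary_int(table, n, idx, 0, sum);
}
```

      Keeping the wrappers means existing callers do not change, while the
      lookup logic lives in one place.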
    • xfs: combine xfs_dir_canenter into xfs_dir_createname · b16ed7c1
      Eric Sandeen authored
      xfs_dir_canenter and xfs_dir_createname are
      almost identical.
      
      Fold the former into the latter, with a helpful
      wrapper for the former.  If createname is called without
      an inode number, it now only checks for space, and does
      not actually add the entry.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      b16ed7c1
    • xfs: check resblks before calling xfs_dir_canenter · 94f3cad5
      Eric Sandeen authored
      Move the resblks test out of the xfs_dir_canenter,
      and into the caller.
      
      This makes a little more sense on the face of it;
      xfs_dir_canenter immediately returns if resblks != 0;
      and given some of the comments preceding the calls:
      
       * Check for ability to enter directory entry, if no space reserved.
      
      even more so.
      
      It also facilitates the next patch.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      94f3cad5
    • xfs: deduplicate xlog_do_recovery_pass() · 970fd3f0
      Eric Sandeen authored
      In xlog_do_recovery_pass(), there are 2 distinct cases:
      non-wrapped and wrapped log recovery.
      
      If we find a wrapped log, we recover around the end
      of the log, and then handle the rest of recovery
      exactly as in the non-wrapped case - using exactly the same
      (duplicated) code.
      
      Rather than having the same code in both cases, we can
      get the wrapped portion out of the way first if needed,
      and then recover the non-wrapped portion of the log.
      
      There should be no functional change here, just code
      reorganization & deduplication.
      
      The patch looks a bit bigger than it really is; the last
      hunk is whitespace changes (un-indenting).
      
      Tested with xfstests "check -g log" on a stock configuration.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      970fd3f0
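      The restructuring described here (handle the wrapped tail of the log
      first, then fall through to a single non-wrapped loop) can be shown
      on a toy circular buffer. toy_recover() and its array-of-ints "log"
      are hypothetical; the real function parses log records and is far
      more involved.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy the live blocks of a circular log, oldest first, into `out`.
 * `tail` is the oldest live block and `head` is one past the newest.
 * Returns the number of blocks processed.
 */
size_t toy_recover(const int *blocks, size_t nblocks, size_t tail,
		   size_t head, int *out)
{
	size_t count = 0;
	size_t blk = tail;

	assert(tail < nblocks && head < nblocks);

	/* Wrapped case only: consume from the tail to the physical end. */
	if (tail > head) {
		for (; blk < nblocks; blk++)
			out[count++] = blocks[blk];
		blk = 0;
	}

	/* Shared path: a plain linear pass up to the head. */
	for (; blk < head; blk++)
		out[count++] = blocks[blk];

	return count;
}
```

      With five blocks, a tail of 3 and a head of 2 visit blocks 3, 4, 0,
      1 in that order, while a tail of 1 and a head of 4 visit 1, 2, 3
      through the shared loop alone.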
    • xfs: lseek: the "whence" argument is called "whence" · 59f9c004
      Eric Sandeen authored
      For some reason, the older commit:
      
          965c8e59 lseek: the "whence" argument is called "whence"
      
          lseek: the "whence" argument is called "whence"
      
          But the kernel decided to call it "origin" instead.
          Fix most of the sites.
      
      left out xfs.  So fix xfs.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      59f9c004
    • xfs: combine xfs_seek_hole & xfs_seek_data · 49c69591
      Eric Sandeen authored
      xfs_seek_hole & xfs_seek_data are remarkably similar;
      so much so that they can be combined, saving a fair
      bit of semi-complex code duplication.
      
      The following patch passes generic/285 and generic/286,
      which specifically test seek behavior.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      49c69591
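      To see why the two seek routines collapse into one, consider a
      simplified model in which a file is a sorted list of data ranges and
      everything else is a hole. The function below is a sketch under that
      assumption; toy_seek(), struct toy_range and the TOY_SEEK_* values
      are hypothetical and do not reflect the XFS implementation.

```c
#include <assert.h>
#include <stddef.h>

enum toy_whence { TOY_SEEK_DATA, TOY_SEEK_HOLE };

/* A data range; ranges are sorted, non-overlapping and end at or before EOF. */
struct toy_range {
	long start;
	long len;
};

/*
 * Find the next data or hole offset at or after `start`.
 * Returns -1 if `start` is outside the file or no data follows it.
 * As with lseek(2), the end of the file counts as a hole.
 */
long toy_seek(const struct toy_range *data, size_t n, long eof, long start,
	      enum toy_whence whence)
{
	long pos = start;
	size_t i;

	if (start < 0 || start >= eof)
		return -1;

	for (i = 0; i < n; i++) {
		long end = data[i].start + data[i].len;

		if (end <= pos)
			continue;		/* range lies entirely before pos */
		if (data[i].start > pos)	/* pos sits in a hole */
			return whence == TOY_SEEK_HOLE ? pos : data[i].start;
		if (whence == TOY_SEEK_DATA)	/* pos sits inside data */
			return pos;
		pos = end;			/* skip the data, keep looking */
	}

	/* Past the last data range: only a hole (or EOF) remains. */
	if (whence == TOY_SEEK_HOLE)
		return pos < eof ? pos : eof;
	return -1;
}
```

      The only difference between the two modes is which side of a
      data/hole boundary is returned, which is why a single loop with a
      mode argument suffices in this model.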
    • xfs: export log_recovery_delay to delay mount time log recovery · 2e227178
      Brian Foster authored
      XFS log recovery has been discovered to have race conditions with
      buffers when I/O errors occur. External tools are available to simulate
      I/O errors to XFS, but this alone is not sufficient for testing log
      recovery. XFS unconditionally resets the inactive region of the log
      prior to log recovery to avoid confusion over processing any partially
      written log records that might have been written before an unclean
      shutdown. Therefore, unconditional write I/O failures at mount time are
      caught by the reset sequence rather than log recovery and hinder the
      ability to test the latter.
      
      The device-mapper dm-flakey module uses an up/down timer to define a
      cycle for when to fail I/Os. Create a pre log recovery delay tunable
      that can be used to coordinate XFS log recovery with I/O errors
      simulated by dm-flakey. This facilitates coordination in userspace that
      allows the reset of stale log blocks to succeed and writes due to log
      recovery to fail. For example, define a dm-flakey instance with an
      uptime long enough to allow log reset to succeed and a log recovery
      delay long enough to allow the dm-flakey uptime to expire.
      
      The 'log_recovery_delay' sysfs tunable is exported under
      /sys/fs/xfs/debug and is only enabled for kernels compiled in XFS debug
      mode. The value is exported in units of seconds and allows for a delay
      of up to 60 seconds. Note that this is for XFS debug and test
      instrumentation purposes only and should not be used by applications. No
      delay is enabled by default.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      2e227178
    • xfs: add debug sysfs attribute set · 65b65735
      Brian Foster authored
      Create a top-level debug directory for global debug sysfs attributes.
      This directory is added and removed on XFS module initialization and
      removal respectively for DEBUG mode kernels only. It typically resides
      at /sys/fs/xfs/debug. It is located at the top level of the xfs sysfs
      hierarchy as attributes might define global behavior or behavior that
      must be configured before an xfs mount is available (e.g., log recovery
      behavior).
      
      Define the global debug kobject that represents the debug sysfs
      directory and add generic attribute show/store helpers to support future
      attributes. No debug attributes are exported as of yet.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      65b65735
    • xfs: add a few more verifier tests · e1b05723
      Eric Sandeen authored
      These were exposed by fsfuzzer runs; without them we fail
      in various exciting and sometimes convoluted ways when we
      encounter disk corruption.
      
      Without the MAXLEVELS tests we tend to walk off the end of
      an array in a loop like this:
      
              for (i = 0; i < cur->bc_nlevels; i++) {
                      if (cur->bc_bufs[i])
      
      Without the dirblklog test we try to allocate more memory
      than we could possibly hope for and loop forever:
      
      xfs_dabuf_map()
      	nfsb = mp->m_dir_geo->fsbcount;
      	irecs = kmem_zalloc(sizeof(irec) * nfsb, KM_SLEEP...
      
      As for the logbsize check, that's the convoluted one.
      
      If logbsize is specified at mount time, it's sanitized
      in xfs_parseargs; in particular it makes sure that it's
      not > XLOG_MAX_RECORD_BSIZE.
      
      If not specified at mount time, it comes from the superblock
      via sb_logsunit; this is limited to 256k at mkfs time as well;
      it's copied into m_logbsize in xfs_finish_flags().
      
      However, if for some reason the on-disk value is corrupt and
      too large, nothing catches it.  It's a circuitous path, but
      that size eventually finds its way to places that make the kernel
      very unhappy, leading to oopses in xlog_pack_data() because we
      use the size as an index into iclog->ic_data, but the array
      is not necessarily that big.
      
      Anyway - bounds checking when we read from disk is a good thing!
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      e1b05723
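      The general pattern behind these verifier additions is to reject
      out-of-range on-disk values at read time, before they are used as
      loop bounds or array indexes. The sketch below is an illustration
      only: struct toy_sb, TOY_MAXLEVELS and toy_sb_verify() are
      hypothetical, the level limit is an arbitrary example value, and
      only the 256k log size limit is taken from the commit text.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAXLEVELS		8		/* assumed example limit */
#define TOY_MAX_LOG_BSIZE	(256u * 1024u)	/* 256k, per the text above */

/* Hypothetical subset of values read from disk. */
struct toy_sb {
	uint32_t levels;	/* btree height, later used as a loop bound */
	uint32_t logsunit;	/* log stripe unit, later used as a size */
};

/* Return 0 if every field is within bounds, -1 if the data looks corrupt. */
int toy_sb_verify(const struct toy_sb *sb)
{
	if (sb->levels == 0 || sb->levels > TOY_MAXLEVELS)
		return -1;
	if (sb->logsunit > TOY_MAX_LOG_BSIZE)
		return -1;
	return 0;
}
```

      Once a structure has passed such a check, code that iterates up to
      `levels` or indexes by the log size can no longer run off the end of
      a fixed-size array because of a corrupt value.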
    • xfs: mark all internal workqueues as freezable · 8018ec08
      Brian Foster authored
      Workqueues must be explicitly set as freezable to ensure they are frozen
      in the associated part of the hibernation/suspend sequence. Freezing of
      workqueues and kernel threads is important to ensure that modifications
      are not made on-disk after the hibernation image has been created.
      Otherwise, the in-memory state can become inconsistent with what is on
      disk and eventually lead to filesystem corruption. We have reports of
      free space btree corruptions that occur immediately after restore from
      hibernate that suggest the xfs-eofblocks workqueue could be causing
      such problems if it races with hibernation.
      
      Mark all of the internal XFS workqueues as freezable to ensure nothing
      changes on-disk once the freezer infrastructure freezes kernel threads
      and creates the hibernation image.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reported-by: Carlos E. R. <carlos.e.r@opensuse.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      8018ec08
  3. 02 Sep, 2014 7 commits
    • xfs: trim eofblocks before collapse range · 41b9d726
      Brian Foster authored
      xfs_collapse_file_space() currently writes back the entire file
      undergoing collapse range to settle things down for the extent shift
      algorithm. While this prevents changes to the extent list during the
      collapse operation, the writeback itself is not enough to prevent
      unnecessary collapse failures.
      
      The current shift algorithm uses the extent index to iterate the in-core
      extent list. If a post-eof delalloc extent persists after the writeback
      (e.g., a prior zero range op where the end of the range aligns with eof
      can separate the post-eof blocks such that they are not written back and
      converted), xfs_bmap_shift_extents() becomes confused over the encoded
      br_startblock value and fails the collapse.
      
      As with the full writeback, this is a temporary fix until the algorithm
      is improved to cope with a volatile extent list and avoid attempts to
      shift post-eof extents.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      41b9d726
    • xfs: xfs_file_collapse_range is delalloc challenged · 1669a8ca
      Dave Chinner authored
      If we have delalloc extents on a file before we run a collapse range
      operation, we sync the range that we are going to collapse to
      convert delalloc extents in that region to real extents to simplify
      the shift operation.
      
      However, the shift operation then assumes that the extent list is
      not going to change as it iterates over the extent list moving
      things about. Unfortunately, this isn't true because we can't hold
      the ILOCK over all the operations. We can prevent new IO from
      modifying the extent list by holding the IOLOCK, but that doesn't
      prevent writeback from running....
      
      And when writeback runs, it can convert delalloc extents in the
      range of the file prior to the region being collapsed, and this
      changes the indexes of all the extents in the file. That causes the
      collapse range operation to Go Bad.
      
      The right fix is to rewrite the extent shift operation not to be
      dependent on the extent list not changing across the entire
      operation, but this is a fairly significant piece of work to do.
      Hence, as a short-term workaround for the problem, sync the entire
      file before starting a collapse operation to remove all delalloc
      ranges from the file and so avoid the problem of concurrent
      writeback changing the extent list.
      Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      
      1669a8ca
    • xfs: don't log inode unless extent shift makes extent modifications · ca446d88
      Brian Foster authored
      The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
      all subsequent extents down into the specified, previously punched out,
      region. This function performs some validation, such as whether a
      sufficient hole exists in the target region of the collapse, then shifts
      the remaining extents downward.
      
      The exit path of the function currently logs the inode unconditionally.
      While we must log the inode (and abort) if an error occurs and the
      transaction is dirty, the initial validation paths can generate errors
      before the transaction has been dirtied. This creates an unnecessary
      filesystem shutdown scenario, as the caller will cancel a transaction
      that has been marked dirty.
      
      Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
      are made to the inode bmap. Only log the inode in the exit path if
      logflags has been set. This ensures we only have to cancel a dirty
      transaction if modifications have been made and prevents an unnecessary
      filesystem shutdown otherwise.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      ca446d88
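      The "accumulate logflags, log only if something changed" pattern can
      be reduced to a few lines. Everything named here (struct toy_trans,
      the TOY_LOG_* bits, toy_shift()) is a hypothetical stand-in; the
      point is only that an early validation failure leaves the
      transaction clean.

```c
#include <assert.h>

#define TOY_LOG_CORE	0x1	/* hypothetical "inode core changed" bit */
#define TOY_LOG_EXT	0x2	/* hypothetical "extent list changed" bit */

/* Hypothetical transaction: only tracks whether it has been dirtied. */
struct toy_trans {
	int dirty;
};

/*
 * Validate first, modify second. logflags stays 0 until a modification is
 * actually made, so a validation error returns with the transaction clean.
 */
int toy_shift(struct toy_trans *tp, int hole_is_big_enough)
{
	int logflags = 0;
	int error = 0;

	if (!hole_is_big_enough) {
		error = -1;		/* nothing modified yet */
		goto out;
	}

	/* ... the extent shift would happen here ... */
	logflags |= TOY_LOG_CORE | TOY_LOG_EXT;
out:
	if (logflags)
		tp->dirty = 1;		/* stands in for logging the inode */
	return error;
}
```

      A caller that cancels a clean transaction after a validation error
      has nothing to undo, which is what avoids the shutdown in the
      scenario described above.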
    • xfs: use ranged writeback and invalidation for direct IO · 7d4ea3ce
      Dave Chinner authored
      Now we are not doing silly things with dirtying buffers beyond EOF
      and using invalidation correctly, we can finally reduce the ranges of
      writeback and invalidation used by direct IO to match that of the IO
      being issued.
      
      Bring the writeback and invalidation ranges back to match the
      generic direct IO code - this will greatly reduce the perturbation
      of cached data when direct IO and buffered IO are mixed, but still
      provide the same buffered vs direct IO coherency behaviour we
      currently have.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      7d4ea3ce
    • xfs: don't zero partial page cache pages during O_DIRECT writes · 834ffca6
      Dave Chinner authored
      Similar to direct IO reads, direct IO writes are using 
      truncate_pagecache_range to invalidate the page cache. This is
      incorrect due to the sub-block zeroing in the page cache that
      truncate_pagecache_range() triggers.
      
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      
      cc: stable@vger.kernel.org
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      834ffca6
    • xfs: don't zero partial page cache pages during O_DIRECT writes · 85e584da
      Chris Mason authored
      xfs is using truncate_pagecache_range to invalidate the page cache
      during DIO reads.  This is different from the other filesystems, which
      only invalidate pages during DIO writes.
      
      truncate_pagecache_range is meant to be used when we are freeing the
      underlying data structs from disk, so it will zero any partial
      ranges in the page.  This means a DIO read can zero out part of the
      page cache page, and it is possible the page will stay in cache.
      
      Buffered reads will then find an up-to-date page with zeros instead of
      the data actually on disk.
      
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      
      [dchinner: catch error and warn if it fails. Comment.]
      
      cc: stable@vger.kernel.org
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      85e584da
    • xfs: don't dirty buffers beyond EOF · 22e757a4
      Dave Chinner authored
      generic/263 is failing fsx at this point with a page spanning
      EOF that cannot be invalidated. The operations are:
      
      1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
      1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
      1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
      
      where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
      write attempts to invalidate the cached page over this range, it
      fails with -EBUSY and so any attempt to do page invalidation fails.
      
      The real question is this: Why can't that page be invalidated after
      it has been written to disk and cleaned?
      
      Well, there's data on the first two buffers in the page (1k block
      size, 4k page), but the third buffer on the page (i.e. beyond EOF)
      is failing drop_buffers because its bh->b_state == 0x3, which is
      BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
      what?
      
      OK, set_buffer_dirty() is called on all buffers from
      __set_page_dirty_buffers(), regardless of whether the buffer is
      beyond EOF or not, which means that when we get to ->writepage,
      we have buffers marked dirty beyond EOF that we need to clean.
      So, we need to implement our own .set_page_dirty method that
      doesn't dirty buffers beyond EOF.
      
      This is messy because the buffer code is not meant to be shared
      and it has interesting locking issues on the buffer dirty bits.
      So just copy and paste it and then modify it to suit what we need.
      
      Note: the solution the other filesystems and generic block code use,
      marking the buffers clean in ->writepage, does not work for XFS.
      It still leaves dirty buffers beyond EOF and invalidations still
      fail. Hence rather than play whack-a-mole, this patch simply
      prevents those buffers from being dirtied in the first place.
      
      cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      22e757a4
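      The rule this commit introduces, "dirty only the buffers that start
      before EOF", can be modelled in user space with the same geometry as
      the failing test (1k blocks on a 4k page). struct toy_page and
      toy_set_page_dirty() are hypothetical; the real method works on
      buffer heads under the appropriate locks.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE		4096UL
#define TOY_BLOCK_SIZE		1024UL
#define TOY_BUFS_PER_PAGE	(TOY_PAGE_SIZE / TOY_BLOCK_SIZE)

/* Hypothetical page: a file offset plus one dirty flag per block buffer. */
struct toy_page {
	unsigned long offset;
	int dirty[TOY_BUFS_PER_PAGE];
};

/*
 * Mark dirty only the buffers that begin before `eof` (the file size).
 * Buffers lying wholly beyond EOF are left clean.
 * Returns the number of buffers dirtied.
 */
size_t toy_set_page_dirty(struct toy_page *page, unsigned long eof)
{
	size_t i, n = 0;

	for (i = 0; i < TOY_BUFS_PER_PAGE; i++) {
		unsigned long start = page->offset + i * TOY_BLOCK_SIZE;

		if (start >= eof)
			break;
		page->dirty[i] = 1;
		n++;
	}
	return n;
}
```

      For the page at 0x5e000 with the last byte of the file at 0x5e569,
      the first two buffers are dirtied and the third and fourth stay
      clean, so in this model nothing beyond EOF is left dirty to block a
      later invalidation.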
  4. 25 Aug, 2014 5 commits
    • Linux 3.17-rc2 · 52addcf9
      Linus Torvalds authored
      52addcf9
    • Merge tag 'nfs-for-3.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · f01bfc97
      Linus Torvalds authored
      Pull NFS client fixes from Trond Myklebust:
       "Highlights:
      
         - more fixes for read/write codepath regressions
           * sleeping while holding the inode lock
           * stricter enforcement of page contiguity when coalescing requests
           * fix up error handling in the page coalescing code
      
         - don't busy wait on SIGKILL in the file locking code"
      
      * tag 'nfs-for-3.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        nfs: Don't busy-wait on SIGKILL in __nfs_iocounter_wait
        nfs: can_coalesce_requests must enforce contiguity
        nfs: disallow duplicate pages in pgio page vectors
        nfs: don't sleep with inode lock in lock_and_join_requests
        nfs: fix error handling in lock_and_join_requests
        nfs: use blocking page_group_lock in add_request
        nfs: fix nonblocking calls to nfs_page_group_lock
        nfs: change nfs_page_group_lock argument
      f01bfc97
    • Merge tag 'renesas-sh-drivers-for-v3.17' of... · dd5957b7
      Linus Torvalds authored
      Merge tag 'renesas-sh-drivers-for-v3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas
      
      Pull SH driver fix from Simon Horman:
       "Confine SH_INTC to platforms that need it"
      
      * tag 'renesas-sh-drivers-for-v3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas:
        sh: intc: Confine SH_INTC to platforms that need it
      dd5957b7
    • Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · 497c01dd
      Linus Torvalds authored
      Pull MIPS fixes from Ralf Baechle:
       "Pretty much all across the field so with this we should be in
        reasonable shape for the upcoming -rc2"
      
      * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
        MIPS: OCTEON: make get_system_type() thread-safe
        MIPS: CPS: Initialize EVA before bringing up VPEs from secondary cores
        MIPS: Malta: EVA: Rename 'eva_entry' to 'platform_eva_init'
        MIPS: EVA: Add new EVA header
        MIPS: scall64-o32: Fix indirect syscall detection
        MIPS: syscall: Fix AUDIT value for O32 processes on MIPS64
        MIPS: Loongson: Fix COP2 usage for preemptible kernel
        MIPS: NL: Fix nlm_xlp_defconfig build error
        MIPS: Remove race window in page fault handling
        MIPS: Malta: Improve system memory detection for '{e, }memsize' >= 2G
        MIPS: Alchemy: Fix db1200 PSC clock enablement
        MIPS: BCM47XX: Fix reboot problem on BCM4705/BCM4785
        MIPS: Remove duplicated include from numa.c
        MIPS: Add common plat_irq_dispatch declaration
        MIPS: MSP71xx: remove unused plat_irq_dispatch() argument
        MIPS: GIC: Remove useless parens from GICBIS().
        MIPS: perf: Mark pmu interupt IRQF_NO_THREAD
      497c01dd
    • Merge tag 'trace-fixes-v3.17-rc1' of... · 01e9982a
      Linus Torvalds authored
      Merge tag 'trace-fixes-v3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
      
      Pull fix for ftrace function tracer/profiler conflict from Steven Rostedt:
       "The rewrite of the ftrace code that makes it possible to allow for
        separate trampolines had a design flaw with the interaction between
        the function and function_graph tracers.
      
        The main flaw was the simplification of the use of multiple tracers
        having the same filter (like function and function_graph, that use the
        set_ftrace_filter file to filter their code).  The design assumed that
        the two tracers could never run simultaneously as only one tracer can
        be used at a time.  The problem with this assumption was that the
        function profiler could be implemented on top of the function graph
        tracer, and the function profiler could run at the same time as the
        function tracer.  This caused the assumption to be broken and when
        ftrace detected this failed assumption it would spit out a nasty
        warning and shut itself down.
      
        Instead of using a single ftrace_ops that switches between the
        function and function_graph callbacks, the two tracers can again use
        their own ftrace_ops.  But instead of having a complex hierarchy of
        ftrace_ops, the filter fields are placed in its own structure and the
        ftrace_ops can carefully use the same filter.  This change took a bit
        to be able to allow for this and currently only the global_ops can
        share the same filter, but this new design can easily be modified to
        allow for any ftrace_ops to share its filter with another ftrace_ops.
      
        The first four patches deal with the change of allowing the ftrace_ops
        to share the filter (and this needs to go to 3.16 as well).
      
        The fifth patch fixes a bug that was also caused by the new changes
        but only for archs other than x86, and only if those archs implement a
        direct call to the function_graph tracer which they do not do yet but
        will in the future.  It does not need to go to stable, but needs to be
        fixed before the other archs update their code to allow direct calls
        to the function_graph trampoline"
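
The filter-sharing idea described above can be sketched in userspace C. This is a minimal model, not the kernel's code: `filter_model`, `ops_model`, and the `ops_*` helpers are hypothetical names invented for illustration, and the model assumes that an empty filter means "trace everything". Each ops owns a local filter but reaches it through a pointer, so two ops can be made to consult the same filter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_FILTER 8

/* Hypothetical stand-in for the separate filter structure. */
struct filter_model {
    const char *names[MAX_FILTER];
    size_t count;
};

/* Hypothetical stand-in for an ops that may borrow another ops' filter. */
struct ops_model {
    struct filter_model local_filter; /* filter owned by this ops */
    struct filter_model *filter;      /* filter actually consulted */
};

static void ops_init(struct ops_model *ops)
{
    ops->local_filter.count = 0;
    ops->filter = &ops->local_filter;
}

/* Make ops consult the same filter that owner consults. */
static void ops_share_filter(struct ops_model *ops, struct ops_model *owner)
{
    assert(owner->filter != NULL);
    ops->filter = owner->filter;
}

/* Add a function name to whichever filter this ops consults. */
static bool ops_filter_add(struct ops_model *ops, const char *name)
{
    struct filter_model *f = ops->filter;

    if (f->count == MAX_FILTER)
        return false;
    f->names[f->count++] = name;
    return true;
}

/* Model assumption: an empty filter traces every function. */
static bool ops_traces(const struct ops_model *ops, const char *name)
{
    const struct filter_model *f = ops->filter;
    size_t i;

    if (f->count == 0)
        return true;
    for (i = 0; i < f->count; i++)
        if (strcmp(f->names[i], name) == 0)
            return true;
    return false;
}
```

In this model, after `ops_share_filter(&graph_ops, &func_ops)`, a name added through either ops is seen by both, which is the property the pull message relies on when the two tracers run at the same time.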
      
      * tag 'trace-fixes-v3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        ftrace: Use current addr when converting to nop in __ftrace_replace_code()
        ftrace: Fix function_profiler and function tracer together
        ftrace: Fix up trampoline accounting with looping on hash ops
        ftrace: Update all ftrace_ops for a ftrace_hash_ops update
        ftrace: Allow ftrace_ops to use the hashes from other ops
      01e9982a
  5. 24 Aug, 2014 9 commits
    • Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7be141d0
      Linus Torvalds authored
      Pull x86 fixes from Ingo Molnar:
       "A couple of EFI fixes, plus misc fixes all around the map"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi/arm64: Store Runtime Services revision
        firmware: Do not use WARN_ON(!spin_is_locked())
        x86_32, entry: Clean up sysenter_badsys declaration
        x86/doc: Fix the 'tlb_single_page_flush_ceiling' sysconfig path
        x86/mm: Fix sparse 'tlb_single_page_flush_ceiling' warning and make the variable read-mostly
        x86/mm: Fix RCU splat from new TLB tracepoints
      7be141d0
    • Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 44744bb3
      Linus Torvalds authored
      Pull perf fixes from Ingo Molnar:
       "A kprobes and a perf compat ioctl fix"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf: Handle compat ioctl
        kprobes: Skip kretprobe hit in NMI context to avoid deadlock
      44744bb3
    • Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 959dc258
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
       "A collection of fixes from this week, it's been pretty quiet and
        nothing really stands out as particularly noteworthy here -- mostly
        minor fixes across the field:
      
         - ODROID booting was fixed due to PMIC interrupts missing in DT
         - a collection of i.MX fixes
         - minor Tegra fix for regulators
         - Rockchip fix and addition of SoC-specific mailing list to make it
           easier to find posted patches"
      
      * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        bus: arm-ccn: Fix warning message
        ARM: shmobile: koelsch: Remove non-existent i2c6 pinmux
        ARM: tegra: apalis/colibri t30: fix on-module 5v0 supplies
        MAINTAINERS: add new Rockchip SoC list
        ARM: dts: rockchip: readd missing mmc0 pinctrl settings
        ARM: dts: ODROID i2c improvements
        ARM: dts: Enable PMIC interrupts on ODROID
        ARM: dts: imx6sx: fix the pad setting for uart CTS_B
        ARM: dts: i.MX53: fix apparent bug in VPU clks
        ARM: imx: correct gpu2d_axi and gpu3d_axi clock setting
        ARM: dts: imx6: edmqmx6: change enet reset pin
        ARM: dts: vf610-twr: Fix pinctrl_esdhc1 pin definitions.
        ARM: imx: remove unnecessary ARCH_HAS_OPP select
        ARM: imx: fix TLB missing of IOMUXC base address during suspend
        ARM: imx6: fix SMP compilation again
        ARM: dt: sun6i: Add #address-cells and #size-cells to i2c controller nodes
      959dc258
    • Merge tag 'gpio-v3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio · fa7f78e0
      Linus Torvalds authored
      Pull gpio fixes from Linus Walleij:
      
       - a largeish fix for the IRQ handling in the new Zynq driver.  The
         quite verbose commit message gives the exact details.
       - move some defines for gpiod flags outside an ifdef to make stub
         functions work again.
       - various minor fixes that we can accept for -rc1.
      
      * tag 'gpio-v3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
        gpio-lynxpoint: enable input sensing in resume
        gpio: move GPIOD flags outside #ifdef
        gpio: delete unneeded test before of_node_put
        gpio: zynq: Fix IRQ handlers
        gpiolib: devres: use correct structure type name in sizeof
        MAINTAINERS: Change maintainer for gpio-bcm-kona.c
      fa7f78e0
    • Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux · 5e30ca1e
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Intel and radeon fixes.
      
        Post KS/LC git requests from i915 and radeon stacked up.  They are all
        fixes along with some new pci ids for radeon, and one maintainers file
        entry.
      
         - i915: display fixes and irq fixes
         - radeon: pci ids, and misc gpuvm, dpm and hdp cache"
      
      * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (29 commits)
        MAINTAINERS: Add entry for Renesas DRM drivers
        drm/radeon: add additional SI pci ids
        drm/radeon: add new bonaire pci ids
        drm/radeon: add new KV pci id
        Revert "drm/radeon: Use write-combined CPU mappings of ring buffers with PCIe"
        drm/radeon: fix active_cu mask on SI and CIK after re-init (v3)
        drm/radeon: fix active cu count for SI and CIK
        drm/radeon: re-enable selective GPUVM flushing
        drm/radeon: Sync ME and PFP after CP semaphore waits v4
        drm/radeon: fix display handling in radeon_gpu_reset
        drm/radeon: fix pm handling in radeon_gpu_reset
        drm/radeon: Only flush HDP cache for indirect buffers from userspace
        drm/radeon: properly document reloc priority mask
        drm/i915: don't try to retrain a DP link on an inactive CRTC
        drm/i915: make sure VDD is turned off during system suspend
        drm/i915: cancel hotplug and dig_port work during suspend and unload
        drm/i915: fix HPD IRQ reenable work cancelation
        drm/i915: take display port power domain in DP HPD handler
        drm/i915: Don't try to enable cursor from setplane when crtc is disabled
        drm/i915: Skip load detect when intel_crtc->new_enable==true
        ...
      5e30ca1e
    • aio: fix reqs_available handling · d856f32a
      Benjamin LaHaise authored
      As reported by Dan Aloni, commit f8567a38 ("aio: fix aio request
      leak when events are reaped by userspace") introduces a regression when
      user code attempts to perform io_submit() with more events than are
      available in the ring buffer.  Reverting that commit would reintroduce a
      regression when user space event reaping is used.
      
      Fixing this bug is a bit more involved than the previous attempts to fix
      this regression.  Since we do not have a single point at which we can
      count events as being reaped by user space and io_getevents(), we have
      to track event completion by looking at the number of events left in the
      event ring.  So long as there are as many events in the ring buffer as
      there have been completion events generated, we cannot call
      put_reqs_available().  The code to check for this is now placed in
      refill_reqs_available().
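
The accounting rule described above can be sketched as a small userspace model. It is an illustration under stated assumptions, not the kernel function: `ring_model`, `events_in_ring()`, and `refill_model()` are hypothetical names, `tail` is assumed to be smaller than the ring size, and crediting slots back is modeled as a plain counter increment. Only completions beyond the number of events still sitting in the ring are returned to the pool.

```c
#include <assert.h>

/* Hypothetical model of the ring state; not the kernel's struct kioctx. */
struct ring_model {
    unsigned nr_events;        /* ring capacity */
    unsigned completed_events; /* completions not yet credited back */
    unsigned reqs_available;   /* request slots free for submission */
};

/* Count events between head and tail, allowing for wrap-around. */
static unsigned events_in_ring(const struct ring_model *r,
                               unsigned head, unsigned tail)
{
    assert(r->nr_events > 0);
    head %= r->nr_events; /* clamp: user space may have written head */
    if (head <= tail)
        return tail - head;
    return r->nr_events - (head - tail);
}

/*
 * Credit back only the completions that are no longer in the ring.
 * Returns the number of slots returned to reqs_available.
 */
static unsigned refill_model(struct ring_model *r,
                             unsigned head, unsigned tail)
{
    unsigned in_ring = events_in_ring(r, head, tail);
    unsigned reclaim = 0;

    if (in_ring < r->completed_events)
        reclaim = r->completed_events - in_ring;

    r->completed_events -= reclaim;
    r->reqs_available += reclaim;
    return reclaim;
}
```

With a ring of 8 slots, 5 uncredited completions, and 2 events still in the ring, the model returns 3 slots; calling it again with the same head and tail returns none, because the remaining 2 completions are still unreaped.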
      
      A test program from Dan and modified by me for verifying this bug is available
      at http://www.kvack.org/~bcrl/20140824-aio_bug.c .
      Reported-by: Dan Aloni <dan@kernelim.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Acked-by: Dan Aloni <dan@kernelim.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: stable@vger.kernel.org      # v3.16 and anything that f8567a38 was backported to
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d856f32a
    • bus: arm-ccn: Fix warning message · bf87bb12
      Pawel Moll authored
      A message warning a user about wrong vc value was printing
      out port instead.
      Reported-by: Drew Richardson <drew.richardson@arm.com>
      Signed-off-by: Pawel Moll <pawel.moll@arm.com>
      Signed-off-by: Olof Johansson <olof@lixom.net>
      bf87bb12
    • ARM: shmobile: koelsch: Remove non-existent i2c6 pinmux · 12266db7
      Geert Uytterhoeven authored
      On r8a7791, i2c6 (aka iic3) doesn't need pinmux, but the koelsch dts
      refers to non-existent pinmux configuration data:
      
      pinmux core: sh-pfc does not support function i2c6
      sh-pfc e6060000.pfc: invalid function i2c6 in map table
      
      Remove it to fix this.
      
      Fixes: commit 1d41f36a ("ARM: shmobile:
             koelsch dts: Add VDD MPU regulator for DVFS")
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
      Signed-off-by: Olof Johansson <olof@lixom.net>
      12266db7
    • ARM: tegra: apalis/colibri t30: fix on-module 5v0 supplies · caa9eac5
      Marcel Ziswiler authored
      Working on Gigabit/PCIe support in U-Boot for Apalis T30 I realised
      that the current device tree source includes for our modules only
      happen to work due to referencing the on-carrier 5v0 supply from USB
      which is not at all available on-module. The modules actually contain
      TPS60150 charge pumps to generate the PMIC required 5 volts from the
      one and only 3.3 volt module supply. This patch fixes this.
      
      (Note: When back-porting this to v3.16 stable releases, simply drop the
      change to tegra30-apalis.dtsi; that file was added in v3.17)
      
      Cc: <stable@vger.kernel.org> #v3.16+
      Signed-off-by: Marcel Ziswiler <marcel@ziswiler.com>
      Signed-off-by: Stephen Warren <swarren@nvidia.com>
      Signed-off-by: Olof Johansson <olof@lixom.net>
      caa9eac5