1. 04 Dec, 2018 5 commits
    • Darrick J. Wong's avatar
      iomap: partially revert 4721a601 (simulated directio short read on EFAULT) · 8f67b5ad
      Darrick J. Wong authored
      In commit 4721a601, we tried to fix a problem wherein directio reads
      into a splice pipe will bounce EFAULT/EAGAIN all the way out to
      userspace by simulating a zero-byte short read.  This happens because
      some directio read implementations (xfs) will call
      bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
      reads, but as soon as we run out of pipe buffers that _get_pages call
      returns EFAULT, which the splice code translates to EAGAIN and bounces
      out to userspace.
      
      In that commit, the iomap code catches the EFAULT and simulates a
      zero-byte read, but that causes assertion errors on regular splice reads
      because xfs doesn't allow short directio reads.  This causes infinite
      splice() loops and assertion failures on generic/095 on overlayfs
      because xfs only permit total success or total failure of a directio
      operation.  The underlying issue in the pipe splice code has now been
      fixed by changing the pipe splice loop to avoid avoid reading more data
      than there is space in the pipe.
      
      Therefore, it's no longer necessary to simulate the short directio, so
      remove the hack from iomap.
      
      Fixes: 4721a601 ("iomap: dio data corruption and spurious errors when pipes fill")
      Reported-by: default avatarMurphy Zhou <jencce.kernel@gmail.com>
      Ranted-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      8f67b5ad
    • Darrick J. Wong's avatar
      splice: don't read more than available pipe space · 17614445
      Darrick J. Wong authored
      In commit 4721a601, we tried to fix a problem wherein directio reads
      into a splice pipe will bounce EFAULT/EAGAIN all the way out to
      userspace by simulating a zero-byte short read.  This happens because
      some directio read implementations (xfs) will call
      bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
      reads, but as soon as we run out of pipe buffers that _get_pages call
      returns EFAULT, which the splice code translates to EAGAIN and bounces
      out to userspace.
      
      In that commit, the iomap code catches the EFAULT and simulates a
      zero-byte read, but that causes assertion errors on regular splice reads
      because xfs doesn't allow short directio reads.
      
      The brokenness is compounded by splice_direct_to_actor immediately
      bailing on do_splice_to returning <= 0 without ever calling ->actor
      (which empties out the pipe), so if userspace calls back we'll EFAULT
      again on the full pipe, and nothing ever gets copied.
      
      Therefore, teach splice_direct_to_actor to clamp its requests to the
      amount of free space in the pipe and remove the simulated short read
      from the iomap directio code.
      
      Fixes: 4721a601 ("iomap: dio data corruption and spurious errors when pipes fill")
      Reported-by: default avatarMurphy Zhou <jencce.kernel@gmail.com>
      Ranted-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      17614445
    • Darrick J. Wong's avatar
      vfs: allow some remap flags to be passed to vfs_clone_file_range · 6744557b
      Darrick J. Wong authored
      In overlayfs, ovl_remap_file_range calls vfs_clone_file_range on the
      lower filesystem's inode, passing through whatever remap flags it got
      from its caller.  Since vfs_copy_file_range first tries a filesystem's
      remap function with REMAP_FILE_CAN_SHORTEN, this can get passed through
      to the second vfs_copy_file_range call, and this isn't an issue.
      Change the WARN_ON to look only for the DEDUP flag.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      6744557b
    • Eric Sandeen's avatar
      xfs: fix inverted return from xfs_btree_sblock_verify_crc · 7d048df4
      Eric Sandeen authored
      xfs_btree_sblock_verify_crc is a bool so should not be returning
      a failaddr_t; worse, if xfs_log_check_lsn fails it returns
      __this_address which looks like a boolean true (i.e. success)
      to the caller.
      
      (interestingly xfs_btree_lblock_verify_crc doesn't have the issue)
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      7d048df4
    • Darrick J. Wong's avatar
      xfs: fix PAGE_MASK usage in xfs_free_file_space · a579121f
      Darrick J. Wong authored
      In commit e53c4b59, I *tried* to teach xfs to force writeback when we
      fzero/fpunch right up to EOF so that if EOF is in the middle of a page,
      the post-EOF part of the page gets zeroed before we return to userspace.
      Unfortunately, I missed the part where PAGE_MASK is ~(PAGE_SIZE - 1),
      which means that we totally fail to zero if we're fpunching and EOF is
      within the first page.  Worse yet, the same PAGE_MASK thinko plagues the
      filemap_write_and_wait_range call, so we'd initiate writeback of the
      entire file, which (mostly) masked the thinko.
      
      Drop the tricky PAGE_MASK and replace it with correct usage of PAGE_SIZE
      and the proper rounding macros.
      
      Fixes: e53c4b59 ("xfs: ensure post-EOF zeroing happens after zeroing part of a file")
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      a579121f
  2. 26 Nov, 2018 1 commit
  3. 21 Nov, 2018 7 commits
    • Dave Chinner's avatar
      iomap: readpages doesn't zero page tail beyond EOF · 8c110d43
      Dave Chinner authored
      When we read the EOF page of the file via readpages, we need
      to zero the region beyond EOF that we either do not read or
      should not contain data so that mmap does not expose stale data to
      user applications.
      
      However, iomap_adjust_read_range() fails to detect EOF correctly,
      and so fsx on 1k block size filesystems fails very quickly with
      mapreads exposing data beyond EOF. There are two problems here.
      
      Firstly, when calculating the end block of the EOF byte, we have
      to round the size by one to avoid a block aligned EOF from reporting
      a block too large. i.e. a size of 1024 bytes is 1 block, which in
      index terms is block 0. Therefore we have to calculate the end block
      from (isize - 1), not isize.
      
      The second bug is determining if the current page spans EOF, and so
      whether we need split it into two half, one for the IO, and the
      other for zeroing. Unfortunately, the code that checks whether
      we should split the block doesn't actually check if we span EOF, it
      just checks if the read spans the /offset in the page/ that EOF
      sits on. So it splits every read into two if EOF is not page
      aligned, regardless of whether we are reading the EOF block or not.
      
      Hence we need to restrict the "does the read span EOF" check to
      just the page that spans EOF, not every page we read.
      
      This patch results in correct EOF detection through readpages:
      
      xfs_vm_readpages:     dev 259:0 ino 0x43 nr_pages 24
      xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4
      iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864
      xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c
      iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864
      iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864
      
      As you can see, it now does full page reads until the last one which
      is split correctly at the block aligned EOF, reading 3072 bytes and
      zeroing the last 1024 bytes. The original version of the patch got
      this right, but it got another case wrong.
      
      The EOF detection crossing really needs to the the original length
      as plen, while it starts at the end of the block, will be shortened
      as up-to-date blocks are found on the page. This means "orig_pos +
      plen" no longer points to the end of the page, and so will not
      correctly detect EOF crossing. Hence we have to use the length
      passed in to detect this partial page case:
      
      xfs_filemap_fault:    dev 259:1 ino 0x43  write_fault 0
      xfs_vm_readpage:      dev 259:1 ino 0x43 nr_pages 1
      xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4
      iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296
      xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1
      iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296
      
      Heere we see a trace where the first block on the EOF page is up to
      date, hence poff = 1024 bytes. The offset into the page of EOF is
      3072, so the range we want to read is 1024 - 3071, and the range we
      want to zero is 3072 - 4095. You can see this is split correctly
      now.
      
      This fixes the stale data beyond EOF problem that fsx quickly
      uncovers on 1k block size filesystems.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      8c110d43
    • Dave Chinner's avatar
      vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP · 494633fa
      Dave Chinner authored
      It returns EINVAL when the operation is not supported by the
      filesystem. Fix it to return EOPNOTSUPP to be consistent with
      the man page and clone_file_range().
      
      Clean up the inconsistent error return handling while I'm there.
      (I know, lipstick on a pig, but every little bit helps...)
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      494633fa
    • Dave Chinner's avatar
      iomap: dio data corruption and spurious errors when pipes fill · 4721a601
      Dave Chinner authored
      When doing direct IO to a pipe for do_splice_direct(), then pipe is
      trivial to fill up and overflow as it can only hold 16 pages. At
      this point bio_iov_iter_get_pages() then returns -EFAULT, and we
      abort the IO submission process. Unfortunately, iomap_dio_rw()
      propagates the error back up the stack.
      
      The error is converted from the EFAULT to EAGAIN in
      generic_file_splice_read() to tell the splice layers that the pipe
      is full. do_splice_direct() completely fails to handle EAGAIN errors
      (it aborts on error) and returns EAGAIN to the caller.
      
      copy_file_write() then completely fails to handle EAGAIN as well,
      and so returns EAGAIN to userspace, having failed to copy the data
      it was asked to.
      
      Avoid this whole steaming pile of fail by having iomap_dio_rw()
      silently swallow EFAULT errors and so do short reads.
      
      To make matters worse, iomap_dio_actor() has a stale data exposure
      bug bio_iov_iter_get_pages() fails - it does not zero the tail block
      that it may have been left uncovered by partial IO. Fix the error
      handling case to drop to the sub-block zeroing rather than
      immmediately returning the -EFAULT error.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      4721a601
    • Dave Chinner's avatar
      iomap: sub-block dio needs to zeroout beyond EOF · b450672f
      Dave Chinner authored
      If we are doing sub-block dio that extends EOF, we need to zero
      the unused tail of the block to initialise the data in it it. If we
      do not zero the tail of the block, then an immediate mmap read of
      the EOF block will expose stale data beyond EOF to userspace. Found
      with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.
      
      Fix this by detecting if the end of the DIO write is beyond EOF
      and zeroing the tail if necessary.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      b450672f
    • Dave Chinner's avatar
      iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents · 0929d858
      Dave Chinner authored
      When we write into an unwritten extent via direct IO, we dirty
      metadata on IO completion to convert the unwritten extent to
      written. However, when we do the FUA optimisation checks, the inode
      may be clean and so we issue a FUA write into the unwritten extent.
      This means we then bypass the generic_write_sync() call after
      unwritten extent conversion has ben done and we don't force the
      modified metadata to stable storage.
      
      This violates O_DSYNC semantics. The window of exposure is a single
      IO, as the next DIO write will see the inode has dirty metadata and
      hence will not use the FUA optimisation. Calling
      generic_write_sync() after completion of the second IO will also
      sync the first write and it's metadata.
      
      Fix this by avoiding the FUA optimisation when writing to unwritten
      extents.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      0929d858
    • Dave Chinner's avatar
      xfs: delalloc -> unwritten COW fork allocation can go wrong · 9230a0b6
      Dave Chinner authored
      Long saga. There have been days spent following this through dead end
      after dead end in multi-GB event traces. This morning, after writing
      a trace-cmd wrapper that enabled me to be more selective about XFS
      trace points, I discovered that I could get just enough essential
      tracepoints enabled that there was a 50:50 chance the fsx config
      would fail at ~115k ops. If it didn't fail at op 115547, I stopped
      fsx at op 115548 anyway.
      
      That gave me two traces - one where the problem manifested, and one
      where it didn't. After refining the traces to have the necessary
      information, I found that in the failing case there was a real
      extent in the COW fork compared to an unwritten extent in the
      working case.
      
      Walking back through the two traces to the point where the CWO fork
      extents actually diverged, I found that the bad case had an extra
      unwritten extent in it. This is likely because the bug it led me to
      had triggered multiple times in those 115k ops, leaving stray
      COW extents around. What I saw was a COW delalloc conversion to an
      unwritten extent (as they should always be through
      xfs_iomap_write_allocate()) resulted in a /written extent/:
      
      xfs_writepage:        dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0
      xfs_iext_remove:      dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real
      xfs_bmap_pre_update:  dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real
      xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex
      
      Basically, Cow fork before:
      
      	0 1            32          52
      	+H+DDDDDDDDDDDD+UUUUUUUUUUU+
      	   PREV		RIGHT
      
      COW delalloc conversion allocates:
      
      	  1	       32
      	  +uuuuuuuuuuuu+
      	  NEW
      
      And the result according to the xfs_bmap_post_update trace was:
      
      	0 1            32          52
      	+H+wwwwwwwwwwwwwwwwwwwwwwww+
      	   PREV
      
      Which is clearly wrong - it should be a merged unwritten extent,
      not an unwritten extent.
      
      That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG
      case in xfs_bmap_add_extent_delay_real(), and sure enough, there's
      the bug.
      
      It takes the old delalloc extent (PREV) and adds the length of the
      RIGHT extent to it, takes the start block from NEW, removes the
      RIGHT extent and then updates PREV with the new extent.
      
      What it fails to do is update PREV.br_state. For delalloc, this is
      always XFS_EXT_NORM, while in this case we are converting the
      delayed allocation to unwritten, so it needs to be updated to
      XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so
      the resultant extent is always written.
      
      And that's the bug I've been chasing for a week - a bmap btree bug,
      not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced
      with the recent in core extent tree scalability enhancements.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      9230a0b6
    • Dave Chinner's avatar
      xfs: flush removing page cache in xfs_reflink_remap_prep · 2c307174
      Dave Chinner authored
      On a sub-page block size filesystem, fsx is failing with a data
      corruption after a series of operations involving copying a file
      with the destination offset beyond EOF of the destination of the file:
      
      8093(157 mod 256): TRUNCATE DOWN        from 0x7a120 to 0x50000 ******WWWW
      8094(158 mod 256): INSERT 0x25000 thru 0x25fff  (0x1000 bytes)
      8095(159 mod 256): COPY 0x18000 thru 0x1afff    (0x3000 bytes) to 0x2f400
      8096(160 mod 256): WRITE    0x5da00 thru 0x651ff        (0x7800 bytes) HOLE
      8097(161 mod 256): COPY 0x2000 thru 0x5fff      (0x4000 bytes) to 0x6fc00
      
      The second copy here is beyond EOF, and it is to sub-page (4k) but
      block aligned (1k) offset. The clone runs the EOF zeroing, landing
      in a pre-existing post-eof delalloc extent. This zeroes the post-eof
      extents in the page cache just fine, dirtying the pages correctly.
      
      The problem is that xfs_reflink_remap_prep() now truncates the page
      cache over the range that it is copying it to, and rounds that down
      to cover the entire start page. This removes the dirty page over the
      delalloc extent from the page cache without having written it back.
      Hence later, when the page cache is flushed, the page at offset
      0x6f000 has not been written back and hence exposes stale data,
      which fsx trips over less than 10 operations later.
      
      Fix this by changing xfs_reflink_remap_prep() to use
      xfs_flush_unmap_range().
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      2c307174
  4. 20 Nov, 2018 4 commits
    • Dave Chinner's avatar
      xfs: extent shifting doesn't fully invalidate page cache · 7f9f71be
      Dave Chinner authored
      The extent shifting code uses a flush and invalidate mechainsm prior
      to shifting extents around. This is similar to what
      xfs_free_file_space() does, but it doesn't take into account things
      like page cache vs block size differences, and it will fail if there
      is a page that it currently busy.
      
      xfs_flush_unmap_range() handles all of these cases, so just convert
      xfs_prepare_shift() to us that mechanism rather than having it's own
      special sauce.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      7f9f71be
    • Dave Chinner's avatar
      xfs: finobt AG reserves don't consider last AG can be a runt · c0876897
      Dave Chinner authored
      The last AG may be very small comapred to all other AGs, and hence
      AG reservations based on the superblock AG size may actually consume
      more space than the AG actually has. This results on assert failures
      like:
      
      XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
      [   48.932891]  xfs_ag_resv_init+0x1bd/0x1d0
      [   48.933853]  xfs_fs_reserve_ag_blocks+0x37/0xb0
      [   48.934939]  xfs_mountfs+0x5b3/0x920
      [   48.935804]  xfs_fs_fill_super+0x462/0x640
      [   48.936784]  ? xfs_test_remount_options+0x60/0x60
      [   48.937908]  mount_bdev+0x178/0x1b0
      [   48.938751]  mount_fs+0x36/0x170
      [   48.939533]  vfs_kern_mount.part.43+0x54/0x130
      [   48.940596]  do_mount+0x20e/0xcb0
      [   48.941396]  ? memdup_user+0x3e/0x70
      [   48.942249]  ksys_mount+0xba/0xd0
      [   48.943046]  __x64_sys_mount+0x21/0x30
      [   48.943953]  do_syscall_64+0x54/0x170
      [   48.944835]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Hence we need to ensure the finobt per-ag space reservations take
      into account the size of the last AG rather than treat it like all
      the other full size AGs.
      
      Note that both refcountbt and rmapbt already take the size of the AG
      into account via reading the AGF length directly.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      c0876897
    • Dave Chinner's avatar
      xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers · d43aaf16
      Dave Chinner authored
      When retrying a failed inode or dquot buffer,
      xfs_buf_resubmit_failed_buffers() clears all the failed flags from
      the inde/dquot log items. In doing so, it also drops all the
      reference counts on the buffer that the failed log items hold. This
      means it can drop all the active references on the buffer and hence
      free the buffer before it queues it for write again.
      
      Putting the buffer on the delwri queue takes a reference to the
      buffer (so that it hangs around until it has been written and
      completed), but this goes bang if the buffer has already been freed.
      
      Hence we need to add the buffer to the delwri queue before we remove
      the failed flags from the log items attached to the buffer to ensure
      it always remains referenced during the resubmit process.
      Reported-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      d43aaf16
    • Dave Chinner's avatar
      xfs: uncached buffer tracing needs to print bno · d61fa8cb
      Dave Chinner authored
      Useless:
      
      xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_unlock:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_submit:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_hold:         dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_iowait:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_iodone:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_iowait_done:  dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_rele:         dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      
      Useful:
      
      
      xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_unlock:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_submit:       dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_hold:         dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_iowait:       dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_iodone:       dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_iowait_done:  dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_rele:         dev 253:32 bno 0x200b5 nblks 0x1 ...
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      d61fa8cb
  5. 19 Nov, 2018 2 commits
    • Eric Biggers's avatar
      xfs: make xfs_file_remap_range() static · da034bcc
      Eric Biggers authored
      xfs_file_remap_range() is only used in fs/xfs/xfs_file.c, so make it
      static.
      
      This addresses a gcc warning when -Wmissing-prototypes is enabled.
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      da034bcc
    • Brian Foster's avatar
      xfs: fix shared extent data corruption due to missing cow reservation · 59e42931
      Brian Foster authored
      Page writeback indirectly handles shared extents via the existence
      of overlapping COW fork blocks. If COW fork blocks exist, writeback
      always performs the associated copy-on-write regardless if the
      underlying blocks are actually shared. If the blocks are shared,
      then overlapping COW fork blocks must always exist.
      
      fstests shared/010 reproduces a case where a buffered write occurs
      over a shared block without performing the requisite COW fork
      reservation.  This ultimately causes writeback to the shared extent
      and data corruption that is detected across md5 checks of the
      filesystem across a mount cycle.
      
      The problem occurs when a buffered write lands over a shared extent
      that crosses an extent size hint boundary and that also happens to
      have a partial COW reservation that doesn't cover the start and end
      blocks of the data fork extent.
      
      For example, a buffered write occurs across the file offset (in FSB
      units) range of [29, 57]. A shared extent exists at blocks [29, 35]
      and COW reservation already exists at blocks [32, 34]. After
      accommodating a COW extent size hint of 32 blocks and the existing
      reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
      blocks of reservation at offset 0 and returns with COW reservation
      across the range of [0, 34]. The associated data fork extent is
      still [29, 35], however, which isn't fully covered by the COW
      reservation.
      
      This leads to a buffered write at file offset 35 over a shared
      extent without associated COW reservation. Writeback eventually
      kicks in, performs an overwrite of the underlying shared block and
      causes the associated data corruption.
      
      Update xfs_reflink_reserve_cow() to accommodate the fact that a
      delalloc allocation request may not fully cover the extent in the
      data fork. Trim the data fork extent appropriately, just as is done
      for shared extent boundaries and/or existing COW reservations that
      happen to overlap the start of the data fork extent. This prevents
      shared/010 failures due to data corruption on reflink enabled
      filesystems.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      59e42931
  6. 06 Nov, 2018 3 commits
    • Dave Chinner's avatar
      xfs: fix overflow in xfs_attr3_leaf_verify · 837514f7
      Dave Chinner authored
      generic/070 on 64k block size filesystems is failing with a verifier
      corruption on writeback or an attribute leaf block:
      
      [   94.973083] XFS (pmem0): Metadata corruption detected at xfs_attr3_leaf_verify+0x246/0x260, xfs_attr3_leaf block 0x811480
      [   94.975623] XFS (pmem0): Unmount and run xfs_repair
      [   94.976720] XFS (pmem0): First 128 bytes of corrupted metadata buffer:
      [   94.978270] 000000004b2e7b45: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
      [   94.980268] 000000006b1db90b: 00 00 00 00 00 81 14 80 00 00 00 00 00 00 00 00  ................
      [   94.982251] 00000000433f2407: 22 7b 5c 82 2d 5c 47 4c bb 31 1c 37 fa a9 ce d6  "{\.-\GL.1.7....
      [   94.984157] 0000000010dc7dfb: 00 00 00 00 00 81 04 8a 00 0a 18 e8 dd 94 01 00  ................
      [   94.986215] 00000000d5a19229: 00 a0 dc f4 fe 98 01 68 f0 d8 07 e0 00 00 00 00  .......h........
      [   94.988171] 00000000521df36c: 0c 2d 32 e2 fe 20 01 00 0c 2d 58 65 fe 0c 01 00  .-2.. ...-Xe....
      [   94.990162] 000000008477ae06: 0c 2d 5b 66 fe 8c 01 00 0c 2d 71 35 fe 7c 01 00  .-[f.....-q5.|..
      [   94.992139] 00000000a4a6bca6: 0c 2d 72 37 fc d4 01 00 0c 2d d8 b8 f0 90 01 00  .-r7.....-......
      [   94.994789] XFS (pmem0): xfs_do_force_shutdown(0x8) called from line 1453 of file fs/xfs/xfs_buf.c. Return address = ffffffff815365f3
      
      This is failing this check:
      
                      end = ichdr.freemap[i].base + ichdr.freemap[i].size;
                      if (end < ichdr.freemap[i].base)
      >>>>>                   return __this_address;
                      if (end > mp->m_attr_geo->blksize)
                              return __this_address;
      
      And from the buffer output above, the freemap array is:
      
      	freemap[0].base = 0x00a0
      	freemap[0].size = 0xdcf4	end = 0xdd94
      	freemap[1].base = 0xfe98
      	freemap[1].size = 0x0168	end = 0x10000
      	freemap[2].base = 0xf0d8
      	freemap[2].size = 0x07e0	end = 0xf8b8
      
      These all look valid - the block size is 0x10000 and so from the
      last check in the above verifier fragment we know that the end
      of freemap[1] is valid. The problem is that end is declared as:
      
      	uint16_t	end;
      
      And (uint16_t)0x10000 = 0. So we have a verifier bug here, not a
      corruption. Fix the verifier to use uint32_t types for the check and
      hence avoid the overflow.
      
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=201577Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      837514f7
    • Darrick J. Wong's avatar
      xfs: print buffer offsets when dumping corrupt buffers · bdec055b
      Darrick J. Wong authored
      Use DUMP_PREFIX_OFFSET when printing hex dumps of corrupt buffers
      because modern Linux now prints a 32-bit hash of our 64-bit pointer when
      using DUMP_PREFIX_ADDRESS:
      
      00000000b4bb4297: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
      00000005ec77e26: 00 00 00 00 02 d0 5a 00 00 00 00 00 00 00 00 00  ......Z.........
      000000015938018: 21 98 e8 b4 fd de 4c 07 bc ea 3c e5 ae b4 7c 48  !.....L...<...|H
      
      This is totally worthless for a sequential dump since we probably only
      care about tracking the buffer offsets and afaik there's no way to
      recover the actual pointer from the hashed value.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      bdec055b
    • Christophe JAILLET's avatar
      xfs: Fix error code in 'xfs_ioc_getbmap()' · 132bf672
      Christophe JAILLET authored
      In this function, once 'buf' has been allocated, we unconditionally
      return 0.
      However, 'error' is set to some error codes in several error handling
      paths.
      Before commit 232b5194 ("xfs: simplify the xfs_getbmap interface")
      this was not an issue because all error paths were returning directly,
      but now that some cleanup at the end may be needed, we must propagate the
      error code.
      
      Fixes: 232b5194 ("xfs: simplify the xfs_getbmap interface")
      Signed-off-by: default avatarChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      132bf672
  7. 04 Nov, 2018 9 commits
    • Linus Torvalds's avatar
      Linux 4.20-rc1 · 65102238
      Linus Torvalds authored
      65102238
    • Linus Torvalds's avatar
      Merge tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs · 42bd06e9
      Linus Torvalds authored
      Pull UBIFS updates from Richard Weinberger:
      
       - Full filesystem authentication feature, UBIFS is now able to have the
         whole filesystem structure authenticated plus user data encrypted and
         authenticated.
      
       - Minor cleanups
      
      * tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs: (26 commits)
        ubifs: Remove unneeded semicolon
        Documentation: ubifs: Add authentication whitepaper
        ubifs: Enable authentication support
        ubifs: Do not update inode size in-place in authenticated mode
        ubifs: Add hashes and HMACs to default filesystem
        ubifs: authentication: Authenticate super block node
        ubifs: Create hash for default LPT
        ubfis: authentication: Authenticate master node
        ubifs: authentication: Authenticate LPT
        ubifs: Authenticate replayed journal
        ubifs: Add auth nodes to garbage collector journal head
        ubifs: Add authentication nodes to journal
        ubifs: authentication: Add hashes to index nodes
        ubifs: Add hashes to the tree node cache
        ubifs: Create functions to embed a HMAC in a node
        ubifs: Add helper functions for authentication support
        ubifs: Add separate functions to init/crc a node
        ubifs: Format changes for authentication support
        ubifs: Store read superblock node
        ubifs: Drop write_node
        ...
      42bd06e9
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-4.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 4710e789
      Linus Torvalds authored
      Pull NFS client bugfixes from Trond Myklebust:
       "Highlights include:
      
        Bugfix:
         - Fix build issues on architectures that don't provide 64-bit cmpxchg
      
        Cleanups:
         - Fix a spelling mistake"
      
      * tag 'nfs-for-4.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        NFS: fix spelling mistake, EACCESS -> EACCES
        SUNRPC: Use atomic(64)_t for seq_send(64)
      4710e789
    • Linus Torvalds's avatar
      Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 35e74524
      Linus Torvalds authored
      Pull more timer updates from Thomas Gleixner:
       "A set of commits for the new C-SKY architecture timers"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        dt-bindings: timer: gx6605s SOC timer
        clocksource/drivers/c-sky: Add gx6605s SOC system timer
        dt-bindings: timer: C-SKY Multi-processor timer
        clocksource/drivers/c-sky: Add C-SKY SMP timer
      35e74524
    • Linus Torvalds's avatar
      Merge tag 'ntb-4.20' of git://github.com/jonmason/ntb · 04578e84
      Linus Torvalds authored
      Pull NTB updates from Jon Mason:
       "Fairly minor changes and bug fixes:
      
        NTB IDT thermal changes and hook into hwmon, ntb_netdev clean-up of
        private struct, and a few bug fixes"
      
      * tag 'ntb-4.20' of git://github.com/jonmason/ntb:
        ntb: idt: Alter the driver info comments
        ntb: idt: Discard temperature sensor IRQ handler
        ntb: idt: Add basic hwmon sysfs interface
        ntb: idt: Alter temperature read method
        ntb_netdev: Simplify remove with client device drvdata
        NTB: transport: Try harder to alloc an aligned MW buffer
        ntb: ntb_transport: Mark expected switch fall-throughs
        ntb: idt: Set PCIe bus address to BARLIMITx
        NTB: ntb_hw_idt: replace IS_ERR_OR_NULL with regular NULL checks
        ntb: intel: fix return value for ndev_vec_mask()
        ntb_netdev: fix sleep time mismatch
      04578e84
    • Linus Torvalds's avatar
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 71e56028
      Linus Torvalds authored
      Pull scheduler fixes from Ingo Molnar:
       "A memory (under-)allocation fix and a comment fix"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/topology: Fix off by one bug
        sched/rt: Update comment in pick_next_task_rt()
      71e56028
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 601a8807
      Linus Torvalds authored
      Pull x86 fixes from Ingo Molnar:
       "A number of fixes and some late updates:
      
         - make in_compat_syscall() behavior on x86-32 similar to other
           platforms, this touches a number of generic files but is not
           intended to impact non-x86 platforms.
      
         - objtool fixes
      
         - PAT preemption fix
      
         - paravirt fixes/cleanups
      
         - cpufeatures updates for new instructions
      
         - earlyprintk quirk
      
         - make microcode version in sysfs world-readable (it is already
           world-readable in procfs)
      
         - minor cleanups and fixes"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        compat: Cleanup in_compat_syscall() callers
        x86/compat: Adjust in_compat_syscall() to generic code under !COMPAT
        objtool: Support GCC 9 cold subfunction naming scheme
        x86/numa_emulation: Fix uniform-split numa emulation
        x86/paravirt: Remove unused _paravirt_ident_32
        x86/mm/pat: Disable preemption around __flush_tlb_all()
        x86/paravirt: Remove GPL from pv_ops export
        x86/traps: Use format string with panic() call
        x86: Clean up 'sizeof x' => 'sizeof(x)'
        x86/cpufeatures: Enumerate MOVDIR64B instruction
        x86/cpufeatures: Enumerate MOVDIRI instruction
        x86/earlyprintk: Add a force option for pciserial device
        objtool: Support per-function rodata sections
        x86/microcode: Make revision and processor flags world-readable
      601a8807
    • Linus Torvalds's avatar
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 01897f3e
      Linus Torvalds authored
      Pull perf updates and fixes from Ingo Molnar:
       "These are almost all tooling updates: 'perf top', 'perf trace' and
        'perf script' fixes and updates, an UAPI header sync with the merge
        window versions, license marker updates, much improved Sparc support
        from David Miller, and a number of fixes"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits)
        perf intel-pt/bts: Calculate cpumode for synthesized samples
        perf intel-pt: Insert callchain context into synthesized callchains
        perf tools: Don't clone maps from parent when synthesizing forks
        perf top: Start display thread earlier
        tools headers uapi: Update linux/if_link.h header copy
        tools headers uapi: Update linux/netlink.h header copy
        tools headers: Sync the various kvm.h header copies
        tools include uapi: Update linux/mmap.h copy
        perf trace beauty: Use the mmap flags table generated from headers
        perf beauty: Wire up the mmap flags table generator to the Makefile
        perf beauty: Add a generator for MAP_ mmap's flag constants
        tools include uapi: Update asound.h copy
        tools arch uapi: Update asm-generic/unistd.h and arm64 unistd.h copies
        tools include uapi: Update linux/fs.h copy
        perf callchain: Honour the ordering of PERF_CONTEXT_{USER,KERNEL,etc}
        perf cs-etm: Correct CPU mode for samples
        perf unwind: Take pgoff into account when reporting elf to libdwfl
        perf top: Do not use overwrite mode by default
        perf top: Allow disabling the overwrite mode
        perf trace: Beautify mount's first pathname arg
        ...
      01897f3e
    • Linus Torvalds's avatar
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e9ebc215
      Linus Torvalds authored
      Pull irq fixes from Ingo Molnar:
       "An irqchip driver fix and a memory (over-)allocation fix"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/irq-mvebu-sei: Fix a NULL vs IS_ERR() bug in probe function
        irq/matrix: Fix memory overallocation
      e9ebc215
  8. 03 Nov, 2018 9 commits