1. 08 Apr, 2017 31 commits
    • John Garry's avatar
      scsi: libsas: fix ata xfer length · cf31d6d2
      John Garry authored
      commit 9702c67c upstream.
      
      The total ata xfer length may not be calculated properly, in that we do
      not use the proper method to get an sg element dma length.
      
      According to the code comment, sg_dma_len() should be used after
      dma_map_sg() is called.
      
      This issue was found by turning on the SMMUv3 in front of the hisi_sas
      controller in hip07. Multiple sg elements were being combined into a
      single element, but the original first element length was being use as
      the total xfer length.
      
      Fixes: ff2aeb1e ("libata: convert to chained sg")
      Signed-off-by: default avatarJohn Garry <john.garry@huawei.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cf31d6d2
    • peter chang's avatar
      scsi: sg: check length passed to SG_NEXT_CMD_LEN · c2a86952
      peter chang authored
      commit bf33f87d upstream.
      
      The user can control the size of the next command passed along, but the
      value passed to the ioctl isn't checked against the usable max command
      size.
      Signed-off-by: default avatarPeter Chang <dpf@google.com>
      Acked-by: default avatarDouglas Gilbert <dgilbert@interlog.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c2a86952
    • Christoph Hellwig's avatar
      xfs: try any AG when allocating the first btree block when reflinking · d5dbd1c9
      Christoph Hellwig authored
      commit 2fcc319d upstream.
      
      When a reflink operation causes the bmap code to allocate a btree block
      we're currently doing single-AG allocations due to having ->firstblock
      set and then try any higher AG due a little reflink quirk we've put in
      when adding the reflink code.  But given that we do not have a minleft
      reservation of any kind in this AG we can still not have any space in
      the same or higher AG even if the file system has enough free space.
      To fix this use a XFS_ALLOCTYPE_FIRST_AG allocation in this fall back
      path instead.
      
      [And yes, we need to redo this properly instead of piling hacks over
       hacks.  I'm working on that, but it's not going to be a small series.
       In the meantime this fixes the customer reported issue]
      
      Also add a warning for failing allocations to make it easier to debug.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d5dbd1c9
    • Brian Foster's avatar
      xfs: use iomap new flag for newly allocated delalloc blocks · da617af8
      Brian Foster authored
      commit f65e6fad upstream.
      
      Commit fa7f138a ("xfs: clear delalloc and cache on buffered write
      failure") fixed one regression in the iomap error handling code and
      exposed another. The fundamental problem is that if a buffered write
      is a rewrite of preexisting delalloc blocks and the write fails, the
      failure handling code can punch out preexisting blocks with valid
      file data.
      
      This was reproduced directly by sub-block writes in the LTP
      kernel/syscalls/write/write03 test. A first 100 byte write allocates
      a single block in a file. A subsequent 100 byte write fails and
      punches out the block, including the data successfully written by
      the previous write.
      
      To address this problem, update the ->iomap_begin() handler to
      distinguish newly allocated delalloc blocks from preexisting
      delalloc blocks via the IOMAP_F_NEW flag. Use this flag in the
      ->iomap_end() handler to decide when a failed or short write should
      punch out delalloc blocks.
      
      This introduces the subtle requirement that ->iomap_begin() should
      never combine newly allocated delalloc blocks with existing blocks
      in the resulting iomap descriptor. This can occur when a new
      delalloc reservation merges with a neighboring extent that is part
      of the current write, for example. Therefore, drop the
      post-allocation extent lookup from xfs_bmapi_reserve_delalloc() and
      just return the record inserted into the fork. This ensures only new
      blocks are returned and thus that preexisting delalloc blocks are
      always handled as "found" blocks and not punched out on a failed
      rewrite.
      Reported-by: default avatarXiong Zhou <xzhou@redhat.com>
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      da617af8
    • Chandan Rajendra's avatar
      xfs: Use xfs_icluster_size_fsb() to calculate inode alignment mask · 77aedb0c
      Chandan Rajendra authored
      commit d5825712 upstream.
      
      When block size is larger than inode cluster size, the call to
      XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
      would have set xfs_sb->sb_inoalignmt to 0. Hence in
      xfs_set_inoalignment(), xfs_mount->m_inoalign_mask gets initialized to
      -1 instead of 0. However, xfs_mount->m_sinoalign would get correctly
      intialized to 0 because for every positive value of xfs_mount->m_dalign,
      the condition "!(mp->m_dalign & mp->m_inoalign_mask)" would evaluate to
      false.
      
      Also, xfs_imap() worked fine even with xfs_mount->m_inoalign_mask having
      -1 as the value because blks_per_cluster variable would have the value 1
      and hence we would never have a need to use xfs_mount->m_inoalign_mask
      to compute the inode chunk's agbno and offset within the chunk.
      Signed-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      77aedb0c
    • Christoph Hellwig's avatar
      xfs: fix and streamline error handling in xfs_end_io · d07b5855
      Christoph Hellwig authored
      commit 787eb485 upstream.
      
      There are two different cases of buffered I/O errors:
      
       - first we can have an already shutdown fs.  In that case we should skip
         any on-disk operations and just clean up the appen transaction if
         present and destroy the ioend
       - a real I/O error.  In that case we should cleanup any lingering COW
         blocks.  This gets skipped in the current code and is fixed by this
         patch.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d07b5855
    • Christoph Hellwig's avatar
      xfs: only reclaim unwritten COW extents periodically · 3b83a02a
      Christoph Hellwig authored
      commit 3802a345 upstream.
      
      We only want to reclaim preallocations from our periodic work item.
      Currently this is archived by looking for a dirty inode, but that check
      is rather fragile.  Instead add a flag to xfs_reflink_cancel_cow_* so
      that the caller can ask for just cancelling unwritten extents in the COW
      fork.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      [darrick: fix typos in commit message]
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b83a02a
    • Christoph Hellwig's avatar
      xfs: tune down agno asserts in the bmap code · a2402936
      Christoph Hellwig authored
      commit 410d17f6 upstream.
      
      In various places we currently assert that xfs_bmap_btalloc allocates
      from the same as the firstblock value passed in, unless it's either
      NULLAGNO or the dop_low flag is set.  But the reflink code does not
      fully follow this convention as it passes in firstblock purely as
      a hint for the allocator without actually having previous allocations
      in the transaction, and without having a minleft check on the current
      AG, leading to the assert firing on a very full and heavily used
      file system.  As even the reflink code only allocates from equal or
      higher AGs for now we can simply the check to always allow for equal
      or higher AGs.
      
      Note that we need to eventually split the two meanings of the firstblock
      value.  At that point we can also allow the reflink code to allocate
      from any AG instead of limiting it in any way.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a2402936
    • Chandan Rajendra's avatar
      xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignment · 9559c48c
      Chandan Rajendra authored
      commit 8ee9fdbe upstream.
      
      On a ppc64 system, executing generic/256 test with 32k block size gives the following call trace,
      
      XFS: Assertion failed: args->maxlen > 0, file: /root/repos/linux/fs/xfs/libxfs/xfs_alloc.c, line: 2026
      
      kernel BUG at /root/repos/linux/fs/xfs/xfs_message.c:113!
      Oops: Exception in kernel mode, sig: 5 [#1]
      SMP NR_CPUS=2048
      DEBUG_PAGEALLOC
      NUMA
      pSeries
      Modules linked in:
      CPU: 2 PID: 19361 Comm: mkdir Not tainted 4.10.0-rc5 #58
      task: c000000102606d80 task.stack: c0000001026b8000
      NIP: c0000000004ef798 LR: c0000000004ef798 CTR: c00000000082b290
      REGS: c0000001026bb090 TRAP: 0700   Not tainted  (4.10.0-rc5)
      MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>
      CR: 28004428  XER: 00000000
      CFAR: c0000000004ef180 SOFTE: 1
      GPR00: c0000000004ef798 c0000001026bb310 c000000001157300 ffffffffffffffea
      GPR04: 000000000000000a c0000001026bb130 0000000000000000 ffffffffffffffc0
      GPR08: 00000000000000d1 0000000000000021 00000000ffffffd1 c000000000dd4990
      GPR12: 0000000022004444 c00000000fe00800 0000000020000000 0000000000000000
      GPR16: 0000000000000000 0000000043a606fc 0000000043a76c08 0000000043a1b3d0
      GPR20: 000001002a35cd60 c0000001026bbb80 0000000000000000 0000000000000001
      GPR24: 0000000000000240 0000000000000004 c00000062dc55000 0000000000000000
      GPR28: 0000000000000004 c00000062ecd9200 0000000000000000 c0000001026bb6c0
      NIP [c0000000004ef798] .assfail+0x28/0x30
      LR [c0000000004ef798] .assfail+0x28/0x30
      Call Trace:
      [c0000001026bb310] [c0000000004ef798] .assfail+0x28/0x30 (unreliable)
      [c0000001026bb380] [c000000000455d74] .xfs_alloc_space_available+0x194/0x1b0
      [c0000001026bb410] [c00000000045b914] .xfs_alloc_fix_freelist+0x144/0x480
      [c0000001026bb580] [c00000000045c368] .xfs_alloc_vextent+0x698/0xa90
      [c0000001026bb650] [c0000000004a6200] .xfs_ialloc_ag_alloc+0x170/0x820
      [c0000001026bb7c0] [c0000000004a9098] .xfs_dialloc+0x158/0x320
      [c0000001026bb8a0] [c0000000004e628c] .xfs_ialloc+0x7c/0x610
      [c0000001026bb990] [c0000000004e8138] .xfs_dir_ialloc+0xa8/0x2f0
      [c0000001026bbaa0] [c0000000004e8814] .xfs_create+0x494/0x790
      [c0000001026bbbf0] [c0000000004e5ebc] .xfs_generic_create+0x2bc/0x410
      [c0000001026bbce0] [c0000000002b4a34] .vfs_mkdir+0x154/0x230
      [c0000001026bbd70] [c0000000002bc444] .SyS_mkdirat+0x94/0x120
      [c0000001026bbe30] [c00000000000b760] system_call+0x38/0xfc
      Instruction dump:
      4e800020 60000000 7c0802a6 7c862378 3c82ffca 7ca72b78 38841c18 7c651b78
      38600000 f8010010 f821ff91 4bfff94d <0fe00000> 60000000 7c0802a6 7c892378
      
      When block size is larger than inode cluster size, the call to
      XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
      would have set xfs_sb->sb_inoalignmt to 0. This causes
      xfs_ialloc_cluster_alignment() to return 0.  Due to this
      args.minalignslop (in xfs_ialloc_ag_alloc()) gets the unsigned
      equivalent of -1 assigned to it. This later causes alloc_len in
      xfs_alloc_space_available() to have a value of 0. In such a scenario
      when args.total is also 0, the assert statement "ASSERT(args->maxlen >
      0);" fails.
      
      This commit fixes the bug by replacing the call to XFS_B_TO_FSBT() in
      xfs_ialloc_cluster_alignment() with a call to xfs_icluster_size_fsb().
      Suggested-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9559c48c
    • Brian Foster's avatar
      xfs: don't reserve blocks for right shift transactions · 5db7b41b
      Brian Foster authored
      commit 48af96ab upstream.
      
      The block reservation for the transaction allocated in
      xfs_shift_file_space() is an artifact of the original collapse range
      support. It exists to handle the case where a collapse range occurs,
      the initial extent is left shifted into a location that forms a
      contiguous boundary with the previous extent and thus the extents
      are merged. This code was subsequently refactored and reused for
      insert range (right shift) support.
      
      If an insert range occurs under low free space conditions, the
      extent at the starting offset is split before the first shift
      transaction is allocated. If the block reservation fails, this
      leaves separate, but contiguous extents around in the inode. While
      not a fatal problem, this is unexpected and will flag a warning on
      subsequent insert range operations on the inode. This problem has
      been reproduce intermittently by generic/270 running against a
      ramdisk device.
      
      Since right shift does not create new extent boundaries in the
      inode, a block reservation for extent merge is unnecessary. Update
      xfs_shift_file_space() to conditionally reserve fs blocks for left
      shift transactions only. This avoids the warning reproduced by
      generic/270.
      Reported-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5db7b41b
    • Darrick J. Wong's avatar
      xfs: fix uninitialized variable in _reflink_convert_cow · e5e2e56f
      Darrick J. Wong authored
      commit 93aaead5 upstream.
      
      Fix an uninitialize variable.
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5e2e56f
    • Brian Foster's avatar
      xfs: split indlen reservations fairly when under reserved · c251c6c2
      Brian Foster authored
      commit 75d65361 upstream.
      
      Certain workoads that punch holes into speculative preallocation can
      cause delalloc indirect reservation splits when the delalloc extent is
      split in two. If further splits occur, an already short-handed extent
      can be split into two in a manner that leaves zero indirect blocks for
      one of the two new extents. This occurs because the shortage is large
      enough that the xfs_bmap_split_indlen() algorithm completely drains the
      requested indlen of one of the extents before it honors the existing
      reservation.
      
      This ultimately results in a warning from xfs_bmap_del_extent(). This
      has been observed during file copies of large, sparse files using 'cp
      --sparse=always.'
      
      To avoid this problem, update xfs_bmap_split_indlen() to explicitly
      apply the reservation shortage fairly between both extents. This smooths
      out the overall indlen shortage and defers the situation where we end up
      with a delalloc extent with zero indlen reservation to extreme
      circumstances.
      Reported-by: default avatarPatrick Dung <mpatdung@gmail.com>
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c251c6c2
    • Brian Foster's avatar
      xfs: handle indlen shortage on delalloc extent merge · 2d7c1c7f
      Brian Foster authored
      commit 0e339ef8 upstream.
      
      When a delalloc extent is created, it can be merged with pre-existing,
      contiguous, delalloc extents. When this occurs,
      xfs_bmap_add_extent_hole_delay() merges the extents along with the
      associated indirect block reservations. The expectation here is that the
      combined worst case indlen reservation is always less than or equal to
      the indlen reservation for the individual extents.
      
      This is not always the case, however, as existing extents can less than
      the expected indlen reservation if the extent was previously split due
      to a hole punch. If a new extent merges with such an extent, the total
      indlen requirement may be larger than the sum of the indlen reservations
      held by both extents.
      
      xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen
      reservation is always available and assigns it to the merged extent
      without consideration for the indlen held by the pre-existing extent. As
      a result, the subsequent xfs_mod_fdblocks() call can attempt an
      unintentional allocation rather than a free (indicated by an ASSERT()
      failure). Further, if the allocation happens to fail in this context,
      the failure goes unhandled and creates a filesystem wide block
      accounting inconsistency.
      
      Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the
      indlen reservation assigned to the merged extent to the sum of the
      indlen reservations held by each of the individual extents.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2d7c1c7f
    • Christoph Hellwig's avatar
      xfs: don't fail xfs_extent_busy allocation · 47d7d1ea
      Christoph Hellwig authored
      commit 5e30c23d upstream.
      
      We don't just need the structure to track busy extents which can be
      avoided with a synchronous transaction, but also to keep track of
      pending discard.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      47d7d1ea
    • Christoph Hellwig's avatar
      xfs: reject all unaligned direct writes to reflinked files · 5bbf5ba6
      Christoph Hellwig authored
      commit 54a4ef8a upstream.
      
      We currently fall back from direct to buffered writes if we detect a
      remaining shared extent in the iomap_begin callback.  But by the time
      iomap_begin is called for the potentially unaligned end block we might
      have already written most of the data to disk, which we'd now write
      again using buffered I/O.  To avoid this reject all writes to reflinked
      files before starting I/O so that we are guaranteed to only write the
      data once.
      
      The alternative would be to unshare the unaligned start and/or end block
      before doing the I/O. I think that's doable, and will actually be
      required to support reflinks on DAX file system.  But it will take a
      little more time and I'd rather get rid of the double write ASAP.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      [slight changes in context due to the new direct I/O code in 4.10+]
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5bbf5ba6
    • Christoph Hellwig's avatar
      xfs: update ctime and mtime on clone destinatation inodes · 67eb7bf8
      Christoph Hellwig authored
      commit c5ecb423 upstream.
      
      We're changing both metadata and data, so we need to update the
      timestamps for clone operations.  Dedupe on the other hand does
      not change file data, and only changes invisible metadata so the
      timestamps should not be updated.
      
      This follows existing btrfs behavior.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      [darrick: remove redundant is_dedupe test]
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      67eb7bf8
    • Hou Tao's avatar
      xfs: reset b_first_retry_time when clear the retry status of xfs_buf_t · e060f488
      Hou Tao authored
      commit 4dd2eb63 upstream.
      
      After successful IO or permanent error, b_first_retry_time also
      needs to be cleared, else the invalid first retry time will be
      used by the next retry check.
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e060f488
    • Darrick J. Wong's avatar
      xfs: mark speculative prealloc CoW fork extents unwritten · e02f0ff2
      Darrick J. Wong authored
      commit 5eda4300 upstream.
      
      Christoph Hellwig pointed out that there's a potentially nasty race when
      performing simultaneous nearby directio cow writes:
      
      "Thread 1 writes a range from B to c
      
      "                    B --------- C
                                 p
      
      "a little later thread 2 writes from A to B
      
      "        A --------- B
                     p
      
      [editor's note: the 'p' denote cowextsize boundaries, which I added to
      make this more clear]
      
      "but the code preallocates beyond B into the range where thread
      "1 has just written, but ->end_io hasn't been called yet.
      "But once ->end_io is called thread 2 has already allocated
      "up to the extent size hint into the write range of thread 1,
      "so the end_io handler will splice the unintialized blocks from
      "that preallocation back into the file right after B."
      
      We can avoid this race by ensuring that thread 1 cannot accidentally
      remap the blocks that thread 2 allocated (as part of speculative
      preallocation) as part of t2's write preparation in t1's end_io handler.
      The way we make this happen is by taking advantage of the unwritten
      extent flag as an intermediate step.
      
      Recall that when we begin the process of writing data to shared blocks,
      we create a delayed allocation extent in the CoW fork:
      
      D: --RRRRRRSSSRRRRRRRR---
      C: ------DDDDDDD---------
      
      When a thread prepares to CoW some dirty data out to disk, it will now
      convert the delalloc reservation into an /unwritten/ allocated extent in
      the cow fork.  The da conversion code tries to opportunistically
      allocate as much of a (speculatively prealloc'd) extent as possible, so
      we may end up allocating a larger extent than we're actually writing
      out:
      
      D: --RRRRRRSSSRRRRRRRR---
      U: ------UUUUUUU---------
      
      Next, we convert only the part of the extent that we're actively
      planning to write to normal (i.e. not unwritten) status:
      
      D: --RRRRRRSSSRRRRRRRR---
      U: ------UURRUUU---------
      
      If the write succeeds, the end_cow function will now scan the relevant
      range of the CoW fork for real extents and remap only the real extents
      into the data fork:
      
      D: --RRRRRRRRSRRRRRRRR---
      U: ------UU--UUU---------
      
      This ensures that we never obliterate valid data fork extents with
      unwritten blocks from the CoW fork.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e02f0ff2
    • Darrick J. Wong's avatar
      xfs: allow unwritten extents in the CoW fork · 8370826f
      Darrick J. Wong authored
      commit 05a630d7 upstream.
      
      In the data fork, we only allow extents to perform the following state
      transitions:
      
      delay -> real <-> unwritten
      
      There's no way to move directly from a delalloc reservation to an
      /unwritten/ allocated extent.  However, for the CoW fork we want to be
      able to do the following to each extent:
      
      delalloc -> unwritten -> written -> remapped to data fork
      
      This will help us to avoid a race in the speculative CoW preallocation
      code between a first thread that is allocating a CoW extent and a second
      thread that is remapping part of a file after a write.  In order to do
      this, however, we need two things: first, we have to be able to
      transition from da to unwritten, and second the function that converts
      between real and unwritten has to be made aware of the cow fork.  Do
      both of those things.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8370826f
    • Darrick J. Wong's avatar
      xfs: verify free block header fields · 3d2bd2fd
      Darrick J. Wong authored
      commit de14c5f5 upstream.
      
      Perform basic sanity checking of the directory free block header
      fields so that we avoid hanging the system on invalid data.
      
      (Granted that just means that now we shutdown on directory write,
      but that seems better than hanging...)
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d2bd2fd
    • Darrick J. Wong's avatar
      xfs: check for obviously bad level values in the bmbt root · 4056a74a
      Darrick J. Wong authored
      commit b3bf607d upstream.
      
      We can't handle a bmbt that's taller than BTREE_MAXLEVELS, and there's
      no such thing as a zero-level bmbt (for that we have extents format),
      so if we see this, send back an error code.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4056a74a
    • Darrick J. Wong's avatar
      xfs: filter out obviously bad btree pointers · efab3ae2
      Darrick J. Wong authored
      commit d5a91bae upstream.
      
      Don't let anybody load an obviously bad btree pointer.  Since the values
      come from disk, we must return an error, not just ASSERT.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      efab3ae2
    • Darrick J. Wong's avatar
      xfs: fail _dir_open when readahead fails · 7e2dd1fb
      Darrick J. Wong authored
      commit 7a652bbe upstream.
      
      When we open a directory, we try to readahead block 0 of the directory
      on the assumption that we're going to need it soon.  If the bmbt is
      corrupt, the directory will never be usable and the readahead fails
      immediately, so we might as well prevent the directory from being opened
      at all.  This prevents a subsequent read or modify operation from
      hitting it and taking the fs offline.
      
      NOTE: We're only checking for early failures in the block mapping, not
      the readahead directory block itself.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7e2dd1fb
    • Darrick J. Wong's avatar
      xfs: fix toctou race when locking an inode to access the data map · 0a6844ab
      Darrick J. Wong authored
      commit 4b5bd5bf upstream.
      
      We use di_format and if_flags to decide whether we're grabbing the ilock
      in btree mode (btree extents not loaded) or shared mode (anything else),
      but the state of those fields can be changed by other threads that are
      also trying to load the btree extents -- IFEXTENTS gets set before the
      _bmap_read_extents call and cleared if it fails.
      
      We don't actually need to have IFEXTENTS set until after the bmbt
      records are successfully loaded and validated, which will fix the race
      between multiple threads trying to read the same directory.  The next
      patch strengthens directory bmbt validation by refusing to open the
      directory if reading the bmbt to start directory readahead fails.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0a6844ab
    • Brian Foster's avatar
      xfs: fix eofblocks race with file extending async dio writes · 4127a5d9
      Brian Foster authored
      commit e4229d6b upstream.
      
      It's possible for post-eof blocks to end up being used for direct I/O
      writes. dio write performs an upfront unwritten extent allocation, sends
      the dio and then updates the inode size (if necessary) on write
      completion. If a file release occurs while a file extending dio write is
      in flight, it is possible to mistake the post-eof blocks for speculative
      preallocation and incorrectly truncate them from the inode. This means
      that the resulting dio write completion can discover a hole and allocate
      new blocks rather than perform unwritten extent conversion.
      
      This requires a strange mix of I/O and is thus not likely to reproduce
      in real world workloads. It is intermittently reproduced by generic/299.
      The error manifests as an assert failure due to transaction overrun
      because the aforementioned write completion transaction has only
      reserved enough blocks for btree operations:
      
        XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, \
         file: fs/xfs//xfs_trans.c, line: 309
      
      The root cause is that xfs_free_eofblocks() uses i_size to truncate
      post-eof blocks from the inode, but async, file extending direct writes
      do not update i_size until write completion, long after inode locks are
      dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode
      to the incorrect size.
      
      Update xfs_free_eofblocks() to serialize against dio similar to how
      extending writes are serialized against i_size updates before post-eof
      block zeroing. Specifically, wait on dio while under the iolock. This
      ensures that dio write completions have updated i_size before post-eof
      blocks are processed.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4127a5d9
    • Brian Foster's avatar
      xfs: sync eofblocks scans under iolock are livelock prone · 4d725d74
      Brian Foster authored
      commit c3155097 upstream.
      
      The xfs_eofblocks.eof_scan_owner field is an internal field to
      facilitate invoking eofb scans from the kernel while under the iolock.
      This is necessary because the eofb scan acquires the iolock of each
      inode. Synchronous scans are invoked on certain buffered write failures
      while under iolock. In such cases, the scan owner indicates that the
      context for the scan already owns the particular iolock and prevents a
      double lock deadlock.
      
      eofblocks scans while under iolock are still livelock prone in the event
      of multiple parallel scans, however. If multiple buffered writes to
      different inodes fail and invoke eofblocks scans at the same time, each
      scan avoids a deadlock with its own inode by virtue of the
      eof_scan_owner field, but will never be able to acquire the iolock of
      the inode from the parallel scan. Because the low free space scans are
      invoked with SYNC_WAIT, the scan will not return until it has processed
      every tagged inode and thus both scans will spin indefinitely on the
      iolock being held across the opposite scan. This problem can be
      reproduced reliably by generic/224 on systems with higher cpu counts
      (x16).
      
      To avoid this problem, simplify the semantics of eofblocks scans to
      never invoke a scan while under iolock. This means that the buffered
      write context must drop the iolock before the scan. It must reacquire
      the lock before the write retry and also repeat the initial write
      checks, as the original state might no longer be valid once the iolock
      was dropped.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      4d725d74
    • Brian Foster's avatar
      xfs: pull up iolock from xfs_free_eofblocks() · 798b1dc5
      Brian Foster authored
      commit a36b9261 upstream.
      
      xfs_free_eofblocks() requires the IOLOCK_EXCL lock, but is called from
      different contexts where the lock may or may not be held. The
      need_iolock parameter exists for this reason, to indicate whether
      xfs_free_eofblocks() must acquire the iolock itself before it can
      proceed.
      
      This is ugly and confusing. Simplify the semantics of
      xfs_free_eofblocks() to require the caller to acquire the iolock
      appropriately and kill the need_iolock parameter. While here, the mp
      param can be removed as well as the xfs_mount is accessible from the
      xfs_inode structure. This patch does not change behavior.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      798b1dc5
    • Christoph Hellwig's avatar
      xfs: use per-AG reservations for the finobt · 08a2a268
      Christoph Hellwig authored
      commit 76d771b4 upstream.
      
      Currently we try to rely on the global reserved block pool for block
      allocations for the free inode btree, but I have customer reports
      (fairly complex workload, need to find an easier reproducer) where that
      is not enough as the AG where we free an inode that requires a new
      finobt block is entirely full.  This causes us to cancel a dirty
      transaction and thus a file system shutdown.
      
      I think the right way to guard against this is to treat the finot the same
      way as the refcount btree and have a per-AG reservations for the possible
      worst case size of it, and the patch below implements that.
      
      Note that this could increase mount times with large finobt trees.  In
      an ideal world we would have added a field for the number of finobt
      fields to the AGI, similar to what we did for the refcount blocks.
      We should do add it next time we rev the AGI or AGF format by adding
      new fields.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      08a2a268
    • Christoph Hellwig's avatar
      xfs: only update mount/resv fields on success in __xfs_ag_resv_init · 9be1c33d
      Christoph Hellwig authored
      commit 4dfa2b84 upstream.
      
      Try to reserve the blocks first and only then update the fields in
      or hanging off the mount structure.  This way we can call __xfs_ag_resv_init
      again after a previous failure.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9be1c33d
    • Ross Lagerwall's avatar
      xen/setup: Don't relocate p2m over existing one · 8b08aec6
      Ross Lagerwall authored
      commit 7ecec850 upstream.
      
      When relocating the p2m, take special care not to relocate it so
      that is overlaps with the current location of the p2m/initrd. This is
      needed since the full extent of the current location is not marked as a
      reserved region in the e820.
      
      This was seen to happen to a dom0 with a large initial p2m and a small
      reserved region in the middle of the initial p2m.
      Signed-off-by: default avatarRoss Lagerwall <ross.lagerwall@citrix.com>
      Reviewed-by: default avatarJuergen Gross <jgross@suse.com>
      Signed-off-by: default avatarJuergen Gross <jgross@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b08aec6
    • Ilya Dryomov's avatar
      libceph: force GFP_NOIO for socket allocations · 86015377
      Ilya Dryomov authored
      commit 633ee407 upstream.
      
      sock_alloc_inode() allocates socket+inode and socket_wq with
      GFP_KERNEL, which is not allowed on the writeback path:
      
          Workqueue: ceph-msgr con_work [libceph]
          ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
          0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
          ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
          Call Trace:
          [<ffffffff816dd629>] schedule+0x29/0x70
          [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
          [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
          [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
          [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
          [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
          [<ffffffff81086335>] flush_work+0x165/0x250
          [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
          [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
          [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
          [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
          [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
          [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
          [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
          [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
          [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
          [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
          [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
          [<ffffffff811c0c18>] super_cache_scan+0x178/0x180
          [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340
          [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450
          [<ffffffff8115af70>] shrink_slab+0x100/0x140
          [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490
          [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0
          [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be
          [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40
          [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110
          [<ffffffff811a0ac5>] new_slab+0x2c5/0x390
          [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459
          [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
          [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0
          [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
          [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0
          [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0
          [<ffffffff811d8566>] alloc_inode+0x26/0xa0
          [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70
          [<ffffffff815b933e>] sock_alloc+0x1e/0x80
          [<ffffffff815ba855>] __sock_create+0x95/0x220
          [<ffffffff815baa04>] sock_create_kern+0x24/0x30
          [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
          [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
          [<ffffffff81084c19>] process_one_work+0x159/0x4f0
          [<ffffffff8108561b>] worker_thread+0x11b/0x530
          [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
          [<ffffffff8108b6f9>] kthread+0xc9/0xe0
          [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
          [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
          [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
      
      Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.
      
      Link: http://tracker.ceph.com/issues/19309Reported-by: default avatarSergey Jerusalimov <wintchester@gmail.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      86015377
  2. 31 Mar, 2017 9 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.9.20 · f6767727
      Greg Kroah-Hartman authored
      f6767727
    • Bin Liu's avatar
      usb: musb: fix possible spinlock deadlock · 1dc3a068
      Bin Liu authored
      commit bc1e2154 upstream.
      
      The DSPS glue calls del_timer_sync() in its musb_platform_disable()
      implementation, which requires the caller to not hold a lock. But
      musb_remove() calls musb_platform_disable() will musb->lock held. This
      could causes spinlock deadlock.
      
      So change musb_remove() to call musb_platform_disable() without holds
      musb->lock. This doesn't impact the musb_platform_disable implementation
      in other glue drivers.
      
      root@am335x-evm:~# modprobe -r musb-dsps
      [  126.134879] musb-hdrc musb-hdrc.1: remove, state 1
      [  126.140465] usb usb2: USB disconnect, device number 1
      [  126.146178] usb 2-1: USB disconnect, device number 2
      [  126.416985] musb-hdrc musb-hdrc.1: USB bus 2 deregistered
      [  126.423943]
      [  126.425525] ======================================================
      [  126.431997] [ INFO: possible circular locking dependency detected ]
      [  126.438564] 4.11.0-rc1-00003-g1557f13bca04-dirty #77 Not tainted
      [  126.444852] -------------------------------------------------------
      [  126.451414] modprobe/778 is trying to acquire lock:
      [  126.456523]  (((&glue->timer))){+.-...}, at: [<c01b8788>] del_timer_sync+0x0/0xd0
      [  126.464403]
      [  126.464403] but task is already holding lock:
      [  126.470511]  (&(&musb->lock)->rlock){-.-...}, at: [<bf30b7f8>] musb_remove+0x50/0x1
      30 [musb_hdrc]
      [  126.479965]
      [  126.479965] which lock already depends on the new lock.
      [  126.479965]
      [  126.488531]
      [  126.488531] the existing dependency chain (in reverse order) is:
      [  126.496368]
      [  126.496368] -> #1 (&(&musb->lock)->rlock){-.-...}:
      [  126.502968]        otg_timer+0x80/0xec [musb_dsps]
      [  126.507990]        call_timer_fn+0xb4/0x390
      [  126.512372]        expire_timers+0xf0/0x1fc
      [  126.516754]        run_timer_softirq+0x80/0x178
      [  126.521511]        __do_softirq+0xc4/0x554
      [  126.525802]        irq_exit+0xe8/0x158
      [  126.529735]        __handle_domain_irq+0x58/0xb8
      [  126.534583]        __irq_usr+0x54/0x80
      [  126.538507]
      [  126.538507] -> #0 (((&glue->timer))){+.-...}:
      [  126.544636]        del_timer_sync+0x40/0xd0
      [  126.549066]        musb_remove+0x6c/0x130 [musb_hdrc]
      [  126.554370]        platform_drv_remove+0x24/0x3c
      [  126.559206]        device_release_driver_internal+0x14c/0x1e0
      [  126.565225]        bus_remove_device+0xd8/0x108
      [  126.569970]        device_del+0x1e4/0x308
      [  126.574170]        platform_device_del+0x24/0x8c
      [  126.579006]        platform_device_unregister+0xc/0x20
      [  126.584394]        dsps_remove+0x14/0x30 [musb_dsps]
      [  126.589595]        platform_drv_remove+0x24/0x3c
      [  126.594432]        device_release_driver_internal+0x14c/0x1e0
      [  126.600450]        driver_detach+0x38/0x6c
      [  126.604740]        bus_remove_driver+0x4c/0xa0
      [  126.609407]        SyS_delete_module+0x11c/0x1e4
      [  126.614252]        __sys_trace_return+0x0/0x10
      
      Fixes: ea2f35c0 ("usb: musb: Fix sleeping function called from invalid context for hdrc glue")
      Acked-by: default avatarTony Lindgren <tony@atomide.com>
      Signed-off-by: default avatarBin Liu <b-liu@ti.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1dc3a068
    • Sebastian Andrzej Siewior's avatar
      sched/rt: Add a missing rescheduling point · 916c5cfe
      Sebastian Andrzej Siewior authored
      commit 619bd4a7 upstream.
      
      Since the change in commit:
      
        fd7a4bed ("sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks")
      
      ... we don't reschedule a task under certain circumstances:
      
      Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on
      CPU0) and holds a PI lock. This task is removed from the CPU because it
      used up its time slice and another SCHED_OTHER task is running. Task-B on
      CPU1 runs at RT priority and asks for the lock owned by task-A. This
      results in a priority boost for task-A. Task-B goes to sleep until the
      lock has been made available. Task-A is already runnable (but not active),
      so it receives no wake up.
      
      The reality now is that task-A gets on the CPU once the scheduler decides
      to remove the current task despite the fact that a high priority task is
      enqueued and waiting. This may take a long time.
      
      The desired behaviour is that CPU0 immediately reschedules after the
      priority boost which made task-A the task with the lowest priority.
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: fd7a4bed ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks")
      Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.deSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      916c5cfe
    • Eric Biggers's avatar
      fscrypt: remove broken support for detecting keyring key revocation · 2984e52c
      Eric Biggers authored
      commit 1b53cf98 upstream.
      
      Filesystem encryption ostensibly supported revoking a keyring key that
      had been used to "unlock" encrypted files, causing those files to become
      "locked" again.  This was, however, buggy for several reasons, the most
      severe of which was that when key revocation happened to be detected for
      an inode, its fscrypt_info was immediately freed, even while other
      threads could be using it for encryption or decryption concurrently.
      This could be exploited to crash the kernel or worse.
      
      This patch fixes the use-after-free by removing the code which detects
      the keyring key having been revoked, invalidated, or expired.  Instead,
      an encrypted inode that is "unlocked" now simply remains unlocked until
      it is evicted from memory.  Note that this is no worse than the case for
      block device-level encryption, e.g. dm-crypt, and it still remains
      possible for a privileged user to evict unused pages, inodes, and
      dentries by running 'sync; echo 3 > /proc/sys/vm/drop_caches', or by
      simply unmounting the filesystem.  In fact, one of those actions was
      already needed anyway for key revocation to work even somewhat sanely.
      This change is not expected to break any applications.
      
      In the future I'd like to implement a real API for fscrypt key
      revocation that interacts sanely with ongoing filesystem operations ---
      waiting for existing operations to complete and blocking new operations,
      and invalidating and sanitizing key material and plaintext from the VFS
      caches.  But this is a hard problem, and for now this bug must be fixed.
      
      This bug affected almost all versions of ext4, f2fs, and ubifs
      encryption, and it was potentially reachable in any kernel configured
      with encryption support (CONFIG_EXT4_ENCRYPTION=y,
      CONFIG_EXT4_FS_ENCRYPTION=y, CONFIG_F2FS_FS_ENCRYPTION=y, or
      CONFIG_UBIFS_FS_ENCRYPTION=y).  Note that older kernels did not use the
      shared fs/crypto/ code, but due to the potential security implications
      of this bug, it may still be worthwhile to backport this fix to them.
      
      Fixes: b7236e21 ("ext4 crypto: reorganize how we store keys in the inode")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Acked-by: default avatarMichael Halcrow <mhalcrow@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2984e52c
    • Dave Martin's avatar
      metag/ptrace: Reject partial NT_METAG_RPIPE writes · 21c95eca
      Dave Martin authored
      commit 7195ee31 upstream.
      
      It's not clear what behaviour is sensible when doing partial write of
      NT_METAG_RPIPE, so just don't bother.
      
      This patch assumes that userspace will never rely on a partial SETREGSET
      in this case, since it's not clear what should happen anyway.
      Signed-off-by: default avatarDave Martin <Dave.Martin@arm.com>
      Acked-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      21c95eca
    • Dave Martin's avatar
      metag/ptrace: Provide default TXSTATUS for short NT_PRSTATUS · 2d6532ce
      Dave Martin authored
      commit 5fe81fe9 upstream.
      
      Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET
      to fill TXSTATUS, a well-defined default value is used, based on the
      task's current value.
      Suggested-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Signed-off-by: default avatarDave Martin <Dave.Martin@arm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2d6532ce
    • Dave Martin's avatar
      metag/ptrace: Preserve previous registers for short regset write · 2739b487
      Dave Martin authored
      commit a78ce80d upstream.
      
      Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET
      to fill all the registers, the thread's old registers are preserved.
      Signed-off-by: default avatarDave Martin <Dave.Martin@arm.com>
      Acked-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2739b487
    • Dave Martin's avatar
      sparc/ptrace: Preserve previous registers for short regset write · 84b94c43
      Dave Martin authored
      commit d3805c54 upstream.
      
      Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET
      to fill all the registers, the thread's old registers are preserved.
      Signed-off-by: default avatarDave Martin <Dave.Martin@arm.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      84b94c43
    • Dave Martin's avatar
      mips/ptrace: Preserve previous registers for short regset write · 0ba34c87
      Dave Martin authored
      commit d614fd58 upstream.
      
      Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET
      to fill all the registers, the thread's old registers are preserved.
      Signed-off-by: default avatarDave Martin <Dave.Martin@arm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0ba34c87