1. 03 May, 2019 6 commits
    • irqchip/gic-v3-mbi: Don't map the MSI page in mbi_compose_m{b, s}i_msg() · 73103975
      Julien Grall authored
      The functions mbi_compose_m{b, s}i_msg() may be called from non-preemptible
      context. However, on RT, iommu_dma_map_msi_msg() must be called from
      preemptible context.
      
      A recent patch split iommu_dma_map_msi_msg() into two new functions:
      one that must be called in preemptible context, and one that has no
      such requirement.
      
      The GICv3 MSI driver is reworked so that code requiring a preemptible
      context is no longer executed in non-preemptible context. This is
      achieved by preparing the MSI mapping when the MSI interrupt is
      allocated.
      Signed-off-by: Julien Grall <julien.grall@arm.com>
      [maz: only call iommu_dma_prepare_msi once, fix commit log accordingly]
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
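A minimal userspace sketch of the pattern these driver changes follow, assuming simplified stand-in types: the mapping is prepared once in the (preemptible) allocation path, and the compose path only patches the already-known IOVA into the message. `fake_prepare_msi()`, `fake_compose_msi_msg()`, `drv_irq_domain_alloc()` and `drv_compose_msi_msg()` are hypothetical stand-ins, not the real GICv3 MBI code; the 4 KiB page offset mask and the addresses are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins; names and layout are hypothetical, not kernel types. */
struct msi_msg { uint32_t address_lo, address_hi, data; };
struct msi_desc { const uint64_t *iova_cookie; };

static const uint64_t doorbell_iova = 0x8000000u; /* assumed IOMMU mapping */
static int prepare_calls;                        /* instrumentation for the sketch */

/* Stand-in for the prepare step: may sleep, so only called from alloc. */
static int fake_prepare_msi(struct msi_desc *desc, uint64_t doorbell_phys)
{
	(void)doorbell_phys;
	prepare_calls++;
	desc->iova_cookie = &doorbell_iova;
	return 0;
}

/* Stand-in for the compose step: no locks, no allocation. */
static void fake_compose_msi_msg(const struct msi_desc *desc, struct msi_msg *msg)
{
	if (desc->iova_cookie) {
		msg->address_hi = (uint32_t)(*desc->iova_cookie >> 32);
		/* keep the offset within the 4 KiB page, swap in the IOVA base */
		msg->address_lo = (msg->address_lo & 0xfffu) + (uint32_t)*desc->iova_cookie;
	}
}

/* Allocation path (preemptible): prepare the mapping once for the whole range. */
static int drv_irq_domain_alloc(struct msi_desc *desc, uint64_t doorbell_phys,
				unsigned int nr_irqs)
{
	int err = fake_prepare_msi(desc, doorbell_phys);

	if (err)
		return err;
	/* ... allocate nr_irqs hwirqs here; no further IOMMU work needed ... */
	(void)nr_irqs;
	return 0;
}

/* Compose path (possibly atomic): build the message, then patch the address. */
static void drv_compose_msi_msg(const struct msi_desc *desc, uint64_t doorbell_phys,
				uint32_t hwirq, struct msi_msg *msg)
{
	msg->address_hi = (uint32_t)(doorbell_phys >> 32);
	msg->address_lo = (uint32_t)doorbell_phys;
	msg->data = hwirq;
	fake_compose_msi_msg(desc, msg);
}
```

Composing messages for four vectors after one allocation leaves `prepare_calls` at 1, which is the intent of the maintainer's note above: the work that may sleep happens once per allocation, never in the compose path.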
    • irqchip/ls-scfg-msi: Don't map the MSI page in ls_scfg_msi_compose_msg() · 2cb3b165
      Julien Grall authored
      ls_scfg_msi_compose_msg() may be called from non-preemptible context.
      However, on RT, iommu_dma_map_msi_msg() must be called from
      preemptible context.
      
      A recent patch split iommu_dma_map_msi_msg() into two new functions:
      one that must be called in preemptible context, and one that has no
      such requirement.
      
      The Freescale SCFG MSI driver is reworked so that code requiring a
      preemptible context is no longer executed in non-preemptible context.
      This is achieved by preparing the MSI mapping when the MSI interrupt
      is allocated.
      Signed-off-by: Julien Grall <julien.grall@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
    • irqchip/gic-v3-its: Don't map the MSI page in its_irq_compose_msi_msg() · 35ae7df2
      Julien Grall authored
      its_irq_compose_msi_msg() may be called from non-preemptible context.
      However, on RT, iommu_dma_map_msi_msg() must be called from
      preemptible context.
      
      A recent change split iommu_dma_map_msi_msg() into two new functions:
      one that must be called in preemptible context, and one that has no
      such requirement.
      
      The GICv3 ITS driver is reworked so that code requiring a preemptible
      context is no longer executed in non-preemptible context. This is
      achieved by preparing the MSI mapping when the MSI interrupt is
      allocated.
      Signed-off-by: Julien Grall <julien.grall@arm.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
    • irqchip/gicv2m: Don't map the MSI page in gicv2m_compose_msi_msg() · 737be747
      Julien Grall authored
      gicv2m_compose_msi_msg() may be called from non-preemptible context.
      However, on RT, iommu_dma_map_msi_msg() must be called from
      preemptible context.
      
      A recent change split iommu_dma_map_msi_msg() into two new functions:
      one that must be called in preemptible context, and one that has no
      such requirement.
      
      The GICv2m driver is reworked so that code requiring a preemptible
      context is no longer executed in non-preemptible context. This is
      achieved by preparing the MSI mapping when the MSI interrupt is
      allocated.
      Signed-off-by: Julien Grall <julien.grall@arm.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
    • iommu/dma-iommu: Split iommu_dma_map_msi_msg() in two parts · ece6e6f0
      Julien Grall authored
      On RT, iommu_dma_map_msi_msg() may be called from non-preemptible
      context. This leads to a splat with CONFIG_DEBUG_ATOMIC_SLEEP, as the
      function uses spin_lock (spinlocks can sleep on RT).
      
      iommu_dma_map_msi_msg() is used to map the MSI page in the IOMMU page
      tables and update the MSI message with the IOVA.
      
      Only the MSI page lookup must be done in preemptible context. As the
      MSI page cannot change over the lifetime of the MSI interrupt, the
      lookup can be cached and reused later.
      
      iommu_dma_map_msi_msg() is now split into two functions:
          - iommu_dma_prepare_msi(): prepares the mapping in the IOMMU and
          stores the cookie in struct msi_desc. This function must be called
          in preemptible context.
          - iommu_dma_compose_msi_msg(): updates the MSI message with the
          IOVA when the device is behind an IOMMU.
      Signed-off-by: Julien Grall <julien.grall@arm.com>
      Reviewed-by: Robin Murphy <robin.murphy@arm.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Acked-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
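The prepare/compose split can be modeled in plain C as below. This is a sketch under stated assumptions: the structures are simplified stand-ins for `struct msi_desc` and `struct msi_msg`, a small lookup table replaces the real IOMMU mapping code, and a fixed 4 KiB granule is assumed. `model_prepare_msi()` and `model_compose_msi_msg()` are hypothetical names. The compose step keeps the doorbell's offset within its page and swaps the page base for the cached IOVA; it takes no locks and allocates nothing, which is what makes it safe in non-preemptible context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for kernel structures (assumptions, not kernel types). */
struct msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

struct msi_page {
	uint64_t phys; /* physical address of the doorbell page */
	uint64_t iova; /* IOVA the page is mapped at in the IOMMU */
};

struct msi_desc {
	const struct msi_page *iommu_cookie; /* cached by the prepare step */
};

#define MSI_GRANULE 4096u /* assumed IOMMU page size */

/* Hypothetical mapping table; in the kernel this lookup may take a sleeping lock. */
static const struct msi_page msi_pages[] = {
	{ .phys = 0x08040000u, .iova = 0x8000000u },
};

/* Preemptible step: look up the mapping once and cache it in the descriptor. */
static int model_prepare_msi(struct msi_desc *desc, uint64_t doorbell_phys)
{
	uint64_t page = doorbell_phys & ~(uint64_t)(MSI_GRANULE - 1);

	for (size_t i = 0; i < sizeof(msi_pages) / sizeof(msi_pages[0]); i++) {
		if (msi_pages[i].phys == page) {
			desc->iommu_cookie = &msi_pages[i];
			return 0;
		}
	}
	return -1; /* a real implementation would return an errno */
}

/* Atomic-safe step: no locks, no allocation; just rewrite the address. */
static void model_compose_msi_msg(const struct msi_desc *desc, struct msi_msg *msg)
{
	const struct msi_page *p = desc->iommu_cookie;

	if (!p)
		return; /* no cached mapping: keep the physical address */

	msg->address_hi = (uint32_t)(p->iova >> 32);
	msg->address_lo &= MSI_GRANULE - 1;   /* keep the offset within the page */
	msg->address_lo += (uint32_t)p->iova; /* add the IOVA page base */
}
```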
    • genirq/msi: Add a new field in msi_desc to store an IOMMU cookie · aaebdf8d
      Julien Grall authored
      When an MSI doorbell is located downstream of an IOMMU, its physical
      address must be swizzled with an appropriately-mapped IOVA for any
      device attached to one of our DMA ops domains.
      
      Currently, the mapping may be allocated when the message is composed.
      However, composing may happen in non-preemptible context, while the
      allocation must be done in preemptible context.
      
      A follow-up change will split the current logic into two functions,
      which requires keeping an IOMMU cookie per MSI.
      
      A new field is introduced in msi_desc to store an IOMMU cookie. As the
      cookie may not be required in some configurations, the field is guarded
      by a new config option, CONFIG_IRQ_MSI_IOMMU.
      
      A pair of helpers has also been introduced to access the field.
      Signed-off-by: Julien Grall <julien.grall@arm.com>
      Reviewed-by: Robin Murphy <robin.murphy@arm.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
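A sketch of a config-guarded field with a pair of accessor helpers, assuming the Kconfig option is modeled as a plain preprocessor macro in userspace. The commit text does not name the helpers, so `desc_get_iommu_cookie()` and `desc_set_iommu_cookie()` are placeholders, and the `struct msi_desc` here is a cut-down stand-in. When the option is off, the getter returns NULL and the setter does nothing, so callers need no `#ifdef` of their own.

```c
#include <assert.h>
#include <stddef.h>

/* Comment this out to mimic building without the (modeled) config option. */
#define CONFIG_IRQ_MSI_IOMMU 1

/* Cut-down stand-in for the kernel's msi_desc. */
struct msi_desc {
	int irq;
#ifdef CONFIG_IRQ_MSI_IOMMU
	const void *iommu_cookie; /* only present when the option is enabled */
#endif
};

#ifdef CONFIG_IRQ_MSI_IOMMU
static inline const void *desc_get_iommu_cookie(const struct msi_desc *desc)
{
	return desc->iommu_cookie;
}

static inline void desc_set_iommu_cookie(struct msi_desc *desc, const void *cookie)
{
	desc->iommu_cookie = cookie;
}
#else
/* Without the option the field does not exist: get returns NULL, set is a no-op. */
static inline const void *desc_get_iommu_cookie(const struct msi_desc *desc)
{
	(void)desc;
	return NULL;
}

static inline void desc_set_iommu_cookie(struct msi_desc *desc, const void *cookie)
{
	(void)desc;
	(void)cookie;
}
#endif
```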
  2. 01 May, 2019 14 commits
  3. 29 Apr, 2019 13 commits
  4. 14 Apr, 2019 6 commits
    • Linux 5.1-rc5 · dc4060a5
      Linus Torvalds authored
    • Merge branch 'page-refs' (page ref overflow) · 6b3a7077
      Linus Torvalds authored
      Merge page ref overflow branch.
      
      Jann Horn reported that he can overflow the page ref count with
      sufficient memory (and a filesystem that is intentionally extremely
      slow).
      
      Admittedly it's not exactly easy.  To have more than four billion
      references to a page requires a minimum of 32GB of kernel memory just
      for the pointers to the pages, much less any metadata to keep track of
      those pointers.  Jann needed a total of 140GB of memory and a specially
      crafted filesystem that leaves all reads pending (in order to not ever
      free the page references and just keep adding more).
      
      Still, we have a fairly straightforward way to limit the two obvious
      user-controllable sources of page references: direct-IO like page
      references gotten through get_user_pages(), and the splice pipe page
      duplication.  So let's just do that.
      
      * branch page-refs:
        fs: prevent page refcount overflow in pipe_buf_get
        mm: prevent get_user_pages() from overflowing page refcount
        mm: add 'try_get_page()' helper function
        mm: make page ref count overflow check tighter and more explicit
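The 32GB figure in the merge message follows from counting only the pointers, assuming 8-byte pointers on a 64-bit kernel; the second line gives the refusal threshold (the sign bit of a 32-bit count) used by the fixes in this branch:

```latex
\begin{aligned}
2^{32}\ \text{references} \times 8\ \text{bytes} &= 2^{35}\ \text{bytes} = 32\ \text{GiB} \\
\text{refusal threshold} &= 2^{31} \approx 2.1 \times 10^{9}\ \text{references}
\end{aligned}
```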
    • fs: prevent page refcount overflow in pipe_buf_get · 15fab63e
      Matthew Wilcox authored
      Change pipe_buf_get() to return a bool indicating whether it succeeded
      in raising the refcount of the page (if the thing in the pipe is a page).
      This removes another mechanism for overflowing the page refcount.  All
      callers are converted to handle a failure.
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
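A sketch of what "callers converted to handle a failure" can look like, assuming a simplified page and pipe-buffer model in userspace. `fake_pipe_buf_get()` and `dup_pipe_buffer()` are hypothetical; the -1 return stands in for whatever errno a real caller would choose. The refusal rule (count zero, or sign bit set when read as a signed 32-bit value) follows the try_get_page() description in this branch, with the count stored unsigned so the sketch avoids signed-overflow undefined behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified page and pipe buffer models (hypothetical, not kernel types). */
struct fake_page { uint32_t refcount; };
struct fake_pipe_buffer { struct fake_page *page; };

/* Take a reference unless the count is zero or already "negative". */
static bool fake_pipe_buf_get(struct fake_pipe_buffer *buf)
{
	uint32_t c = buf->page->refcount;

	if (c == 0 || c >= 0x80000000u)
		return false;
	buf->page->refcount = c + 1;
	return true;
}

/* A caller that handles failure: duplicate a buffer into another pipe slot. */
static int dup_pipe_buffer(struct fake_pipe_buffer *dst, struct fake_pipe_buffer *src)
{
	if (!fake_pipe_buf_get(src))
		return -1; /* hypothetical error; leave dst untouched */
	*dst = *src;      /* both slots now share the page, one reference each */
	return 0;
}
```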
    • mm: prevent get_user_pages() from overflowing page refcount · 8fde12ca
      Linus Torvalds authored
      If the page refcount wraps around past zero, it will be freed while
      there are still four billion references to it.  One of the possible
      avenues for an attacker to try to make this happen is by doing direct IO
      on a page multiple times.  This patch makes get_user_pages() refuse to
      take a new page reference if there are already more than two billion
      references to the page.
      Reported-by: Jann Horn <jannh@google.com>
      Acked-by: Matthew Wilcox <willy@infradead.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
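The effect on get_user_pages() can be sketched as a loop that stops at the first page it cannot safely reference, assuming the common convention of returning how many pages were pinned (or an error if none were). `fake_pin_pages()` and `fake_try_get_page()` are illustrative stand-ins, not the real mm code; a 32-bit unsigned counter models the refcount, so "more than two billion references" means the sign bit (2^31) is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fake_page { uint32_t refcount; };

/* Refuse a new reference once the count is zero or has reached 2^31. */
static bool fake_try_get_page(struct fake_page *page)
{
	uint32_t c = page->refcount;

	if (c == 0 || c >= 0x80000000u)
		return false;
	page->refcount = c + 1;
	return true;
}

/*
 * Hypothetical get_user_pages()-style loop: pin pages in order and stop at
 * the first one that cannot safely take another reference. Returns the
 * number of pages pinned, or -1 if none could be pinned.
 */
static long fake_pin_pages(struct fake_page **pages, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!fake_try_get_page(pages[i]))
			break;
	}
	return i ? (long)i : -1;
}
```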
    • mm: add 'try_get_page()' helper function · 88b1a17d
      Linus Torvalds authored
      This is the same as the traditional 'get_page()' function, but instead
      of unconditionally incrementing the reference count of the page, it only
      does so if the count was "safe".  It returns whether the reference count
      was incremented (and is marked __must_check, since the caller obviously
      has to be aware of it).
      
      Also like 'get_page()', you can't use this function unless you already
      had a reference to the page.  The intent is that you can use this
      exactly like get_page(), but in situations where you want to limit the
      maximum reference count.
      
      The code currently does an unconditional WARN_ON_ONCE() if we ever hit
      the reference count issues (either zero or negative), as a notification
      that the conditional non-increment actually happened.
      
      NOTE! The count access for the "safety" check is inherently racy, but
      that doesn't matter since the buffer we use is basically half the range
      of the reference count (i.e. we look at the sign of the count).
      Acked-by: Matthew Wilcox <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
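A userspace model of the helper as described, assuming a 32-bit counter stored unsigned (to avoid signed-overflow undefined behavior in the sketch) and a one-shot stderr message standing in for WARN_ON_ONCE(). The kernel's `__must_check` is modeled with the GCC/Clang `warn_unused_result` attribute; `fake_try_get_page()` is a hypothetical name. Because refusal starts at 2^31, increments that race past the unlocked check still have roughly two billion counts of headroom before a real wrap, which is why the racy read is harmless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct fake_page { uint32_t refcount; };

static bool warned; /* mimics WARN_ON_ONCE(): report only the first hit */

/* True if the count, read as a signed 32-bit value, is zero or negative. */
static bool ref_unsafe(uint32_t c)
{
	return c == 0 || c >= 0x80000000u;
}

/* Like get_page(), but refuses instead of incrementing an unsafe count. */
static __attribute__((warn_unused_result)) bool fake_try_get_page(struct fake_page *page)
{
	if (ref_unsafe(page->refcount)) {
		if (!warned) {
			warned = true;
			fprintf(stderr, "WARNING: refusing page reference (count=%u)\n",
				(unsigned int)page->refcount);
		}
		return false;
	}
	page->refcount++; /* safe: count is below 2^31 here */
	return true;
}
```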
    • mm: make page ref count overflow check tighter and more explicit · f958d7b5
      Linus Torvalds authored
      We have a VM_BUG_ON() to check that the page reference count doesn't
      underflow (or get close to overflow) by checking the sign of the count.
      
      That's all fine, but we actually want to allow people to use a "get page
      ref unless it's already very high" helper function, and we want that one
      to use the sign of the page ref (without triggering this VM_BUG_ON).
      
      Change the VM_BUG_ON to only check for small underflows (or _very_ close
      to overflowing), and ignore overflows which have strayed into negative
      territory.
      Acked-by: Matthew Wilcox <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
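The tightened check can be expressed with unsigned wraparound so that only zero and small negative counts trip it. This sketch assumes a window of 127 counts (modeled on the kernel's `page_ref_zero_or_close_to_overflow()` macro; treat the exact constant as an assumption), and both function names are hypothetical. The first function is the sign-based check the commit describes replacing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sign-based check: zero or any count with the sign bit set trips it. */
static bool sign_check_trips(uint32_t count)
{
	return count == 0 || count >= 0x80000000u;
}

/*
 * Tightened check: true only for 0 and for -127..-1 read as signed.
 * Adding 127 maps exactly those values into 0..127 (the negative ones by
 * wrapping past 2^32); every other count lands above 127.
 */
static bool zero_or_close_to_overflow(uint32_t count)
{
	return (uint32_t)(count + 127u) <= 127u;
}
```

With this, a count of 2^31 (0x80000000), which try_get_page() deliberately refuses, no longer trips the debug check, while 0 and -1 still do.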
  5. 13 Apr, 2019 1 commit
    • Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-block · 4443f8e6
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Set of fixes that should go into this round. This pull is larger than
        I'd like at this time, but there's really no specific reason for that.
        Some are fixes for issues that went into this merge window, others are
        not. Anyway, this contains:
      
         - Hardware queue limiting for virtio-blk/scsi (Dongli)
      
         - Multi-page bvec fixes for lightnvm pblk
      
         - Multi-bio dio error fix (Jason)
      
         - Remove the cache hint from the io_uring tool side, since we didn't
           move forward with that (me)
      
         - Make io_uring SETUP_SQPOLL root restricted (me)
      
         - Fix leak of page in error handling for pc requests (Jérôme)
      
         - Fix BFQ regression introduced in this merge window (Paolo)
      
         - Fix break logic for bio segment iteration (Ming)
      
         - Fix NVMe cancel request error handling (Ming)
      
         - NVMe pull request with two fixes (Christoph):
             - fix the initial CSN for nvme-fc (James)
             - handle log page offsets properly in the target (Keith)"
      
      * tag 'for-linus-20190412' of git://git.kernel.dk/linux-block:
        block: fix the return errno for direct IO
        nvmet: fix discover log page when offsets are used
        nvme-fc: correct csn initialization and increments on error
        block: do not leak memory in bio_copy_user_iov()
        lightnvm: pblk: fix crash in pblk_end_partial_read due to multipage bvecs
        nvme: cancel request synchronously
        blk-mq: introduce blk_mq_complete_request_sync()
        scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids
        virtio-blk: limit number of hw queues by nr_cpu_ids
        block, bfq: fix use after free in bfq_bfqq_expire
        io_uring: restrict IORING_SETUP_SQPOLL to root
        tools/io_uring: remove IOCQE_FLAG_CACHEHIT
        block: don't use for-inside-for in bio_for_each_segment_all