1. 14 May, 2015 9 commits
    • userfaultfd: UFFDIO_REMAP · 093b3515
      Andrea Arcangeli authored
      This remap ioctl atomically moves a page into or out of a
      userfaultfd address space. It's more expensive than "copy" (and of
      course more expensive than "zerofill") because it requires a TLB flush
      on the source range for each ioctl, which is an expensive operation on
      SMP. Copying without a TLB flush is faster, especially when only a few
      pages are transferred at a time.
    • userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE · 729a7f4e
      Andrea Arcangeli authored
      These two ioctls either atomically copy a page or map zeropages
      into the virtual address space. The thread that opened the
      userfaultfd uses them to resolve the userfaults.
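
A minimal sketch of how a resolver thread might issue the two ioctls. It assumes the structure and constant names from the mainline <linux/userfaultfd.h> header, which may differ from the ones in this exact patch; the helper names uffd_copy_page() and uffd_zero_page() are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Resolve a userfault by atomically copying len bytes from src into the
 * registered range starting at dst. Returns 0 on success or -errno.
 * (Hypothetical helper, not part of the patch.)
 */
int uffd_copy_page(int ufd, uint64_t dst, const void *src, uint64_t len)
{
	struct uffdio_copy copy = {
		.dst = dst,
		.src = (uint64_t)(uintptr_t)src,
		.len = len,
		.mode = 0,	/* default mode: wake the faulting thread */
	};

	if (ioctl(ufd, UFFDIO_COPY, &copy) == -1)
		return -errno;
	return 0;
}

/*
 * Resolve a userfault by mapping zeropages over [start, start + len).
 * Returns 0 on success or -errno. (Hypothetical helper.)
 */
int uffd_zero_page(int ufd, uint64_t start, uint64_t len)
{
	struct uffdio_zeropage zero = {
		.range = { .start = start, .len = len },
		.mode = 0,
	};

	if (ioctl(ufd, UFFDIO_ZEROPAGE, &zero) == -1)
		return -errno;
	return 0;
}
```

Both helpers only wrap the ioctl call; creating the userfaultfd and registering the range are left to the caller.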
    • userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read · 5a2b3614
      Andrea Arcangeli authored
      Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
      userfaultfd_read if they are run on different threads simultaneously.
      
      Until now qemu solved the race in userland: the race was explicitly
      and intentionally left for userland to solve. However we can also
      solve it in kernel.
      
      Requiring all users to solve this race if they use two threads (one
      for the background transfer and one for the userfault reads) isn't
      very attractive from an API perspective. Furthermore, solving it in
      the kernel makes it possible to remove a whole bunch of mutex and
      bitmap code from qemu, making it faster. The cost of
      __get_user_pages_fast should be insignificant considering it scales
      perfectly and the pagetables are already hot in the CPU cache,
      compared to the overhead in userland to maintain those structures.
      
      Applying this patch is backwards compatible with respect to the
      userfaultfd userland API, however reverting this change wouldn't be
      backwards compatible anymore.
      
      Without this patch, qemu in the background transfer thread has to read
      the old state and do a UFFDIO_WAKE if old_state is MISSING but has
      become REQUESTED by the time it tries to set it to RECEIVED (signaling
      that the other side received a userfault).
      
          vcpu                background_thr userfault_thr
          -----               -----          -----
          vcpu0 handle_mm_fault()
      
      			postcopy_place_page
      			read old_state -> MISSING
       			UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
      
          vcpu0 fault at 0x7fb76a139000 enters handle_userfault
          poll() is kicked
      
       					poll() -> POLLIN
       					read() -> 0x7fb76a139000
       					postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
      
       			tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
      			/* check that no userfault raced with UFFDIO_COPY */
      			if (old_state == MISSING && tmp_state == REQUESTED)
      				UFFDIO_WAKE from background thread
      
      And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
      
          vcpu                background_thr userfault_thr
          -----               -----          -----
          vcpu0 handle_mm_fault()
      
      			postcopy_place_page
      			read old_state -> MISSING
       			UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
       			tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
      
          vcpu0 fault at 0x7fb76a139000 enters handle_userfault
          poll() is kicked
      
       					poll() -> POLLIN
       					read() -> 0x7fb76a139000
      
       					if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
      						UFFDIO_WAKE from userfault thread
      
      This patch removes the need for both UFFDIO_WAKE and the associated
      per-page tristate.
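
The per-page tristate that the two diagrams rely on can be sketched with C11 atomics. This is only an illustration of the userland bookkeeping the patch makes unnecessary: pmi_change_state() is a hypothetical stand-in for qemu's postcopy_pmi_change_state(), and its return convention (the new state on success, the observed state on failure) is inferred from the diagrams rather than taken from qemu's source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { PMI_MISSING, PMI_REQUESTED, PMI_RECEIVED };

/*
 * Try to move the page state from expected to desired.
 * Returns desired on success, otherwise the state actually observed.
 */
int pmi_change_state(atomic_int *cell, int expected, int desired)
{
	int observed = expected;

	if (atomic_compare_exchange_strong(cell, &observed, desired))
		return desired;
	return observed;
}

/*
 * Background thread, called after UFFDIO_COPY (no wakeup) completed.
 * old_state is the value read before the copy. Returns true when a
 * userfault raced with the copy and a UFFDIO_WAKE is needed.
 */
bool background_needs_wake(atomic_int *cell, int old_state)
{
	int tmp_state = pmi_change_state(cell, old_state, PMI_RECEIVED);

	if (old_state == PMI_MISSING && tmp_state == PMI_REQUESTED) {
		/* Assumption: the page is still marked received afterwards. */
		atomic_store(cell, PMI_RECEIVED);
		return true;
	}
	return false;
}

/*
 * Userfault thread, called after read() returned a fault address.
 * Returns true when the page was already placed, so the blocked
 * fault must be woken from this thread.
 */
bool userfault_needs_wake(atomic_int *cell)
{
	return pmi_change_state(cell, PMI_MISSING, PMI_REQUESTED) == PMI_RECEIVED;
}
```

In both interleavings exactly one of the two threads ends up responsible for the wakeup, which is the invariant the tristate exists to maintain.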
    • userfaultfd: allocate the userfaultfd_ctx cacheline aligned · bd0a30cd
      Andrea Arcangeli authored
      Use a dedicated slab cache to guarantee cacheline alignment.
    • userfaultfd: optimize read() and poll() to be O(1) · 18c5b6c4
      Andrea Arcangeli authored
      This makes read() O(1); poll(), which was already O(1), becomes
      lockless.
    • userfaultfd: wake pending userfaults · a1837777
      Andrea Arcangeli authored
      This is an optimization but it's a userland visible one and it affects
      the API.
      
      The downside of this optimization is that if you call poll() and you
      get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
      may be woken by a different thread before read(ufd) comes
      around. In short, this means that poll() isn't really usable if the
      userfaultfd is opened in blocking mode.
      
      Userfaults won't wait in "pending" state to be read anymore, and any
      UFFDIO_WAKE or similar operation whose objective is to wake
      userfaults after their resolution will wake all blocked userfaults
      for the resolved range, including those that haven't been read() by
      userland yet.
      
      The behavior of poll() becomes non-standard, but this obviates the
      need for "spurious" UFFDIO_WAKE calls and lets the userland threads
      restart immediately without requiring a UFFDIO_WAKE. This is even
      more significant in case of repeated faults on the same address from
      multiple threads.
      
      This optimization is justified by the measurement that spurious
      UFFDIO_WAKE calls account for between 5% and 10% of the total
      userfaults for heavy workloads, so it's worth optimizing those away.
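
The POLLIN/-EAGAIN caveat above suggests a non-blocking read loop along these lines. This is a sketch under stated assumptions: the helper names set_nonblocking() and read_event() are hypothetical, the code works on any file descriptor, and nothing here is specific to the userfaultfd message format.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Put fd in non-blocking mode. Returns 0 on success or -errno. */
int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
		return -errno;
	return 0;
}

/*
 * Wait up to timeout_ms for one event of exactly len bytes.
 * Returns 1 if an event was read, 0 if there was nothing to read
 * (timeout, or POLLIN followed by EAGAIN because the fault was
 * already woken by another thread), or -errno on a real error.
 */
int read_event(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ready = poll(&pfd, 1, timeout_ms);
	ssize_t got;

	if (ready == -1)
		return errno == EINTR ? 0 : -errno;
	if (ready == 0 || !(pfd.revents & POLLIN))
		return 0;

	got = read(fd, buf, len);
	if (got == -1) {
		/* POLLIN is only a hint: treat EAGAIN as "nothing to do". */
		if (errno == EAGAIN || errno == EINTR)
			return 0;
		return -errno;
	}
	return (size_t)got == len ? 1 : -EIO;
}
```

A caller would simply loop on read_event() and go back to poll() whenever it returns 0, which is why the descriptor has to be non-blocking.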
    • userfaultfd: change the read API to return a uffd_msg · a18d6e1c
      Andrea Arcangeli authored
      I had requests to return the full address (not the page aligned one)
      to userland.
      
      It's not entirely clear how the page offset could be relevant because
      userfaults aren't like SIGBUS, which can sigjump to a different place
      and actually skip resolving the fault depending on a page
      offset. There's currently no real way to skip the fault, especially
      because after a UFFDIO_COPY|ZEROPAGE the fault is optimized to be
      retried within the kernel without having to return to userland first
      (not even self modifying code replacing the .text that touched the
      faulting address would prevent the fault from being repeated).
      Userland cannot skip repeating the fault, even more so if the fault
      was triggered by a KVM secondary page fault, by any get_user_pages or
      by any copy-user inside some syscall which will return to kernel code.
      The second time FAULT_FLAG_RETRY_NOWAIT won't be set, leading to a
      SIGBUS being raised because the userfault can't wait if it cannot
      release the mmap_sem first (and FAULT_FLAG_RETRY_NOWAIT is required
      for that).
      
      Still, returning a proper structure to userland during the read() on
      the uffd makes it possible to use the current UFFD_API for the future
      non-cooperative extensions too, and it looks cleaner as well. Once we
      get additional fields there's no point in returning the fault address
      page aligned anymore to reuse the bits below PAGE_SHIFT.
      
      The only downside is that the read() syscall will read 32 bytes
      instead of 8 bytes, but that's not going to be measurable overhead.
      
      The total number of new events, plus new future bits for already
      shipped events, is limited to 64 by the features field of the
      uffdio_api structure. If more are needed, a bump of UFFD_API will be
      required.
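
To make the 32-byte figure and the 64-bit features limit concrete, here is an illustrative layout. It is loosely modeled on the mainline struct uffd_msg and is not copied from this patch, whose exact fields may differ; sketch_uffd_msg, fault_page_base() and features_supported() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative message layout: 8 bytes of header plus 24 bytes of payload. */
struct sketch_uffd_msg {
	uint8_t  event;
	uint8_t  reserved1;
	uint16_t reserved2;
	uint32_t reserved3;
	union {
		struct {
			uint64_t flags;
			uint64_t address;	/* full, unaligned fault address */
		} pagefault;
		struct {
			uint64_t reserved1;
			uint64_t reserved2;
			uint64_t reserved3;
		} reserved;
	} arg;
};

static_assert(sizeof(struct sketch_uffd_msg) == 32,
	      "one read() returns a 32 byte message");

/* Round the reported address down to its page; page_size is a power of two. */
uint64_t fault_page_base(const struct sketch_uffd_msg *msg, uint64_t page_size)
{
	return msg->arg.pagefault.address & ~(page_size - 1);
}

/* True if every wanted feature bit is present in the 64-bit offered mask. */
bool features_supported(uint64_t offered, uint64_t wanted)
{
	return (wanted & ~offered) == 0;
}
```

Since the message carries the full address, a handler that still wants the old page-aligned value has to mask it itself, as fault_page_base() does.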
    • userfaultfd: Rename uffd_api.bits into .features · b9ca6f1f
      Pavel Emelyanov authored
      
      This is (or seems to be) the minimal change required to separate
      standard uffd usage from the non-cooperative one. Now more bits can
      be added to the features field indicating e.g. UFFD_FEATURE_FORK and
      others needed for the latter use-case.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
    • userfaultfd: add new syscall to provide memory externalization · 2f73ffa8
      Andrea Arcangeli authored
      Once a userfaultfd has been created and certain regions of the
      process virtual address space have been registered into it, the thread
      responsible for doing the memory externalization can manage the page
      faults in userland by talking to the kernel using the userfaultfd
      protocol.
      
      poll() can be used to know when there are new pending userfaults to be
      read (POLLIN).