Commits · 093b3515817bfc96f810012810179ebc3a096528 · Kirill Smelkov / linux

14 May, 2015 9 commits

Andrea Arcangeli authored 10 years ago

This remap ioctl allows atomically moving a page into or out of a
userfaultfd address space. It's more expensive than "copy" (and of
course more expensive than "zerofill") as it requires a TLB flush on
the source range for each ioctl, which is an expensive operation on
SMP. Especially when moving only a few pages at a time, copying
without a TLB flush is faster.

093b3515

userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE · 729a7f4e

Andrea Arcangeli authored 10 years ago

These two ioctls allow either atomically copying pages or mapping
zeropages into the virtual address space. They are used by the thread
that opened the userfaultfd to resolve the userfaults.

729a7f4e

userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read · 5a2b3614

Andrea Arcangeli authored 10 years ago

Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.

Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.

Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API perspective. Furthermore, solving it in
the kernel allows removing a whole bunch of mutex and bitmap code from
qemu, making it faster. The cost of __get_user_pages_fast should be
insignificant, considering it scales perfectly and the pagetables are
already hot in the CPU cache, compared to the overhead in userland of
maintaining those structures.

Applying this patch is backwards compatible with respect to the
userfaultfd userland API; reverting it, however, would no longer be
backwards compatible.

Without this patch, qemu's background transfer thread has to read the
old state and do UFFDIO_WAKE if old_state is MISSING but has become
REQUESTED by the time it tries to set it to RECEIVED (signaling that
the other side received a userfault).

    vcpu                background_thr  userfault_thr
    -----               -----           -----
    vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

                                        poll() -> POLLIN
                                        read() -> 0x7fb76a139000
                                        postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED

                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
                        /* check that no userfault raced with UFFDIO_COPY */
                        if (old_state == MISSING && tmp_state == REQUESTED)
                                UFFDIO_WAKE from background thread

And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:

    vcpu                background_thr  userfault_thr
    -----               -----           -----
    vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

                                        poll() -> POLLIN
                                        read() -> 0x7fb76a139000

                                        if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
                                                UFFDIO_WAKE from userfault thread

This patch removes the need for both UFFDIO_WAKE and the associated
per-page tristate.
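
The userland bookkeeping traced in the two cases above can be sketched
in C11 as a compare-and-swap on a per-page state. This is only an
illustration under stated assumptions: `change_state()`,
`background_needs_wake()` and `userfault_needs_wake()` are hypothetical
names, and the return convention (the state the page is in after the
call) is inferred from the traces, not taken from qemu's code.

```c
#include <stdatomic.h>

/* Hypothetical per-page tristate, named after the states in the traces. */
enum page_state { MISSING, REQUESTED, RECEIVED };

/*
 * Hypothetical stand-in for postcopy_pmi_change_state(): try to move
 * *state from `expected` to `desired` and return the state the page is
 * in after the attempt (the new state on success, the unchanged current
 * state on failure).
 */
static enum page_state change_state(_Atomic int *state,
                                    enum page_state expected,
                                    enum page_state desired)
{
    int seen = expected;

    if (atomic_compare_exchange_strong(state, &seen, desired))
        return desired;
    /* On failure `seen` holds the value actually found in *state. */
    return (enum page_state)seen;
}

/*
 * Background thread, right after UFFDIO_COPY. `old_state` is what it
 * read before copying. Returns 1 if a userfault raced with the copy,
 * so the background thread has to issue UFFDIO_WAKE itself (first case).
 */
static int background_needs_wake(_Atomic int *state,
                                 enum page_state old_state)
{
    enum page_state tmp_state = change_state(state, old_state, RECEIVED);

    return old_state == MISSING && tmp_state == REQUESTED;
}

/*
 * Userfault thread, right after read() returned the faulting address.
 * Returns 1 if the page had already been placed, so the userfault
 * thread has to issue UFFDIO_WAKE (second case).
 */
static int userfault_needs_wake(_Atomic int *state)
{
    return change_state(state, MISSING, REQUESTED) == RECEIVED;
}
```

Whichever thread loses the compare-and-swap is the one that must send
the wakeup. That per-page state and the extra ioctl are exactly the
cost that solving the race in the kernel avoids.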

5a2b3614

userfaultfd: allocate the userfaultfd_ctx cacheline aligned · bd0a30cd

Andrea Arcangeli authored 10 years ago

Use proper slab to guarantee alignment.

bd0a30cd

userfaultfd: optimize read() and poll() to be O(1) · 18c5b6c4

Andrea Arcangeli authored 10 years ago

This makes read() O(1); poll(), which was already O(1), becomes
lockless.

18c5b6c4

userfaultfd: wake pending userfaults · a1837777

Andrea Arcangeli authored 10 years ago

This is an optimization but it's a userland visible one and it affects
the API.

The downside of this optimization is that if you call poll() and you
get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
may be woken by a different thread before read(ufd) comes around. In
short, this means that poll() isn't really usable if the userfaultfd
is opened in blocking mode.

Userfaults won't wait in "pending" state to be read anymore, and any
UFFDIO_WAKE or similar operation whose objective is to wake userfaults
after their resolution will wake all blocked userfaults for the
resolved range, including those that haven't been read() by userland
yet.

The behavior of poll() becomes non-standard, but this obviates the
need for "spurious" UFFDIO_WAKE and lets the userland threads restart
immediately without requiring a UFFDIO_WAKE. This is even more
significant in case of repeated faults on the same address from
multiple threads.

This optimization is justified by the measurement that spurious
UFFDIO_WAKE calls account for between 5% and 10% of the total
userfaults for heavy workloads, so it's worth optimizing those away.

a1837777

userfaultfd: change the read API to return a uffd_msg · a18d6e1c

Andrea Arcangeli authored 9 years ago

I had requests to return the full address (not the page aligned one)
to userland.

It's not entirely clear how the page offset could be relevant, because
userfaults aren't like SIGBUS, whose handler can sigjump to a
different place and actually skip resolving the fault depending on a
page offset. There's currently no real way to skip the fault,
especially because after a UFFDIO_COPY|ZEROPAGE the fault is optimized
to be retried within the kernel without having to return to userland
first (not even self-modifying code replacing the .text that touched
the faulting address would prevent the fault from being repeated).
Userland cannot skip repeating the fault, even more so if the fault
was triggered by a KVM secondary page fault or any get_user_pages or
any copy-user inside some syscall which will return to kernel code.
The second time FAULT_FLAG_RETRY_NOWAIT won't be set, leading to a
SIGBUS being raised because the userfault can't wait if it cannot
release the mmap_sem first (and FAULT_FLAG_RETRY_NOWAIT is required
for that).

Still, returning a proper structure to userland from read() on the
uffd allows the current UFFD_API to be used for the future
non-cooperative extensions too, and it looks cleaner as well. Once we
get additional fields there's no point in returning the fault address
page aligned anymore in order to reuse the bits below PAGE_SHIFT.

The only downside is that the read() syscall will read 32 bytes
instead of 8 bytes, but that's not going to be measurable overhead.

The total number of new events, or of new bits for already shipped
events, that can be added is limited to 64 by the features field of
the uffdio_api structure. If more are needed, a bump of UFFD_API will
be required.
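
To make the 32-byte figure concrete, here is a sketch of a message
layout that adds up to that size. It is modelled on how the upstream
`struct uffd_msg` looks as far as I know, but the struct name
`sketch_uffd_msg`, its field names and the `page_base()` helper are
assumptions made for illustration; the authoritative definition is the
one in the kernel's userfaultfd UAPI header.

```c
#include <stdint.h>

/* Illustrative layout only: 8 bytes of header plus a 24-byte union. */
struct sketch_uffd_msg {
    uint8_t  event;      /* which kind of event this message carries */
    uint8_t  reserved1;
    uint16_t reserved2;
    uint32_t reserved3;
    union {
        struct {
            uint64_t flags;    /* per-fault flags */
            uint64_t address;  /* full fault address, not page aligned */
        } pagefault;
        struct {
            /* room for future events without changing the read size */
            uint64_t reserved1;
            uint64_t reserved2;
            uint64_t reserved3;
        } reserved;
    } arg;
};

/* read() on the uffd transfers one message: 32 bytes instead of 8. */
_Static_assert(sizeof(struct sketch_uffd_msg) == 32,
               "message must be 32 bytes");

/*
 * Userland that still wants the page-aligned address can mask the
 * offset off itself. page_size must be a power of two.
 */
static uint64_t page_base(uint64_t address, uint64_t page_size)
{
    return address & ~(page_size - 1);
}
```

With a dedicated event byte and reserved space, new event types no
longer have to be squeezed into the bits below PAGE_SHIFT of the
address.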

a18d6e1c

userfaultfd: Rename uffd_api.bits into .features · b9ca6f1f

Pavel Emelyanov authored 9 years ago

This is (or seems to be) the minimal change required to unblock
standard uffd usage from the non-cooperative one. More bits can now
be added to the features field, indicating e.g. UFFD_FEATURE_FORK and
others needed for the latter use case.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

b9ca6f1f

userfaultfd: add new syscall to provide memory externalization · 2f73ffa8

Andrea Arcangeli authored 11 years ago

Once a userfaultfd has been created and certain regions of the process
virtual address space have been registered with it, the thread
responsible for doing the memory externalization can manage the page
faults in userland by talking to the kernel using the userfaultfd
protocol.

poll() can be used to know when there are new pending userfaults to be
read (POLLIN).

2f73ffa8