Commit f3c64eda authored by Linus Torvalds's avatar Linus Torvalds

mm: avoid early COW write protect games during fork()

In commit 70e806e4 ("mm: Do early cow for pinned pages during fork()
for ptes") we write-protected the PTE before doing the page pinning
check, in order to avoid a race with concurrent fast-GUP pinning (which
doesn't take the mm semaphore or the page table lock).

That trick doesn't actually work - it doesn't handle memory ordering
properly, and doing so would be prohibitively expensive.

It also isn't really needed.  While we're moving in the direction of
allowing and supporting page pinning without marking the pinned area
with MADV_DONTFORK, the fact is that we've never really supported this
kind of odd "concurrent fork() and page pinning", and doing the
serialization on a pte level is just wrong.

We can add serialization with a per-mm sequence counter, so we know how
to solve that race properly, but we'll do that at a more appropriate
time.  Right now this just removes the write protect games.

It also turns out that the write protect games actually break on Power,
as reported by Aneesh Kumar:

 "Architecture like ppc64 expects set_pte_at to be not used for updating
  a valid pte. This is further explained in commit 56eecdb9 ("mm:
  Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit")"

and the code triggered a warning there:

  WARNING: CPU: 0 PID: 30613 at arch/powerpc/mm/pgtable.c:185 set_pte_at+0x2a8/0x3a0 arch/powerpc/mm/pgtable.c:185
  Call Trace:
    copy_present_page mm/memory.c:857 [inline]
    copy_present_pte mm/memory.c:899 [inline]
    copy_pte_range mm/memory.c:1014 [inline]
    copy_pmd_range mm/memory.c:1092 [inline]
    copy_pud_range mm/memory.c:1127 [inline]
    copy_p4d_range mm/memory.c:1150 [inline]
    copy_page_range+0x1f6c/0x2cc0 mm/memory.c:1212
    dup_mmap kernel/fork.c:592 [inline]
    dup_mm+0x77c/0xab0 kernel/fork.c:1355
    copy_mm kernel/fork.c:1411 [inline]
    copy_process+0x1f00/0x2740 kernel/fork.c:2070
    _do_fork+0xc4/0x10b0 kernel/fork.c:2429

Link: https://lore.kernel.org/lkml/CAHk-=wiWr+gO0Ro4LvnJBMs90OiePNyrE3E+pJvc9PzdBShdmw@mail.gmail.com/
Link: https://lore.kernel.org/linuxppc-dev/20201008092541.398079-1-aneesh.kumar@linux.ibm.com/Reported-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: default avatarLeon Romanovsky <leonro@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
parent c85fb28b
......@@ -806,8 +806,6 @@ copy_present_page(struct mm_struct *dst_mm, struct mm_struct *src_mm,
return 1;
/*
* The trick starts.
*
* What we want to do is to check whether this page may
* have been pinned by the parent process. If so,
* instead of wrprotect the pte on both sides, we copy
......@@ -815,47 +813,16 @@ copy_present_page(struct mm_struct *dst_mm, struct mm_struct *src_mm,
* the pinned page won't be randomly replaced in the
* future.
*
* To achieve this, we do the following:
*
* 1. Write-protect the pte if it's writable. This is
* to protect concurrent write fast-gup with
* FOLL_PIN, so that we'll fail the fast-gup with
* the write bit removed.
*
* 2. Check page_maybe_dma_pinned() to see whether this
* page may have been pinned.
*
* The order of these steps is important to serialize
* against the fast-gup code (gup_pte_range()) on the
* pte check and try_grab_compound_head(), so that
* we'll make sure either we'll capture that fast-gup
* so we'll copy the pinned page here, or we'll fail
* that fast-gup.
*
* NOTE! Even if we don't end up copying the page,
* we won't undo this wrprotect(), because the normal
* reference copy will need it anyway.
*/
if (pte_write(pte))
ptep_set_wrprotect(src_mm, addr, src_pte);
/*
* These are the "normally we can just copy by reference"
* checks.
* The page pinning checks are just "has this mm ever
* seen pinning", along with the (inexact) check of
* the page count. That might give false positives for
* for pinning, but it will work correctly.
*/
if (likely(!atomic_read(&src_mm->has_pinned)))
return 1;
if (likely(!page_maybe_dma_pinned(page)))
return 1;
/*
* Uhhuh. It looks like the page might be a pinned page,
* and we actually need to copy it. Now we can set the
* source pte back to being writable.
*/
if (pte_write(pte))
set_pte_at(src_mm, addr, src_pte, pte);
new_page = *prealloc;
if (!new_page)
return -EAGAIN;
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment