• Hugh Dickins's avatar
    mm: rmap use pte lock not mmap_sem to set PageMlocked · b87537d9
    Hugh Dickins authored
    KernelThreadSanitizer (ktsan) has shown that the down_read_trylock() of
    mmap_sem in try_to_unmap_one() (when going to set PageMlocked on a page
    found mapped in a VM_LOCKED vma) is ineffective against races with
    exit_mmap()'s munlock_vma_pages_all(), because mmap_sem is not held when
    tearing down an mm.
    
    But that's okay, those races are benign; and although we've believed for
    years in that ugly down_read_trylock(), it's unsuitable for the job, and
    frustrates the good intention of setting PageMlocked when it fails.
    
    It just doesn't matter if here we read vm_flags an instant before or after
    a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED: the
    syscalls (or exit) work their way up the address space (taking pt locks
    after updating vm_flags) to establish the final state.
    
    We do still need to be careful never to mark a page Mlocked (hence
    unevictable) by any race that will not be corrected shortly after.  The
    page lock protects from many of the races, but not all (a page is not
    necessarily locked when it's unmapped).  But the pte lock we just dropped
    is good to cover the rest (and serializes even with
    munlock_vma_pages_all(), so no special barriers required): now hold on to
    the pte lock while calling mlock_vma_page().  Is that lock ordering safe?
    Yes, that's how follow_page_pte() calls it, and how page_remove_rmap()
    calls the complementary clear_page_mlock().
    
    This fixes the following case (though not a case which anyone has
    complained of), which mmap_sem did not: truncation's preliminary
    unmap_mapping_range() is supposed to remove even the anonymous COWs of
    filecache pages, and that might race with try_to_unmap_one() on a
    VM_LOCKED vma, so that mlock_vma_page() sets PageMlocked just after
    zap_pte_range() unmaps the page, causing "Bad page state (mlocked)" when
    freed.  The pte lock protects against this.
    
    You could say that it also protects against the more ordinary case, racing
    with the preliminary unmapping of a filecache page itself: but in our
    current tree, that's independently protected by i_mmap_rwsem; and that
    race would be why "Bad page state (mlocked)" was seen before commit
    48ec833b ("Revert mm/memory.c: share the i_mmap_rwsem").
    
    Vlastimil Babka points out another race which this patch protects against.
     try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a
    moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked:
    leaving PageMlocked and unevictable when it should be evictable.  mmap_sem
    is ineffective because exit_mmap() does not hold it; page lock ineffective
    because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock
    is effective because __munlock_pagevec_fill() takes it to get the page,
    after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one.
    
    Kirill Shutemov points out that if the compiler chooses to implement a
    "vma->vm_flags &= VM_WHATEVER" or "vma->vm_flags |= VM_WHATEVER" operation
    with an intermediate store of unrelated bits set, since I'm here foregoing
    its usual protection by mmap_sem, try_to_unmap_one() might catch sight of
    a spurious VM_LOCKED in vm_flags, and make the wrong decision.  This does
    not appear to be an immediate problem, but we may want to define vm_flags
    accessors in future, to guard against such a possibility.
    
    While we're here, make a related optimization in try_to_munmap_one(): if
    it's doing TTU_MUNLOCK, then there's no point at all in descending the
    page tables and getting the pt lock, unless the vma is VM_LOCKED.  Yes,
    that can change racily, but it can change racily even without the
    optimization: it's not critical.  Far better not to waste time here.
    
    Stopped short of separating try_to_munlock_one() from try_to_munmap_one()
    on this occasion, but that's probably the sensible next step - with a
    rename, given that try_to_munlock()'s business is to try to set Mlocked.
    
    Updated the unevictable-lru Documentation, to remove its reference to mmap
    semaphore, but found a few more updates needed in just that area.
    Signed-off-by: default avatarHugh Dickins <hughd@google.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Sasha Levin <sasha.levin@oracle.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    b87537d9
rmap.c 46.9 KB