    mempolicy: mmap_lock is not needed while migrating folios · 72e315f7
    Hugh Dickins authored
    mbind(2) holds down_write of current task's mmap_lock throughout
    (exclusive because it needs to set the new mempolicy on the vmas);
    migrate_pages(2) holds down_read of pid's mmap_lock throughout.
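
    A minimal sketch of that pre-patch locking (the _sketch names are
    made up here for illustration; the lock calls are the real
    mmap_lock API):

        /* sketch only: mbind(2) must write vmas, migrate_pages(2) only reads */
        static long do_mbind_sketch(struct mm_struct *mm)
        {
                mmap_write_lock(mm);    /* exclusive: new mempolicy set on vmas */
                /* queue pages, apply policy, internal migrate_pages() */
                mmap_write_unlock(mm);
                return 0;
        }

        static long do_migrate_pages_sketch(struct mm_struct *mm)
        {
                mmap_read_lock(mm);     /* shared: vmas are walked, not changed */
                /* for each source node: migrate_to_node() */
                mmap_read_unlock(mm);
                return 0;
        }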
    
    They both hold mmap_lock across the internal migrate_pages(), under which
    all new page allocations (huge or small) are made.  I'm nervous about it;
    and migrate_pages() certainly does not need mmap_lock itself.  It's done
    this way for mbind(2), because its page allocator is vma_alloc_folio() or
    alloc_hugetlb_folio_vma(), both of which depend on vma and address.
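
    For illustration, a hedged sketch of the shape of that allocator as
    it might sit in mm/mempolicy.c (the gfp mask and vma walk details
    are assumptions here, not the verbatim new_page()):

        static struct folio *new_page(struct folio *src, unsigned long start)
        {
                struct vm_area_struct *vma;
                unsigned long address;
                VMA_ITERATOR(vmi, current->mm, start);

                /* find a vma still mapping src: this walk needs mmap_lock */
                for_each_vma(vmi, vma) {
                        address = page_address_in_vma(&src->page, vma);
                        if (address != -EFAULT)
                                break;
                }

                if (folio_test_hugetlb(src))
                        return alloc_hugetlb_folio_vma(folio_hstate(src),
                                                       vma, address);
                return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, folio_order(src),
                                       vma, address, folio_test_large(src));
        }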
    
    Now that we have alloc_pages_mpol(), depending on (refcounted) memory
    policy and interleave index, mbind(2) can be modified to use that or
    alloc_hugetlb_folio_nodemask(), and then not need mmap_lock across the
    internal migrate_pages() at all: add alloc_migration_target_by_mpol() to
    replace mbind's new_page().
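
    A hedged sketch of the replacement (the gfp choices and the ilx
    handling are illustrative; policy_nodemask() is assumed here to
    take the interleave index and hand back the preferred node, per the
    alloc_pages_mpol() series):

        static struct folio *alloc_migration_target_by_mpol(struct folio *src,
                                                            unsigned long private)
        {
                struct mempolicy *pol = (struct mempolicy *)private;
                pgoff_t ilx = 0;        /* interleave index, refined later */
                int nid = numa_node_id();
                nodemask_t *nodemask;
                struct page *page;
                gfp_t gfp;

                if (folio_test_hugetlb(src)) {
                        struct hstate *h = folio_hstate(src);

                        gfp = htlb_alloc_mask(h);
                        nodemask = policy_nodemask(gfp, pol, ilx, &nid);
                        return alloc_hugetlb_folio_nodemask(h, nid,
                                                            nodemask, gfp);
                }

                gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL;
                page = alloc_pages_mpol(gfp, folio_order(src), pol, ilx, nid);
                return page ? page_folio(page) : NULL;
        }

    Note no vma and no address: only the refcounted pol and ilx, so the
    caller need not hold mmap_lock.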
    
    (After that change, alloc_hugetlb_folio_vma() is used by nothing but a
    userfaultfd function: move it out of hugetlb.h and into the #ifdef.)
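
    A sketch of that placement in mm/hugetlb.c (the body shown is how
    the function has resolved vma and address to a node; treat the
    details as illustrative):

        #ifdef CONFIG_USERFAULTFD
        static struct folio *alloc_hugetlb_folio_vma(struct hstate *h,
                        struct vm_area_struct *vma, unsigned long address)
        {
                struct mempolicy *mpol;
                nodemask_t *nodemask;
                struct folio *folio;
                gfp_t gfp_mask = htlb_alloc_mask(h);
                int node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);

                folio = alloc_hugetlb_folio_nodemask(h, node, nodemask, gfp_mask);
                mpol_cond_put(mpol);
                return folio;
        }
        /* ... hugetlb_mfill_atomic_pte() and other userfaultfd code ... */
        #endif /* CONFIG_USERFAULTFD */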
    
    migrate_pages(2) has chosen its target node before migrating, so can
    continue to use the standard alloc_migration_target(); but let it take and
    drop mmap_lock just around migrate_to_node()'s queue_pages_range():
    neither the node-to-node calculations nor the page migrations need it.
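
    A hedged sketch of migrate_to_node() with the narrowed locking
    (close to, but not guaranteed to be, the exact kernel code):

        static long migrate_to_node(struct mm_struct *mm, int source, int dest,
                                    int flags)
        {
                nodemask_t nmask;
                struct vm_area_struct *vma;
                LIST_HEAD(pagelist);
                long err = 0;
                struct migration_target_control mtc = {
                        .nid = dest,
                        .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
                };

                nodes_clear(nmask);
                node_set(source, nmask);

                mmap_read_lock(mm);
                vma = find_vma(mm, 0);
                /* isolate the folios to migrate: only this walk needs the lock */
                queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask,
                                  flags | MPOL_MF_DISCONTIG_OK, &pagelist);
                mmap_read_unlock(mm);

                if (!list_empty(&pagelist)) {
                        /* node-to-node migration: no mmap_lock held here */
                        err = migrate_pages(&pagelist, alloc_migration_target,
                                            NULL, (unsigned long)&mtc,
                                            MIGRATE_SYNC, MR_SYSCALL, NULL);
                        if (err)
                                putback_movable_pages(&pagelist);
                }
                return err;
        }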
    
    It seems unlikely, but it is conceivable that some userspace depends on
    the kernel's mmap_lock exclusion here, instead of doing its own locking:
    more likely in a testsuite than in real life.  It is also possible, of
    course, that some pages on the list will be munmapped by another thread
    before they are migrated, or a newer memory policy applied to the range by
    that time: but such races could happen before, as soon as mmap_lock was
    dropped, so it does not appear to be a concern.
    
    Link: https://lkml.kernel.org/r/21e564e8-269f-6a89-7ee2-fd612831c289@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>