    mm/khugepaged: retract_page_tables() without mmap or vma lock · 1d65b771
    Hugh Dickins authored
    Simplify shmem and file THP collapse's retract_page_tables(), and relax
    its locking: to improve its success rate and to lessen impact on others.
    
    Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
    target_mm, leave that part of the work to madvise_collapse() calling
    collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s result
    code to arrange for that.  That spares retract_page_tables() four
    arguments; and since it will be successful in retracting all of the page
    tables expected of it, no need to track and return a result code itself.
    
    It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
    but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
    allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
    THPs.  retract_page_tables() just needs to use those same spinlocks to
    exclude it briefly, while transitioning pmd from page table to none: so
    restore its use of pmd_lock() inside of which pte lock is nested.
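    The nesting described above can be sketched as follows (a simplified
    sketch modeled on the locking this commit describes, not the verbatim
    kernel source; mm accounting, TLB batching and error paths omitted):

```c
/* Sketch: how retract_page_tables() briefly excludes lockless walkers
 * such as page_vma_mapped_walk() while transitioning pmd to none.
 * Simplified illustration only -- not the actual mm/khugepaged.c code. */
static void retract_one(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr)
{
	struct mm_struct *mm = vma->vm_mm;
	spinlock_t *pml, *ptl;
	pmd_t pgt_pmd;

	pml = pmd_lock(mm, pmd);		/* outer: pmd lock */
	ptl = pte_lockptr(mm, pmd);
	if (ptl != pml)				/* inner: pte lock nested */
		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

	pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);	/* pmd -> none */
	pmdp_get_lockless_sync();	/* serialize PAE lockless readers */

	if (ptl != pml)
		spin_unlock(ptl);
	spin_unlock(pml);

	/* Lockless users may still hold the table under RCU: defer the
	 * free rather than releasing the page table immediately. */
	pte_free_defer(mm, pmd_pgtable(pgt_pmd));
}
```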
    
    Users of pte_offset_map_lock() etc all now allow for them to fail: so
    retract_page_tables() now has no use for mmap_write_trylock() or
    vma_try_start_write().  In common with rmap and page_vma_mapped_walk(), it
    does not even need the mmap_read_lock().
    
    But those users do expect the page table to remain a good page table,
    until they unlock and rcu_read_unlock(): so the page table cannot be freed
    immediately, but rather by the recently added pte_free_defer().
    
    Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
    when PAE, and pmdp_collapse_flush() did not already do so: to make sure
    that the start,pmdp_get_lockless(),end sequence in __pte_offset_map()
    cannot pick up a pmd entry with mismatched pmd_low and pmd_high.
    
    retract_page_tables() can be enhanced to replace_page_tables(), which
    inserts the final huge pmd without mmap lock: going through an invalid
    state instead of pmd_none() followed by fault.  But that enhancement does
    raise some more questions: leave it until a later release.
    
    Link: https://lkml.kernel.org/r/f88970d9-d347-9762-ae6d-da978e8a4df@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>