    mm: use pmdp_get_lockless() without surplus barrier() · 26e1a0c3
    Hugh Dickins authored
    Patch series "mm: allow pte_offset_map[_lock]() to fail", v2.
    
    What is it all about?  Some mmap_lock avoidance i.e.  latency reduction. 
    Initially just for the case of collapsing shmem or file pages to THPs; but
    likely to be relied upon later in other contexts e.g.  freeing of empty
    page tables (but that's not work I'm doing).  mmap_write_lock avoidance
    when collapsing to anon THPs?  Perhaps, but again that's not work I've
    done: a quick attempt was not as easy as the shmem/file case.
    
    I would much prefer not to have to make these small but wide-ranging
    changes for such a niche case; but failed to find another way, and have
    heard that shmem MADV_COLLAPSE's usefulness is being limited by that
    mmap_write_lock it currently requires.
    
    These changes (though of course not these exact patches) have been in
    Google's data centre kernel for three years now: we do rely upon them.
    
    What is this preparatory series about?
    
    The current mmap locking will not be enough to guard against that tricky
    transition between pmd entry pointing to page table, and empty pmd entry,
    and pmd entry pointing to huge page: pte_offset_map() will have to
    validate the pmd entry for itself, returning NULL if no page table is
    there.  What to do about that varies: sometimes nearby error handling
    indicates just to skip it; but in many cases an ACTION_AGAIN or "goto
    again" is appropriate (and if that risks an infinite loop, then there must
    have been an oops, or pfn 0 mistaken for page table, before).
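
    The two kinds of handling described above (skip, or "goto again") can
    be sketched in a small userspace model.  This is only an illustration
    under stated assumptions: struct model_pmd, model_pte_offset_map() and
    the two callers are hypothetical names invented for the sketch, not
    kernel APIs, and the real pte_offset_map() does considerably more.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum model_pmd_kind { MODEL_PMD_NONE, MODEL_PMD_TABLE, MODEL_PMD_HUGE };

struct model_pmd {
    enum model_pmd_kind kind;
    int *table;                 /* stands in for the page table */
};

/*
 * Models the behaviour described for pte_offset_map(): validate the pmd
 * entry for itself and return NULL if no page table is there.
 */
static int *model_pte_offset_map(const struct model_pmd *pmd, size_t index)
{
    struct model_pmd snap = *pmd;   /* work from one snapshot of the entry */

    if (snap.kind != MODEL_PMD_TABLE || snap.table == NULL)
        return NULL;
    return &snap.table[index];
}

/* "Just skip it": report failure and let the caller move on. */
static int model_read_or_skip(const struct model_pmd *pmd, size_t index,
                              int *out)
{
    int *pte = model_pte_offset_map(pmd, index);

    if (!pte)
        return -1;
    *out = *pte;
    return 0;
}

/* "goto again": fix up the state that caused the failure, then retry. */
static int model_write_with_retry(struct model_pmd *pmd, int *backing,
                                  size_t index, int value)
{
    int *pte;

again:
    pte = model_pte_offset_map(pmd, index);
    if (!pte) {
        if (pmd->kind == MODEL_PMD_HUGE)
            return -1;          /* huge mapping: not handled in this model */
        /* Empty entry: install a table, then retry the lookup. */
        pmd->kind = MODEL_PMD_TABLE;
        pmd->table = backing;
        goto again;
    }
    *pte = value;
    return 0;
}
```

    In this model the retry loop terminates because each pass either
    installs a table or returns; the kernel call sites need the same kind
    of argument for why "again" cannot spin forever.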
    
    Given the likely extension to freeing empty page tables, I have not
    limited this set of changes to a THP config; and it has been easier, and
    sets a better example, if each site is given appropriate handling: even
    where deeper study might prove that failure could only happen if the pmd
    table were corrupted.
    
    Several of the patches are, or include, cleanup on the way; and by the
    end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and
    pte_offset_map_lock() then handle those original races and more.  Most
    uses of pte_lockptr() are deprecated, with pte_offset_map_nolock() taking
    its place.
    
    
    This patch (of 32):
    
    Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
    reliable result with PAE (or READ_ONCE as before without PAE); and remove
    the unnecessary extra barrier()s which got left behind in its callers.
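
    Why a plain READ_ONCE() is less reliable with PAE can be illustrated
    with a userspace sketch.  It assumes that a PAE entry is 64 bits wide
    but read as two 32-bit halves, and that the lockless getter reads the
    low half, then the high half, and retries if the low half has changed
    meanwhile.  struct model_pmd_pae and model_pmdp_get_lockless() are
    hypothetical names for this sketch; the kernel's own implementation
    lives in linux/pgtable.h and may differ in detail.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit entry stored as two 32-bit halves. */
struct model_pmd_pae {
    _Atomic uint32_t low;
    _Atomic uint32_t high;
};

/*
 * Read low, then high, then re-check low.  The acquire loads keep the
 * three reads in program order, so no separate compiler barrier is
 * needed at the call site.  A change to the high half alone is not
 * detected by this check.
 */
static uint64_t model_pmdp_get_lockless(struct model_pmd_pae *pmdp)
{
    uint32_t low, high;

    do {
        low = atomic_load_explicit(&pmdp->low, memory_order_acquire);
        high = atomic_load_explicit(&pmdp->high, memory_order_acquire);
    } while (low != atomic_load_explicit(&pmdp->low, memory_order_acquire));

    return ((uint64_t)high << 32) | low;
}
```

    Because the ordering is provided inside the getter in this model, a
    caller gains nothing from adding its own barrier() after the read.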
    
    HOWEVER: Note the small print in linux/pgtable.h, where it was designed
    specifically for fast GUP, and depends on interrupts being disabled for
    its full guarantee: most callers which have been added (here and before)
    do NOT have interrupts disabled, so there is still some need for caution.
    
    Link: https://lkml.kernel.org/r/f35279a9-9ac0-de22-d245-591afbfb4dc@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>