    mm: introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud · 6857be5f
    Peter Xu authored
    Patch series "mm: Support huge pfnmaps", v2.
    
    Overview
    ========
    
    This series implements huge pfnmap support for mm in general.  Huge
    pfnmap allows e.g. VM_PFNMAP vmas to be mapped at either the PMD or PUD
    level, similar to what we already do with dax / thp / hugetlb to benefit
    from TLB hits.  Now we extend that idea to PFN mappings, e.g. PCI MMIO
    BARs, which can grow as large as 8GB or even bigger.
    
    Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
    patch (from Alex Williamson) will be the first user of huge pfnmap,
    enabling the vfio-pci driver to fault in huge pfn mappings.
    
    Implementation
    ==============
    
    In reality, it's relatively simple to add such support compared to many
    other types of mappings, because PFNMAP is special in that there's no
    vmemmap backing it, so most of the kernel routines on huge mappings
    should simply already fail for them, like GUPs or old-school follow_page()
    (which was recently rewritten as the folio_walk* APIs by David).
    
    One trick here is that PUD handling is still immature in generic paths
    here and there, as DAX is so far the only user.  This patchset adds the
    2nd user.  Hugetlb can be a 3rd user if the hugetlb unification work goes
    smoothly, but that is to be discussed later.
    
    The other trick is how to allow gup-fast to work for such huge mappings
    even though there's no direct way of knowing whether it's a normal page or
    an MMIO mapping.  This series chose to keep the pte_special solution: it
    reuses the same idea by setting a special bit in pfnmap PMDs/PUDs so
    that gup-fast is able to identify them and fail properly.
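
    As a rough illustration of the special-bit idea, here is a userspace
    model (not kernel code).  The entry layout, the bit positions and every
    model_* name are assumptions made up for this sketch; real layouts are
    per-arch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical huge-entry layout, for illustration only. */
typedef struct { uint64_t val; } model_pmd_t;

#define MODEL_PMD_PRESENT	(UINT64_C(1) << 0)
#define MODEL_PMD_SPECIAL	(UINT64_C(1) << 9)	/* assumed software bit */

static inline bool model_pmd_special(model_pmd_t pmd)
{
	return (pmd.val & MODEL_PMD_SPECIAL) != 0;
}

static inline model_pmd_t model_pmd_mkspecial(model_pmd_t pmd)
{
	pmd.val |= MODEL_PMD_SPECIAL;
	return pmd;
}

/*
 * Model of the lockless fast path: returns true if the huge entry may be
 * pinned, false if the walker has to bail out.
 */
bool model_gup_fast_pmd(model_pmd_t pmd)
{
	if (!(pmd.val & MODEL_PMD_PRESENT))
		return false;
	if (model_pmd_special(pmd))
		return false;	/* pfnmap: no struct page behind it */
	return true;
}

/* Usage: a normal huge entry is accepted, a pfnmap one is refused. */
void model_gup_fast_demo(void)
{
	model_pmd_t normal = { MODEL_PMD_PRESENT };
	model_pmd_t pfnmap = model_pmd_mkspecial(normal);

	assert(model_gup_fast_pmd(normal));
	assert(!model_gup_fast_pmd(pfnmap));
}
```

    The point of the model is only that the check is a single bit test on
    the entry itself, so the fast path needs no vma lookup to reject MMIO.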
    
    Along the way, we'll also notice that the major pgtable pfn walker, aka
    follow_pte(), will need to retire soon because it only works with ptes.
    A new set of simple APIs is introduced (the follow_pfnmap* API) that can
    do whatever follow_pte() can already do, and can also process huge
    pfnmaps.  Half of this series is about that and about converting all
    existing pfnmap walkers to use the new API properly.  Hopefully the new
    API also looks better in that it avoids exposing e.g. pgtable lock
    details to the callers, so that it can be used in an even more
    straightforward way.
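
    The start/end calling pattern described above can be modelled in
    userspace roughly as follows.  This is not the follow_pfnmap* API
    itself: the struct, the model_* functions and the single fake 2M
    mapping are all hypothetical stand-ins, used only to show a walker that
    hands back a pfn while keeping the lock private.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page table lock. */
struct model_lock { bool held; };

/* Hypothetical argument block: input first, outputs filled on success. */
struct model_pfnmap_args {
	unsigned long address;		/* input */
	unsigned long pfn;		/* output, valid between start/end */
	bool writable;			/* output, valid between start/end */
	struct model_lock *lock;	/* private, callers never touch it */
};

struct model_lock model_table_lock;

/* One fake huge mapping: 2M at VA 0x200000, starting at pfn 0x80000. */
int model_pfnmap_start(struct model_pfnmap_args *args)
{
	const unsigned long base = 0x200000UL, size = 0x200000UL;
	const unsigned long base_pfn = 0x80000UL;

	if (args->address < base || args->address >= base + size)
		return -1;

	model_table_lock.held = true;
	args->lock = &model_table_lock;
	/* pfn of the huge entry plus the page offset inside it */
	args->pfn = base_pfn + ((args->address - base) >> 12);
	args->writable = true;
	return 0;
}

void model_pfnmap_end(struct model_pfnmap_args *args)
{
	if (args->lock) {
		args->lock->held = false;
		args->lock = NULL;
	}
}

/* Caller pattern: start, copy out what is needed, end. */
int model_lookup_pfn(unsigned long address, unsigned long *pfn)
{
	struct model_pfnmap_args args = { .address = address };

	if (model_pfnmap_start(&args))
		return -1;
	*pfn = args.pfn;
	model_pfnmap_end(&args);
	return 0;
}
```

    In this model the caller never sees the lock, and the outputs are
    copied out before the end call, after which they are no longer
    considered stable.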
    
    Three new options are introduced for huge pfnmap:
    
      - ARCH_SUPPORTS_HUGE_PFNMAP
    
        Arch developers will need to select this option in the arch's Kconfig
        when huge pfnmap is supported.  After this patchset is applied, both
        x86_64 and arm64 will enable it by default.
    
      - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
    
        These options are for driver developers to identify whether the
        current arch / config supports huge pfnmaps, and so decide whether
        they can use the huge pfnmap APIs to inject them.  One can refer to
        the last vfio-pci patch from Alex for how to use them properly in a
        device driver.
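
    A minimal sketch of how a driver might key off these options, assuming
    4K base pages with 2M PMDs and 1G PUDs.  The MODEL_* shifts and the
    function name are hypothetical; only the two CONFIG_ symbols come from
    this series.

```c
#include <assert.h>

/* Assumed geometry: 4K pages, 2M PMD, 1G PUD. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_PMD_SHIFT		21
#define MODEL_PUD_SHIFT		30

/* Largest fault order a driver could attempt under the current config. */
unsigned int model_max_pfnmap_order(void)
{
#if defined(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP)
	return MODEL_PUD_SHIFT - MODEL_PAGE_SHIFT;	/* order 18, 1G */
#elif defined(CONFIG_ARCH_SUPPORTS_PMD_PFNMAP)
	return MODEL_PMD_SHIFT - MODEL_PAGE_SHIFT;	/* order 9, 2M */
#else
	return 0;					/* PTEs only */
#endif
}
```

    Built outside the kernel with neither symbol defined, the function
    returns 0, i.e. the driver falls back to PTE-sized mappings.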
    
    So after the whole set is applied, and if one enables some dynamic debug
    lines in the vfio-pci core files, we should observe things like:
    
      vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
      vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
      vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
    
    In this specific case, it says that vfio-pci faults in PMDs properly for a
    few BAR0 offsets.
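
    To relate the numbers in the log: assuming 4K base pages, order 9 is
    512 (0x200) pages, i.e. 2M, which is why consecutive faults land at
    page offsets 0x0, 0x200 and 0x400.  The helper names below are
    hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT	12	/* assumes 4K base pages */

/* Number of base pages covered by a mapping of the given order. */
static inline uint64_t model_order_to_pages(unsigned int order)
{
	return UINT64_C(1) << order;
}

/* Size in bytes of a mapping of the given order. */
static inline uint64_t model_order_to_bytes(unsigned int order)
{
	return model_order_to_pages(order) << MODEL_PAGE_SHIFT;
}

/* Usage: order 9 matches the 0x200-page stride seen in the log. */
void model_order_demo(void)
{
	assert(model_order_to_pages(9) == 0x200);
	assert(model_order_to_bytes(9) == UINT64_C(2) * 1024 * 1024);
}
```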
    
    Patch Layout
    ============
    
    Patch 1:         Introduce the new options mentioned above for huge PFNMAPs
    Patch 2:         A tiny cleanup
    Patch 3-8:       Preparation patches for huge pfnmap (including
                     introducing the special bit for pmd/pud)
    Patch 9-16:      Introduce follow_pfnmap*() API, use it everywhere, and
                     then drop follow_pte() API
    Patch 17:        Add huge pfnmap support for x86_64
    Patch 18:        Add huge pfnmap support for arm64
    Patch 19:        Add vfio-pci support for all kinds of huge pfnmaps (Alex)
    
    TODO
    ====
    
    More architectures / More page sizes
    ------------------------------------
    
    Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems
    to be a plan to support arm64 1G later on top of this series [2].
    
    Any arch will need to first support THP / THP_1G, then provide a special
    bit in pmds/puds to support huge pfnmaps.
    
    remap_pfn_range() support
    -------------------------
    
    Currently, remap_pfn_range() still only maps PTEs.  With the new option,
    remap_pfn_range() can logically start to inject either PMDs or PUDs when
    the alignment requirements match on the VAs.
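
    The alignment requirement can be sketched as below.  This is an
    assumption about what such a check would look like, not code from
    remap_pfn_range(); the function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A range can use a huge entry of size (1 << shift) only if the virtual
 * address and the physical address are both aligned to that size and the
 * range is at least that long.
 */
bool model_can_map_huge(uint64_t vaddr, uint64_t paddr, uint64_t len,
			unsigned int shift)
{
	uint64_t size = UINT64_C(1) << shift;
	uint64_t mask = size - 1;

	return len >= size && !(vaddr & mask) && !(paddr & mask);
}

/* Usage: 2M-aligned VA and PA qualify, a 4K-skewed PA does not. */
void model_align_demo(void)
{
	assert(model_can_map_huge(0x200000, 0x80000000, 0x200000, 21));
	assert(!model_can_map_huge(0x200000, 0x80001000, 0x200000, 21));
}
```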
    
    When the support is there, it should silently benefit all drivers that
    use remap_pfn_range() in their mmap() handlers, with a better TLB hit
    rate and overall faster MMIO accesses, similar to how the processor
    benefits from hugepages.
    
    More driver support
    -------------------
    
    VFIO is so far the only consumer of huge pfnmaps after this series is
    applied.  Besides the generic remap_pfn_range() optimization above, a
    device driver can also try to optimize its mmap() for a better VA
    alignment for either PMD or PUD sizes.  This may, iiuc, normally require
    userspace changes, as the driver doesn't normally decide the VA at which
    a BAR is mapped.  But I don't think I know all the drivers well enough to
    know the full picture.
    
    All credit goes to Alex for help testing the GPU/NIC use cases above.
    
    [0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
    [1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
    [2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com
    
    
    This patch (of 19):
    
    This patch introduces the option to add a special bit, like the pte
    special bit, to pmds/puds.  Archs can start to define pmd_special /
    pud_special when supported by selecting the new option.  Per-arch support
    will be added later.
    
    Before that, create fallbacks for these helpers so that they are always
    available.
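
    A sketch of what such fallbacks could look like, written as a
    standalone model.  The model_pmd_t / model_pud_t types are hypothetical
    stand-ins for the kernel's entry types, and the *_mkspecial() helpers
    are an assumption; the text above only names pmd_special /
    pud_special.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's pmd_t / pud_t. */
typedef struct { unsigned long val; } model_pmd_t;
typedef struct { unsigned long val; } model_pud_t;

#ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
/* No arch support: a pmd is never special, and marking it is a no-op. */
static inline bool pmd_special(model_pmd_t pmd)
{
	(void)pmd;
	return false;
}

static inline model_pmd_t pmd_mkspecial(model_pmd_t pmd)
{
	return pmd;
}
#endif

#ifndef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
/* Same fallback at the pud level. */
static inline bool pud_special(model_pud_t pud)
{
	(void)pud;
	return false;
}

static inline model_pud_t pud_mkspecial(model_pud_t pud)
{
	return pud;
}
#endif
```

    With fallbacks like these, generic code can call the helpers
    unconditionally, and on an arch without support the checks always
    evaluate to false.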
    
    Link: https://lkml.kernel.org/r/20240826204353.2228736-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20240826204353.2228736-2-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Gavin Shan <gshan@redhat.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Niklas Schnelle <schnelle@linux.ibm.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>