    mm: introduce PTE_MARKER swap entry · 679d1033
    Peter Xu authored
    Patch series "userfaultfd-wp: Support shmem and hugetlbfs", v8.
    
    
    Overview
    ========
    
    Userfaultfd-wp anonymous support was merged two years ago.  Quite a few
    applications have since started to leverage this capability, either to
    take snapshots of user-app memory or to do fully user-controlled swapping.
    
    This series tries to complete the feature for uffd-wp so as to cover all
    the RAM-based memory types.  So far uffd-wp is the only piece still
    missing when compared with the other features (uffd-missing & uffd-minor
    mode).
    
    One major reason to do so is that anonymous pages sometimes do not
    satisfy the needs of applications, and there are growing numbers of
    users of shmem and hugetlbfs, either for sharing purposes (e.g., sharing
    guest mem between hypervisor process and device emulation process, shmem
    local live migration for upgrades) or for performance on tlb hits.
    
    All this means that if a uffd-wp app wants to switch to any of these
    memory types, it'll stop working.  I think it's worthwhile to have the
    kernel cover all these aspects.
    
    This series chose to protect pages at the pte level, not the page level.
    
    One major reason is safety.  I have no idea how we could make it safe if
    any uffd-privileged app can wr-protect a page that any other application
    can use.  It would mean this app can potentially block any process for
    as long as it wants.
    
    The other reason is that it aligns very well with not only the anonymous
    uffd-wp solution, but also uffd as a whole.  For example, userfaultfd is
    implemented fundamentally based on VMAs.  We set flags on VMAs to show
    the status of uffd tracking.  A per-page based protection solution would
    cross that VMA-based foundation line, and it could simply be too far
    away already from what's called userfaultfd.
    
    PTE markers
    ===========
    
    The patchset is based on an idea called PTE markers.  It was discussed in
    one of the mm alignment sessions and proposed starting from v6; this is
    the 2nd version of it using the PTE marker idea.
    
    A PTE marker is a new type of swap entry that is only applicable to
    file-backed memory like shmem and hugetlbfs.  It's used to persist some
    pte-level information even if the original present ptes in the pgtable
    are zapped.
    
    Logically pte markers can store more than uffd-wp information, but so far
    only one bit is used for uffd-wp purpose.  When the pte marker is
    installed with uffd-wp bit set, it means this pte is wr-protected by uffd.
    
    It solves the problem where, e.g., ptes mapping file-backed memory get
    zapped for any reason (e.g.  thp split, or swapped out): we can still
    keep the wr-protect information in the ptes.  Then when the page fault
    triggers again, we'll know this pte is wr-protected so we can treat the
    pte the same as a normal uffd wr-protected pte.
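
    As an illustration only, the zap/fault lifecycle described above can be
    modelled in plain userspace C.  This is a toy sketch, not kernel code:
    the toy_* names and the three-state pte are assumptions made up for
    this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical names): a pte is none, present, or a marker. */
enum toy_state { TOY_NONE, TOY_PRESENT, TOY_MARKER };

struct toy_pte {
	enum toy_state state;
	bool uffd_wp;		/* wr-protect bit carried by the pte or marker */
};

/* Zap a present pte; leave a marker behind only if it was wr-protected. */
static inline void toy_zap(struct toy_pte *pte)
{
	if (pte->state == TOY_PRESENT && pte->uffd_wp) {
		pte->state = TOY_MARKER;	/* uffd_wp stays set */
	} else {
		pte->state = TOY_NONE;
		pte->uffd_wp = false;
	}
}

/* Fault the page back in; a marker hands its wr-protect bit to the new pte. */
static inline void toy_fault_in(struct toy_pte *pte)
{
	bool wp;

	if (pte->state == TOY_PRESENT)
		return;			/* nothing to do */

	wp = (pte->state == TOY_MARKER) && pte->uffd_wp;
	pte->state = TOY_PRESENT;
	pte->uffd_wp = wp;
}
```

    In this model a wr-protected pte keeps its protection across a zap
    followed by a fault, while an unprotected pte simply becomes none and
    later faults in unprotected.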
    
    The extra information is encoded into the swap entry, or swp_offset to be
    explicit, with the swp_type being PTE_MARKER.  So far uffd-wp only uses
    one bit of the swap entry; the remaining bits of swp_offset are still
    reserved for other purposes.
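
    A minimal userspace sketch of that encoding follows.  It is not the
    kernel implementation: the model_* names, the 5-bit type field, the
    type number 27 and the position of the uffd-wp bit are all assumptions
    chosen for illustration; only the 58-bit offset width is taken from
    the x86_64 figure quoted in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative layout only: [ 5-bit type | 58-bit offset ] in a 64-bit word. */
#define MODEL_SWP_OFFSET_BITS	58
#define MODEL_SWP_OFFSET_MASK	((UINT64_C(1) << MODEL_SWP_OFFSET_BITS) - 1)
#define MODEL_SWP_PTE_MARKER	UINT64_C(27)		/* assumed type number */
#define MODEL_MARKER_UFFD_WP	(UINT64_C(1) << 0)	/* assumed bit position */

typedef struct { uint64_t val; } model_swp_entry_t;

static inline model_swp_entry_t model_swp_entry(uint64_t type, uint64_t offset)
{
	model_swp_entry_t e = {
		(type << MODEL_SWP_OFFSET_BITS) | (offset & MODEL_SWP_OFFSET_MASK)
	};
	return e;
}

static inline uint64_t model_swp_type(model_swp_entry_t e)
{
	return e.val >> MODEL_SWP_OFFSET_BITS;
}

static inline uint64_t model_swp_offset(model_swp_entry_t e)
{
	return e.val & MODEL_SWP_OFFSET_MASK;
}

/* The marker bits live in the offset; the type says "this is a marker". */
static inline model_swp_entry_t model_make_pte_marker(uint64_t marker)
{
	return model_swp_entry(MODEL_SWP_PTE_MARKER, marker);
}

static inline bool model_is_pte_marker(model_swp_entry_t e)
{
	return model_swp_type(e) == MODEL_SWP_PTE_MARKER;
}

static inline uint64_t model_pte_marker_get(model_swp_entry_t e)
{
	return model_swp_offset(e);
}
```

    Note that no PFN or block offset is stored anywhere in the entry; the
    offset field carries nothing but the marker bits.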
    
    There're two configs to enable/disable PTE markers:
    
      CONFIG_PTE_MARKER
      CONFIG_PTE_MARKER_UFFD_WP
    
    We can set !PTE_MARKER to completely disable all the PTE markers, along
    with uffd-wp support.  I made two configs so we can also enable PTE
    markers for other purposes while disabling file-backed uffd-wp.  At the
    end of the current series, I'll enable CONFIG_PTE_MARKER by default, but
    that patch is standalone and if anyone worries about having it on by
    default, we can also consider turning it off by dropping that oneliner
    patch.  So far I don't see a huge risk of doing so, so I kept that patch.
    
    In most cases, PTE markers should be treated as none ptes.  This is
    because, unlike most of the other swap entry types, there's no PFN or
    block offset information encoded into PTE markers, only some extra
    well-defined bits showing the status of the pte.  These bits should only
    be used as extra data when servicing an upcoming page fault, and then we
    behave as if it's a none pte.
    
    I did spend a lot of time auditing all the pte_none() users this time.
    It is indeed a challenge because there are a lot of them, and I hope I
    didn't miss a single one where we should take care of pte markers.
    Luckily, I don't think they need to be considered in many cases, for
    example: boot code, arch code (especially non-x86), kernel-only page
    handling (e.g.  CPA), or device driver code when we're dealing with pure
    PFN mappings.
    
    I introduced pte_none_mostly() in this series for when we need to handle
    pte markers the same as none ptes; the "mostly" is another way to write
    "either none pte or a pte marker".
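
    The predicate can be sketched in userspace as below.  This is only a
    model of the semantics stated above, with made-up sketch_* names and a
    tagged struct standing in for a real pte; it does not show how the
    kernel helper is implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a pte: we only track which kind it is. */
enum sketch_pte_kind {
	SKETCH_PTE_NONE,	/* empty slot */
	SKETCH_PTE_PRESENT,	/* maps a page */
	SKETCH_PTE_SWAP,	/* ordinary swap entry */
	SKETCH_PTE_MARKER,	/* pte marker */
};

struct sketch_pte {
	enum sketch_pte_kind kind;
};

static inline bool sketch_pte_none(struct sketch_pte pte)
{
	return pte.kind == SKETCH_PTE_NONE;
}

static inline bool sketch_is_pte_marker(struct sketch_pte pte)
{
	return pte.kind == SKETCH_PTE_MARKER;
}

/* "Mostly none": either a none pte or a pte marker. */
static inline bool sketch_pte_none_mostly(struct sketch_pte pte)
{
	return sketch_pte_none(pte) || sketch_is_pte_marker(pte);
}
```

    Callers that must tell markers apart from none ptes keep using the
    strict check, which stays false for a marker in this model.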
    
    I didn't change pte_none() itself to cover pte markers, for the reasons
    below:
    
      - Very few pte_none() callers need to handle pte markers.  E.g., all
        the kernel pages do not require knowledge of pte markers.  So we don't
        pollute the major use cases.
    
      - Unconditionally changing pte_none() semantics could confuse people,
        because pte_none() has existed for such a long time.
    
      - Unconditionally changing pte_none() semantics could make pte_none()
        slower even if in many cases pte markers do not exist.
    
      - There are cases where we'd like to handle pte markers differently from
        pte_none(), so a full replacement is also impossible.  E.g. khugepaged
        should still treat pte markers as normal swap ptes rather than none
        ptes, because pte markers will always need a fault-in to merge the
        marker with a valid pte.  Or the smap code will need to parse PTE
        markers, not none ptes.
    
    Patch Layout
    ============
    
    Introducing PTE marker and uffd-wp bit in PTE marker:
    
      mm: Introduce PTE_MARKER swap entry
      mm: Teach core mm about pte markers
      mm: Check against orig_pte for finish_fault()
      mm/uffd: PTE_MARKER_UFFD_WP
    
    Adding support for shmem uffd-wp:
    
      mm/shmem: Take care of UFFDIO_COPY_MODE_WP
      mm/shmem: Handle uffd-wp special pte in page fault handler
      mm/shmem: Persist uffd-wp bit across zapping for file-backed
      mm/shmem: Allow uffd wr-protect none pte for file-backed mem
      mm/shmem: Allows file-back mem to be uffd wr-protected on thps
      mm/shmem: Handle uffd-wp during fork()
    
    Adding support for hugetlbfs uffd-wp:
    
      mm/hugetlb: Introduce huge pte version of uffd-wp helpers
      mm/hugetlb: Hook page faults for uffd write protection
      mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
      mm/hugetlb: Handle UFFDIO_WRITEPROTECT
      mm/hugetlb: Handle pte markers in page faults
      mm/hugetlb: Allow uffd wr-protect none ptes
      mm/hugetlb: Only drop uffd-wp special pte if required
      mm/hugetlb: Handle uffd-wp during fork()
    
    Misc handling in the rest of mm for file-backed uffd-wp:
    
      mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
      mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
    
    Enabling of uffd-wp on file-backed memory:
    
      mm/uffd: Enable write protection for shmem & hugetlbfs
      mm: Enable PTE markers by default
      selftests/uffd: Enable uffd-wp for shmem/hugetlbfs
    
    Tests
    =====
    
    - Compile test on x86_64 and aarch64 on different configs
    - Kernel selftests
    - uffd-test [0]
    - Umapsort [1,2] test for shmem/hugetlb, with swap on/off
    
    [0] https://github.com/xzpeter/clibs/tree/master/uffd-test
    [1] https://github.com/xzpeter/umap-apps/tree/peter
    [2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs
    
    
    This patch (of 23):
    
    Introduces a new swap entry type called PTE_MARKER.  It can be installed
    for any pte that maps file-backed memory when the pte is temporarily
    zapped, so as to maintain per-pte information.
    
    The information that is kept in the pte is called a "marker".  Here we
    define the marker as "unsigned long" just to match pgoff_t; however, it
    will only work if it still fits in swp_offset(), which is e.g.  currently
    58 bits on x86_64.
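
    The size constraint can be made concrete with a small userspace check.
    The demo_marker_fits() helper below is hypothetical and exists only
    for this illustration; the 58-bit width is the x86_64 figure mentioned
    above and will differ on other architectures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SWP_OFFSET_BITS	58u	/* x86_64 figure quoted above */

/* True if every set bit of the marker fits within offset_bits bits. */
static inline bool demo_marker_fits(uint64_t marker, unsigned int offset_bits)
{
	if (offset_bits >= 64)
		return true;	/* avoid an undefined full-width shift */
	return (marker >> offset_bits) == 0;
}
```

    A single-bit marker such as the uffd-wp one fits trivially; only a
    marker that sets a bit at or above the offset width would be lost when
    stored in the swap entry.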
    
    A new config CONFIG_PTE_MARKER is introduced too; it's off by default.  A
    bunch of helpers are defined together to service the rest of the pte
    marker code.
    
    [peterx@redhat.com: fixup]
      Link: https://lkml.kernel.org/r/Yk2rdB7SXZf+2BDF@xz-m1.local
    Link: https://lkml.kernel.org/r/20220405014646.13522-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20220405014646.13522-2-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>