• David Hildenbrand's avatar
    mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed · 322842ea
    David Hildenbrand authored
    Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v4.
    
    This series is the result of the discussion on the previous approach [2]. 
    More information on the general COW issues can be found there.  It is
    based on latest linus/master (post v5.17, with relevant core-MM changes
    for v5.18-rc1).
    
    This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
    on an anonymous page and COW logic fails to detect exclusivity of the page
    to then replacing the anonymous page by a copy in the page table: The GUP
    pin lost synchronicity with the pages mapped into the page tables.
    
    This issue, including other related COW issues, has been summarized in [3]
    under 3):
    "
      3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
    
      page_maybe_dma_pinned() is used to check if a page may be pinned for
      DMA (using FOLL_PIN instead of FOLL_GET).  While false positives are
      tolerable, false negatives are problematic: pages that are pinned for
      DMA must not be added to the swapcache.  If it happens, the (now pinned)
      page could be faulted back from the swapcache into page tables
      read-only.  Future write-access would detect the pinning and COW the
      page, losing synchronicity.  For the interested reader, this is nicely
      documented in feb889fb ("mm: don't put pinned pages into the swap
      cache").
    
      Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
      cases and can result in a violation of the documented semantics: giving
      false negatives because of the race.
    
      There are cases where we call it without properly taking a per-process
      sequence lock, turning the usage of page_maybe_dma_pinned() racy.  While
      one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
      handle, there is especially one rmap case (shrink_page_list) that's hard
      to fix: in the rmap world, we're not limited to a single process.
    
      The shrink_page_list() issue is really subtle.  If we race with
      someone pinning a page, we can trigger the same issue as in the FOLL_GET
      case.  See the detail section at the end of this mail on a discussion
      how bad this can bite us with VFIO or other FOLL_PIN user.
    
      It's harder to reproduce, but I managed to modify the O_DIRECT
      reproducer to use io_uring fixed buffers [15] instead, which ends up
      using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
      similarly trigger a loss of synchronicity and consequently a memory
      corruption.
    
      Again, the root issue is that a write-fault on a page that has
      additional references results in a COW and thereby a loss of
      synchronicity and consequently a memory corruption if two parties
      believe they are referencing the same page.
    "
    
    This series makes GUP pins (R/O and R/W) on anonymous pages fully
    reliable, especially also taking care of concurrent pinning via GUP-fast,
    for example, also fully fixing an issue reported regarding NUMA balancing
    [4] recently.  While doing that, it further reduces "unnecessary COWs",
    especially when we don't fork()/KSM and don't swapout, and fixes the COW
    security for hugetlb for FOLL_PIN.
    
    In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
    anonymous page is exclusive.  Exclusive anonymous pages that are mapped
    R/O can directly be mapped R/W by the COW logic in the write fault
    handler.  Exclusive anonymous pages that want to be shared (fork(), KSM)
    first have to be marked shared -- which will fail if there are GUP pins on
    the page.  GUP is only allowed to take a pin on anonymous pages that are
    exclusive.  The PT lock is the primary mechanism to synchronize
    modifications of PG_anon_exclusive.  We synchronize against GUP-fast
    either via the src_mm->write_protect_seq (during fork()) or via
    clear/invalidate+flush of the relevant page table entry.
    
    Special care has to be taken about swap, migration, and THPs (whereby a
    PMD-mapping can be converted to a PTE mapping and we have to track
    information for subpages).  Besides these, we let the rmap code handle
    most magic.  For reliable R/O pins of anonymous pages, we need
    FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however,
    it's now 100% mapcount free and I further simplified it a bit.
    
      #1 is a fix
      #3-#10 are mostly rmap preparations for PG_anon_exclusive handling
      #11 introduces PG_anon_exclusive
      #12 uses PG_anon_exclusive and make R/W pins of anonymous pages
       reliable
      #13 is a preparation for reliable R/O pins
      #14 and #15 is reused/modified GUP-triggered unsharing for R/O GUP pins
       make R/O pins of anonymous pages reliable
      #16 adds sanity check when (un)pinning anonymous pages
    
    [1] https://lkml.kernel.org/r/20220131162940.210846-1-david@redhat.com
    [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
    [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
    [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616
    
    
    This patch (of 17):
    
    In case arch_unmap_one() fails, we already did a swap_duplicate().  let's
    undo that properly via swap_free().
    
    Link: https://lkml.kernel.org/r/20220428083441.37290-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20220428083441.37290-2-david@redhat.com
    Fixes: ca827d55
    
     ("mm, swap: Add infrastructure for saving page metadata on swap")
    Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
    Reviewed-by: default avatarKhalid Aziz <khalid.aziz@oracle.com>
    Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Jann Horn <jannh@google.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    322842ea
rmap.c 69.4 KB