07 Apr, 2020 (40 commits)
    • userfaultfd: wp: add pmd_swp_*uffd_wp() helpers · 2e3d5dc5
      Peter Xu authored
      Add the missing helpers for uffd-wp operations with pmd
      swap/migration entries.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-10-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e3d5dc5
    • userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork · b569a176
      Peter Xu authored
      UFFD_EVENT_FORK support for uffd-wp should already be in place, except
      that we should clear the uffd-wp bit if the uffd fork event is not
      enabled.  Detect that to avoid _PAGE_UFFD_WP being set even if the VMA
      is not being tracked by VM_UFFD_WP.  Do this for both small PTEs and
      huge PMDs.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b569a176
    • userfaultfd: wp: apply _PAGE_UFFD_WP bit · 292924b2
      Peter Xu authored
      First, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
      change_protection() when used with uffd-wp, and make sure the two new
      flags are never used together.  Then,
      
        - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
          when a range of memory is write protected by uffd
      
        - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
          _PAGE_RW when write protection is resolved from userspace
      
      And use this new interface in mwriteprotect_range() to replace the old
      MM_CP_DIRTY_ACCT.
      
      Do this change for both PTEs and huge PMDs.  Then we can start to
      identify which PTE/PMD is write protected for general reasons (e.g.,
      COW or soft-dirty tracking), and which is write protected for
      userfaultfd-wp.
      
      Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
      into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
      can be even more strict when detecting uffd-wp page faults in either
      do_wp_page() or wp_huge_pmd().
      
      With _PAGE_UFFD_WP in place, a special case arises when a page is
      protected both by the general COW logic and by userfault-wp.  Here
      userfault-wp has higher priority and is handled first; only after the
      uffd-wp bit is cleared on the PTE/PMD will we continue to handle the
      general COW.  These are the steps for such a page:
      
        1. The CPU accesses a write protected shared page (protected by
           both general COW and uffd-wp) and is blocked by uffd-wp first,
           because do_wp_page() handles uffd-wp before general COW, so it
           has higher priority.
      
        2. The uffd service thread receives the request and performs
           UFFDIO_WRITEPROTECT to remove the uffd-wp bit from the PTE/PMD.
           However, the write bit is still kept cleared here.  Notify the
           blocked CPU.
      
        3. The blocked CPU resumes the page fault process with a fault
           retry.  During the retry it notices the uffd-wp bit is gone this
           time, but the page is still write protected by general COW, so
           it goes through the COW path in the fault handler, copies the
           page, applies the write bit where necessary, and retries again.
      
        4. The CPU will be able to access this page with write bit set.
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      292924b2
    • mm: merge parameters for change_protection() · 58705444
      Peter Xu authored
      change_protection() is used by both the NUMA and mprotect() code;
      there is one parameter for each caller (dirty_accountable and
      prot_numa).  Further, these parameters are passed along the call
      chain:
      
        - change_protection_range()
        - change_p4d_range()
        - change_pud_range()
        - change_pmd_range()
        - ...
      
      Now we introduce a flag argument for change_protection() and all these
      helpers to replace those parameters.  Then we can avoid passing
      multiple parameters multiple times along the way.
      
      More importantly, it'll greatly simplify the work if we want to introduce
      any new parameters to change_protection().  In the follow up patches, a
      new parameter for userfaultfd write protection will be introduced.
      
      No functional change at all.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58705444
    • userfaultfd: wp: add UFFDIO_COPY_MODE_WP · 72981e0e
      Andrea Arcangeli authored
      This allows UFFDIO_COPY to map pages write-protected.
      
      [peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
       around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
       commit messages]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72981e0e
    • userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers · 55adf4de
      Andrea Arcangeli authored
      Implement helper methods to invoke userfaultfd wp faults more
      selectively: not merely when a wp fault triggers on a vma with
      VM_UFFD_WP set in vma->vm_flags, but only if the _PAGE_UFFD_WP bit is
      also set in the page table.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-5-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55adf4de
    • userfaultfd: wp: add WP pagetable tracking to x86 · 5a281062
      Andrea Arcangeli authored
      Accurate userfaultfd WP tracking is possible by tracking exactly which
      virtual memory ranges were write-protected by userland.  We can't rely
      only on the RW bit of the mapped pagetable because that information is
      destroyed by fork(), KSM, or swap.  If we were to rely on that, we'd
      need to stay on the safe side and generate false positive wp faults
      for every swapped out page.
      
      [peterx@redhat.com: append _PAGE_UFFD_WP to _PAGE_CHG_MASK]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-4-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a281062
    • userfaultfd: wp: hook userfault handler to write protection fault · 529b930b
      Andrea Arcangeli authored
      There are several cases in which a write protection fault can happen:
      a write to a zero page, to a swapped page, or to a userfault
      write-protected page.  When the fault happens, there is no way to know
      whether userfault write-protected the page beforehand.  Here we just
      blindly issue a userfault notification for VMAs with VM_UFFD_WP,
      regardless of whether the application has write-protected the page
      yet.  The application should be ready to handle such wp faults.
      
      In the swapin case, always swap in as read-only.  This will cause
      false positive userfaults.  We will need to decide later whether to
      eliminate them with a flag like soft-dirty in the swap entry (see
      _PAGE_SWP_SOFT_DIRTY).
      
      hugetlbfs wouldn't need to worry about swapouts, and tmpfs would be
      handled by a swap entry bit like anonymous memory.
      
      The main problem, with no easy solution for eliminating the false
      positives, will arise if/when userfaultfd is extended to real
      filesystem pagecache.  When the pagecache is freed by reclaim, we
      can't leave the radix tree pinned if the inode, and in turn the radix
      tree, is reclaimed as well.
      
      The estimation is that full accuracy and lack of false positives could
      easily be provided only for anonymous memory (as long as there's no
      fork, or as long as MADV_DONTFORK is used on the userfaultfd anonymous
      range), tmpfs and hugetlbfs; it is most certainly worth achieving, but
      in a later incremental patch.
      
      [peterx@redhat.com: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-3-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      529b930b
    • userfaultfd: wp: add helper for writeprotect check · 1df319e0
      Shaohua Li authored
      Patch series "userfaultfd: write protection support", v6.
      
      Overview
      ========
      
      The uffd-wp work was initiated by Shaohua Li [1] and later continued
      by Andrea [2].  This series is based upon Andrea's latest userfaultfd
      tree and is a continuation of the work from both Shaohua and Andrea.
      Many of the follow-up ideas come from Andrea too.
      
      Besides the old MISSING register mode of userfaultfd, the new uffd-wp
      support provides an alternative register mode called
      UFFDIO_REGISTER_MODE_WP that can be used to listen not only for
      missing page faults but also for write protection page faults, or the
      two can even be registered together.  At the same time, the new
      feature also provides a new userfaultfd ioctl called
      UFFDIO_WRITEPROTECT which allows userspace to write protect a range of
      memory or fix up the write permission of faulted pages.
      
      Please refer to the document patch "userfaultfd: wp:
      UFFDIO_REGISTER_MODE_WP documentation update" for more information on the
      new interface and what it can do.
      
      The major workflow of an uffd-wp program should be:
      
        1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP
      
        2. Write protect part of the whole registered region using
           UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to
           show that we want to write protect the range.
      
        3. Start a working thread that modifies the protected pages,
           meanwhile listening to UFFD messages.
      
        4. When a write is detected on the protected range, a page fault
           happens, and a UFFD message is generated and reported to the
           page fault handling thread.
      
        5. The page fault handler thread resolves the page fault using the
           new UFFDIO_WRITEPROTECT ioctl, this time without
           UFFDIO_WRITEPROTECT_MODE_WP, showing that we want to recover
           the write permission.  Before this operation, the fault handler
           thread can do anything it wants, e.g., dump the page to
           persistent storage.
      
        6. The worker thread will continue running with the correctly
           applied write permission from step 5.
      
      Currently there are already two projects that are based on this new
      userfaultfd feature.
      
      QEMU Live Snapshot: The project provides a way to allow the QEMU
                          hypervisor to take snapshot of VMs without
                          stopping the VM [3].
      
      LLNL umap library:  The project provides a mmap-like interface and
                          "allow to have an application specific buffer of
                          pages cached from a large file, i.e. out-of-core
                          execution using memory map" [4][5].
      
      Before posting, this patchset was smoke tested against QEMU live
      snapshot and the LLNL umap library (by doing a parallel quicksort
      using 128 sorting threads + 80 uffd servicing threads).  My sincere
      thanks to Marty McFadden and Denis Plotnikov for their help along the
      way.
      
      TODO
      ====
      
      - hugetlbfs/shmem support
      - performance
      - more architectures
      - cooperate with mprotect()-allowed processes (???)
      - ...
      
      References
      ==========
      
      [1] https://lwn.net/Articles/666187/
      [2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault
      [3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm
      [4] https://github.com/LLNL/umap
      [5] https://llnl-umap.readthedocs.io/en/develop/
      [6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5
      [7] https://lkml.org/lkml/2018/11/21/370
      [8] https://lkml.org/lkml/2018/12/30/64
      
      This patch (of 19):
      
      Add helper for writeprotect check. Will use it later.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220163112.11409-2-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1df319e0
    • virtio-balloon: switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM · da10329c
      David Hildenbrand authored
      Commit 71994620 ("virtio_balloon: replace oom notifier with shrinker")
      changed the behavior when deflation happens automatically.  Instead of
      deflating when called by the OOM handler, the shrinker is used.
      
      However, the balloon is not simply some other slab cache that should be
      shrunk when under memory pressure.  The shrinker does not have a concept
      of priorities yet, so this behavior cannot be configured.  Eventually once
      that is in place, we might want to switch back after doing proper testing.
      
      There was a report that this results in undesired side effects when
      inflating the balloon to shrink the page cache. [1]
      	"When inflating the balloon against page cache (i.e. no free memory
      	 remains) vmscan.c will both shrink page cache, but also invoke the
      	 shrinkers -- including the balloon's shrinker. So the balloon
      	 driver allocates memory which requires reclaim, vmscan gets this
      	 memory by shrinking the balloon, and then the driver adds the
      	 memory back to the balloon. Basically a busy no-op."
      
      The name "deflate on OOM" makes it pretty clear when deflation should
      happen: after other approaches to reclaim memory have failed, not
      while reclaiming.  This allows minimizing the footprint of a guest;
      memory will only be taken out of the balloon when really needed.
      
      Keep using the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT, because
      this has no such side effects. Always register the shrinker with
      VIRTIO_BALLOON_F_FREE_PAGE_HINT now. We are always allowed to reuse free
      pages that are still to be processed by the guest. The hypervisor takes
      care of identifying and resolving possible races between processing a
      hinting request and the guest reusing a page.
      
      In contrast to the behavior before commit 71994620 ("virtio_balloon:
      replace oom notifier with shrinker"), don't add a module parameter to
      configure the number of pages to deflate on OOM; it can be re-added if
      really needed.  Also note that leak_balloon() returns the number of 4k
      pages, so convert it properly in virtio_balloon_oom_notify().
      
      Testing done by Tyler for future reference:
        Test setup: VM with 16 CPU, 64GB RAM. Running Debian 10. We have a 42
        GB file full of random bytes that we continually cat to /dev/null.
        This fills the page cache as the file is read. Meanwhile, we trigger
        the balloon to inflate, with a target size of 53 GB. This setup causes
        the balloon inflation to pressure the page cache as the page cache is
        also trying to grow. Afterwards we shrink the balloon back to zero (so
        total deflate == total inflate).
      
        Without this patch (kernel 4.19.0-5):
        Inflation never reaches the target until we stop the "cat file >
        /dev/null" process. Total inflation time was 542 seconds. The longest
        period that made no net forward progress was 315 seconds.
          Result of "grep balloon /proc/vmstat" after the test:
          balloon_inflate 154828377
          balloon_deflate 154828377
      
        With this patch (kernel 5.6.0-rc4+):
        Total inflation duration was 63 seconds. No deflate-queue activity
        occurs when pressuring the page-cache.
          Result of "grep balloon /proc/vmstat" after the test:
          balloon_inflate 12968539
          balloon_deflate 12968539
      
        Conclusion: This patch fixes the issue.  In the test it reduced
        inflate/deflate activity by 12x, and reduced inflation time by 8.6x.
        But more importantly, if we hadn't killed the "cat file > /dev/null"
        process then, without the patch, the inflation process would never reach
        the target.
      
      [1] https://www.spinics.net/lists/linux-virtualization/msg40863.html
      
      Link: http://lkml.kernel.org/r/20200311135523.18512-2-david@redhat.com
      Fixes: 71994620 ("virtio_balloon: replace oom notifier with shrinker")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Tyler Sanderson <tysand@google.com>
      Tested-by: Tyler Sanderson <tysand@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da10329c
    • mm/page_reporting: add free page reporting documentation · 1edca85e
      Alexander Duyck authored
      Add documentation for free page reporting.  Currently the only
      consumer is virtio-balloon; however, other drivers might make use of
      this, so it is best to add a bit of documentation explaining at a high
      level how to use the API.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224730.29318.43815.stgit@localhost.localdomain
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1edca85e
    • mm/page_reporting: add budget limit on how many pages can be reported per pass · 43b76f29
      Alexander Duyck authored
      In order to keep ourselves from reporting pages that are just going to
      be reused again in the case of heavy churn, we can put a limit on how
      many total pages we will process per pass.  Doing this allows the
      worker thread to go idle much more quickly, so that we avoid competing
      with other threads that might be allocating or freeing pages.
      
      The logic added here will limit the worker thread to no more than one
      sixteenth of the total free pages in a given area per list.  Once that
      limit is reached it will update the state so that at the end of the pass
      we will reschedule the worker to try again in 2 seconds when the memory
      churn has hopefully settled down.
      
      Again, this optimization doesn't show much of a benefit in the
      standard case, as the memory churn is minimal.  However, with page
      allocator shuffling enabled the gain is quite noticeable.  Below are
      the results with a THP-enabled version of the will-it-scale
      page_fault1 test, showing the improvement in iterations for 16
      processes or threads.
      
      Without:
      tasks   processes       processes_idle  threads         threads_idle
      16      8283274.75      0.17            5594261.00      38.15
      
      With:
      tasks   processes       processes_idle  threads         threads_idle
      16      8767010.50      0.21            5791312.75      36.98
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224719.29318.72113.stgit@localhost.localdomain
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43b76f29
    • Alexander Duyck's avatar
      mm/page_reporting: rotate reported pages to the tail of the list · 02cf8719
      Alexander Duyck authored
      Rather than walking over the same pages again and again to get to the
      pages that have yet to be reported we can save ourselves a significant
      amount of time by simply rotating the list so that when we have a full
      list of reported pages the head of the list is pointing to the next
      non-reported page.  Doing this should save us some significant time when
      processing each free list.
      
      This doesn't gain us much in the standard case as all of the non-reported
      pages should be near the top of the list already.  However in the case of
      page shuffling this results in a noticeable improvement.  Below are the
      will-it-scale page_fault1 w/ THP numbers for 16 tasks with and without
      this patch.
      
      Without:
      tasks   processes       processes_idle  threads         threads_idle
      16      8093776.25      0.17            5393242.00      38.20
      
      With:
      tasks   processes       processes_idle  threads         threads_idle
      16      8283274.75      0.17            5594261.00      38.15
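The rotation described above can be sketched in plain C.  This is a userspace toy, not the kernel implementation; `struct page` here and the `rotate_reported` helper are illustrative stand-ins for the real free-list handling:

```c
/* Toy sketch: move leading "reported" entries to the tail so the head of
 * the list points at the next non-reported page.  Userspace stand-ins,
 * not the kernel's struct page or list primitives. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page { bool reported; struct page *next; };

/* Returns the new head: the first unreported page (or the tail if all
 * pages were already reported). */
struct page *rotate_reported(struct page *head)
{
	if (!head)
		return NULL;

	struct page *tail = head;
	while (tail->next)
		tail = tail->next;

	/* Rotate reported pages from the head to the tail. */
	while (head != tail && head->reported) {
		struct page *moved = head;
		head = head->next;
		moved->next = NULL;
		tail->next = moved;
		tail = moved;
	}
	return head;
}
```

After the rotation, the next scan starts at an unreported page instead of walking the same reported pages again.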
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224708.29318.16862.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02cf8719
    • Alexander Duyck's avatar
      virtio-balloon: add support for providing free page reports to host · b0c504f1
      Alexander Duyck authored
      Add support for the page reporting feature provided by virtio-balloon.
Reporting differs from the regular balloon functionality in that it is
much less durable than a standard memory balloon.  Instead of creating a
list of pages that cannot be accessed, the pages are only inaccessible
while they are being indicated to the virtio interface.  Once the
      interface has acknowledged them they are placed back into their respective
      free lists and are once again accessible by the guest system.
      
      Unlike a standard balloon we don't inflate and deflate the pages.  Instead
      we perform the reporting, and once the reporting is completed it is
      assumed that the page has been dropped from the guest and will be faulted
      back in the next time the page is accessed.
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224657.29318.68624.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0c504f1
    • Alexander Duyck's avatar
      virtio-balloon: pull page poisoning config out of free page hinting · d74b78fa
      Alexander Duyck authored
Currently the page poisoning setting isn't enabled unless free page
hinting is enabled.  However, we will need the page poisoning tracking
logic as well for free page reporting.  As such, pull it out and make it a
separate bit of config in the probe function.
      
In addition, we need to add support for the more recent init_on_free
feature, which expects behavior similar to page poisoning in that the
page is expected to be pre-zeroed.
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224646.29318.695.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d74b78fa
    • Alexander Duyck's avatar
      mm: introduce Reported pages · 36e66c55
      Alexander Duyck authored
      In order to pave the way for free page reporting in virtualized
      environments we will need a way to get pages out of the free lists and
      identify those pages after they have been returned.  To accomplish this,
      this patch adds the concept of a Reported Buddy, which is essentially
      meant to just be the Uptodate flag used in conjunction with the Buddy page
      type.
      
      To prevent the reported pages from leaking outside of the buddy lists I
      added a check to clear the PageReported bit in the del_page_from_free_list
      function.  As a result any reported page that is split, merged, or
      allocated will have the flag cleared prior to the PageBuddy value being
      cleared.
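As a toy illustration of that invariant (the struct fields here are userspace stand-ins for the kernel's page flag helpers, not the real PageReported/PageBuddy machinery):

```c
/* Sketch of the invariant described above: any page removed from a free
 * list has its Reported bit cleared before the Buddy state is cleared,
 * so the flag can never leak outside the buddy lists. */
#include <assert.h>
#include <stdbool.h>

struct page { bool buddy; bool reported; };

void del_page_from_free_list(struct page *p)
{
	p->reported = false;	/* clear Reported first, per the patch */
	p->buddy = false;	/* page is no longer a free buddy page */
}
```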
      
      The process for reporting pages is fairly simple.  Once we free a page
      that meets the minimum order for page reporting we will schedule a worker
      thread to start 2s or more in the future.  That worker thread will begin
      working from the lowest supported page reporting order up to MAX_ORDER - 1
      pulling unreported pages from the free list and storing them in the
      scatterlist.
      
      When processing each individual free list it is necessary for the worker
      thread to release the zone lock when it needs to stop and report the full
      scatterlist of pages.  To reduce the work of the next iteration the worker
      thread will rotate the free list so that the first unreported page in the
      free list becomes the first entry in the list.
      
      It will then call a reporting function providing information on how many
      entries are in the scatterlist.  Once the function completes it will
      return the pages to the free area from which they were allocated and start
      over pulling more pages from the free areas until there are no longer
      enough pages to report on to keep the worker busy, or we have processed as
      many pages as were contained in the free area when we started processing
      the list.
      
The worker thread will work in a round-robin fashion, making its way
through each zone requesting reporting, and through each reportable free
list within that zone.  Once all free areas within the zone have been
processed it will check to see if there have been any requests for
reporting while it was processing.  If so, it will reschedule the worker
thread to start up again in roughly 2s and exit.
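The walk described above can be condensed into a userspace sketch.  The batch size, the names, and the array-based "free list" are all illustrative assumptions, not the kernel's scatterlist code:

```c
/* Condensed sketch of a reporting pass: gather unreported entries into a
 * fixed-size batch (the scatterlist stand-in), "report" each full batch,
 * and mark its pages reported.  Returns the number of pages reported. */
#include <assert.h>
#include <stdbool.h>

#define SG_CAPACITY 4	/* illustrative scatterlist capacity */

struct fpage { bool reported; };

static int report_free_pages(struct fpage *list, int n)
{
	struct fpage *sg[SG_CAPACITY];
	int filled = 0, total = 0;

	for (int i = 0; i < n; i++) {
		if (list[i].reported)
			continue;		/* skip already-reported pages */
		sg[filled++] = &list[i];
		if (filled == SG_CAPACITY) {
			/* "report" the batch, then mark pages reported */
			for (int j = 0; j < filled; j++)
				sg[j]->reported = true;
			total += filled;
			filled = 0;
		}
	}
	for (int j = 0; j < filled; j++)	/* flush the partial batch */
		sg[j]->reported = true;
	return total + filled;
}
```

A second pass over the same list finds nothing left to report, which mirrors how the worker goes idle once all free areas have been processed.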
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36e66c55
    • Alexander Duyck's avatar
      mm: add function __putback_isolated_page · 624f58d8
      Alexander Duyck authored
      There are cases where we would benefit from avoiding having to go through
      the allocation and free cycle to return an isolated page.
      
      Examples for this might include page poisoning in which we isolate a page
      and then put it back in the free list without ever having actually
      allocated it.
      
      This will enable us to also avoid notifiers for the future free page
      reporting which will need to avoid retriggering page reporting when
      returning pages that have been reported on.
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      624f58d8
    • Alexander Duyck's avatar
      mm: use zone and order instead of free area in free_list manipulators · 6ab01363
      Alexander Duyck authored
      In order to enable the use of the zone from the list manipulator functions
      I will need access to the zone pointer.  As it turns out most of the
      accessors were always just being directly passed &zone->free_area[order]
      anyway so it would make sense to just fold that into the function itself
      and pass the zone and order as arguments instead of the free area.
      
      In order to be able to reference the zone we need to move the declaration
      of the functions down so that we have the zone defined before we define
      the list manipulation functions.  Since the functions are only used in the
      file mm/page_alloc.c we can just move them there to reduce noise in the
      header.
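A minimal sketch of the resulting accessor shape, assuming reduced stand-in structures rather than the kernel's real `struct zone` and list primitives:

```c
/* After the change, the list manipulators take the zone and order and
 * fold in the &zone->free_area[order] lookup themselves.  Structures
 * below are simplified userspace stand-ins. */
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER 11
#define NR_MIGRATETYPES 3

struct list_head { struct list_head *next; };
struct free_area {
	struct list_head free_list[NR_MIGRATETYPES];
	unsigned long nr_free;
};
struct zone { struct free_area free_area[MAX_ORDER]; };

/* Callers pass (zone, order) instead of &zone->free_area[order]. */
static void add_to_free_list(struct list_head *page, struct zone *zone,
			     unsigned int order, int migratetype)
{
	struct free_area *area = &zone->free_area[order];

	page->next = area->free_list[migratetype].next;
	area->free_list[migratetype].next = page;
	area->nr_free++;
}
```

With the zone pointer in hand, later patches in the series can consult zone state at the point of list insertion.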
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ab01363
    • Alexander Duyck's avatar
      mm: adjust shuffle code to allow for future coalescing · a2129f24
      Alexander Duyck authored
      Patch series "mm / virtio: Provide support for free page reporting", v17.
      
      This series provides an asynchronous means of reporting free guest pages
      to a hypervisor so that the memory associated with those pages can be
      dropped and reused by other processes and/or guests on the host.  Using
      this it is possible to avoid unnecessary I/O to disk and greatly improve
      performance in the case of memory overcommit on the host.
      
      When enabled we will be performing a scan of free memory every 2 seconds
      while pages of sufficiently high order are being freed.  In each pass at
      least one sixteenth of each free list will be reported.  By doing this we
      avoid racing against other threads that may be causing a high amount of
      memory churn.
      
      The lowest page order currently scanned when reporting pages is
      pageblock_order so that this feature will not interfere with the use of
      Transparent Huge Pages in the case of virtualization.
      
      Currently this is only in use by virtio-balloon however there is the hope
      that at some point in the future other hypervisors might be able to make
      use of it.  In the virtio-balloon/QEMU implementation the hypervisor is
      currently using MADV_DONTNEED to indicate to the host kernel that the page
      is currently free.  It will be zeroed and faulted back into the guest the
      next time the page is accessed.
      
      To track if a page is reported or not the Uptodate flag was repurposed and
used as a Reported flag for Buddy pages.  We walk through the free list
      isolating pages and adding them to the scatterlist until we either
      encounter the end of the list or have processed at least one sixteenth of
      the pages that were listed in nr_free prior to us starting.  If we fill
      the scatterlist before we reach the end of the list we rotate the list so
      that the first unreported page we encounter is moved to the head of the
      list as that is where we will resume after we have freed the reported
      pages back into the tail of the list.
      
      Below are the results from various benchmarks.  I primarily focused on two
      tests.  The first is the will-it-scale/page_fault2 test, and the other is
      a modified version of will-it-scale/page_fault1 that was enabled to use
      THP.  I did this as it allows for better visibility into different parts
of the memory subsystem.  The guest is running with 32G of RAM on one
node of an E5-2630 v3.  The host has had some features such as CPU turbo
      disabled in the BIOS.
      
      Test                   page_fault1 (THP)    page_fault2
      Name            tasks  Process Iter  STDEV  Process Iter  STDEV
      Baseline            1    1012402.50  0.14%     361855.25  0.81%
                         16    8827457.25  0.09%    3282347.00  0.34%
      
      Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
                         16    8784741.75  0.39%    3240669.25  0.48%
      
      Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
                         16    8756219.00  0.24%    3226608.75  0.97%
      
      Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
       page shuffle      16    8672601.25  0.49%    3223177.75  0.40%
      
      Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
       shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%
      
      The results above are for a baseline with a linux-next-20191219 kernel,
      that kernel with this patch set applied but page reporting disabled in
      virtio-balloon, the patches applied and page reporting fully enabled, the
      patches enabled with page shuffling enabled, and the patches applied with
page shuffling enabled, and an RFC patch that makes use of MADV_FREE in
      QEMU.  These results include the deviation seen between the average value
      reported here versus the high and/or low value.  I observed that during
      the test memory usage for the first three tests never dropped whereas with
      the patches fully enabled the VM would drop to using only a few GB of the
      host's memory when switching from memhog to page fault tests.
      
      Any of the overhead visible with this patch set enabled seems due to page
      faults caused by accessing the reported pages and the host zeroing the
      page before giving it back to the guest.  This overhead is much more
      visible when using THP than with standard 4K pages.  In addition page
      shuffling seemed to increase the amount of faults generated due to an
increase in memory churn.  The overhead is reduced when using MADV_FREE, as
      we can avoid the extra zeroing of the pages when they are reintroduced to
      the host, as can be seen when the RFC is applied with shuffling enabled.
      
The overall guest size is kept fairly small, at only a few GB, while the
      test is running.  If the host memory were oversubscribed this patch set
      should result in a performance improvement as swapping memory in the host
      can be avoided.
      
      A brief history on the background of free page reporting can be found at:
      https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/
      
      This patch (of 9):
      
      Move the head/tail adding logic out of the shuffle code and into the
      __free_one_page function since ultimately that is where it is really
      needed anyway.  By doing this we should be able to reduce the overhead and
      can consolidate all of the list addition bits in one spot.
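A toy sketch of the consolidated decision point; the `rand()`-based coin flip stands in for the shuffle code's RNG, and the two helpers are userspace stand-ins:

```c
/* All head/tail placement decisions live in one spot: when shuffling is
 * enabled a coin flip picks the end, otherwise pages go to the head. */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int head_adds, tail_adds;	/* counters for the sketch */
static void add_to_free_list(void)	{ head_adds++; }
static void add_to_free_list_tail(void)	{ tail_adds++; }

static void free_one_page(bool shuffle_enabled)
{
	bool to_tail = shuffle_enabled && (rand() & 1);

	if (to_tail)
		add_to_free_list_tail();
	else
		add_to_free_list();
}
```

Keeping the choice inside the freeing path means later callers (such as the page reporting code) can also request tail placement without duplicating the logic.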
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224602.29318.84523.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2129f24
    • Huang Ying's avatar
      mm: code cleanup for MADV_FREE · 9de4f22a
      Huang Ying authored
Some comments for MADV_FREE are revised and added to help people understand
the MADV_FREE code, especially the page flag PG_swapbacked.  This makes
page_is_file_cache() inconsistent with its comments, so the function is
renamed to page_is_file_lru() to make them consistent again.  All of this
is put in one patch as one logical change.
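The renamed helper can be mocked in userspace to show its relationship to PG_swapbacked; the `bool` field below is a stand-in for the real page flag bit:

```c
/* MADV_FREE pages clear PG_swapbacked, so the LRU test keys off that
 * flag: a page without it is treated as file-backed for LRU purposes.
 * Userspace mock of the flag test. */
#include <assert.h>
#include <stdbool.h>

struct page { bool swapbacked; };	/* stand-in for PG_swapbacked */

static inline int page_is_file_lru(const struct page *page)
{
	return !page->swapbacked;	/* mirrors the kernel's !PageSwapBacked() */
}
```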
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9de4f22a
    • Li Chen's avatar
      7a9547fd
    • Matthew Wilcox (Oracle)'s avatar
      mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHE · 396bcc52
      Matthew Wilcox (Oracle) authored
      Commit e496cf3d ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
      notes that it should be reverted when the PowerPC problem was fixed.  The
      commit fixing the PowerPC problem (953c66c2) did not revert the
commit; instead it set CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same value
as CONFIG_TRANSPARENT_HUGEPAGE.  Checking with Kirill and Aneesh, this was an
      oversight, so remove the Kconfig symbol and undo the work of commit
      e496cf3d.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      396bcc52
    • Matthew Wilcox (Oracle)'s avatar
      include/linux/pagemap.h: optimise find_subpage for !THP · a0650604
      Matthew Wilcox (Oracle) authored
      If THP is disabled, find_subpage() can become a no-op by using
      hpage_nr_pages() instead of compound_nr().  hpage_nr_pages() embeds a
      check for PageTail, so we can drop the check here.
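The arithmetic can be demonstrated with a userspace stand-in; here the `nr` parameter plays the role of `hpage_nr_pages()`, which is constantly 1 when THP is disabled, so the function degenerates to returning `head`:

```c
/* With nr == 1 the mask (nr - 1) is 0 and find_subpage() is a no-op
 * returning the head page; with a compound page it indexes into the
 * subpages.  Simplified userspace model of the pointer arithmetic. */
#include <assert.h>

struct page { int dummy; };

static struct page *find_subpage(struct page *head, unsigned long index,
				 unsigned long nr /* hpage_nr_pages() */)
{
	return head + (index & (nr - 1));
}
```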
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200318140253.6141-5-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0650604
    • David Rientjes's avatar
      mm, thp: track fallbacks due to failed memcg charges separately · 85b9f46e
      David Rientjes authored
      The thp_fault_fallback and thp_file_fallback vmstats are incremented if
      either the hugepage allocation fails through the page allocator or the
      hugepage charge fails through mem cgroup.
      
This patch leaves this field untouched but adds two new fields,
thp_{fault,file}_fallback_charge, which are incremented only when the mem
cgroup charge fails.
      
      This distinguishes between attempted hugepage allocations that fail due to
      fragmentation (or low memory conditions) and those that fail due to mem
      cgroup limits.  That can be used to determine the impact of fragmentation
      on the system by excluding faults that failed due to memcg usage.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jeremy Cline <jcline@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061422070.7412@chino.kir.corp.google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      85b9f46e
    • David Rientjes's avatar
      mm, shmem: add vmstat for hugepage fallback · dcdf11ee
      David Rientjes authored
      The existing thp_fault_fallback indicates when thp attempts to allocate a
      hugepage but fails, or if the hugepage cannot be charged to the mem cgroup
      hierarchy.
      
Extend this to shmem as well.  Add a new thp_file_fallback counter to
complement thp_file_alloc; it is incremented when a hugepage allocation is
attempted but fails, or when the page cannot be charged to the mem cgroup
hierarchy.
      
      Additionally, remove the check for CONFIG_TRANSPARENT_HUGE_PAGECACHE from
      shmem_alloc_hugepage() since it is only called with this configuration
      option.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jeremy Cline <jcline@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061421240.7412@chino.kir.corp.google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcdf11ee
    • Yang Shi's avatar
      mm/migrate.c: migrate PG_readahead flag · 6aeff241
      Yang Shi authored
Currently the migration code doesn't migrate the PG_readahead flag.
Theoretically this would incur a slight performance loss, as the
application might have to ramp its readahead back up again.  Even when
such a problem happens, it might be hidden by something else, since
migration is typically triggered by compaction and NUMA balancing, either
of which should be more noticeable.
      
      Migrate the flag after end_page_writeback() since it may clear PG_reclaim
      flag, which is the same bit as PG_readahead, for the new page.
      
      [akpm@linux-foundation.org: tweak comment]
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/1581640185-95731-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6aeff241
    • Wei Yang's avatar
      mm/migrate.c: unify "not queued for migration" handling in do_pages_move() · d08221a0
      Wei Yang authored
      It can currently happen that we store the status of a page twice:
      * Once we detect that it is already on the target node
      * Once we moved a bunch of pages, and a page that's already on the
        target node is contained in the current interval.
      
      Let's simplify the code and always call do_move_pages_to_node() in case we
      did not queue a page for migration.  Note that pages that are already on
      the target node are not added to the pagelist and are, therefore, ignored
      by do_move_pages_to_node() - there is no functional change.
      
      The status of such a page is now only stored once.
      
      [david@redhat.com rephrase changelog]
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200214003017.25558-5-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d08221a0
    • Wei Yang's avatar
      mm/migrate.c: check pagelist in move_pages_and_store_status() · 5d7ae891
      Wei Yang authored
When pagelist is empty, it is not necessary to do the move and store.
This also consolidates the empty-list check in one place.
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200214003017.25558-4-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d7ae891
    • Wei Yang's avatar
      mm/migrate.c: wrap do_move_pages_to_node() and store_status() · 7ca8783a
      Wei Yang authored
      Usually, do_move_pages_to_node() and store_status() are used in
      combination.  We have three similar call sites.
      
      Let's provide a wrapper for both function calls -
      move_pages_and_store_status - to make the calling code easier to maintain
      and fix (as noted by Yang Shi, the return value handling of
      do_move_pages_to_node() has a flaw).
      
      [david@redhat.com: rephrase changelog]
      Signed-off-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200214003017.25558-3-richardw.yang@linux.intel.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ca8783a
    • Wei Yang's avatar
      mm/migrate.c: no need to check for i > start in do_pages_move() · 4afdacec
      Wei Yang authored
      Patch series "cleanup on do_pages_move()", v5.
      
      The logic in do_pages_move() is a little messy for readers and has some
      potential errors in handling the return value.  In particular, there are
      three calls to do_move_pages_to_node() and store_status() of almost the
      same form.
      
      This patch set tries to make the code a little friendlier to readers by
      consolidating those calls.
      
      This patch (of 4):
      
      At this point, we always have i >= start.  If i == start, store_status()
      will return 0.  So we can drop the check for i > start.
      
      [david@redhat.com: rephrase changelog]
      Signed-off-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200214003017.25558-2-richardw.yang@linux.intel.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4afdacec
    • Michal Hocko's avatar
      mm: make it clear that gfp reclaim modifiers are valid only for sleepable allocations · 29fd1897
      Michal Hocko authored
      While it might be really clear to MM developers that gfp reclaim modifiers
      are applicable only to sleepable allocations (those with
      __GFP_DIRECT_RECLAIM), it seems that actual users of the API are not
      always sure.  Make it explicit that they are not applicable to GFP_NOWAIT
      or GFP_ATOMIC allocations, which are the most commonly used non-sleepable
      allocation masks.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarJoel Fernandes (Google) <joel@joelfernandes.org>
      Acked-by: default avatarPaul E. McKenney <paulmck@kernel.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Neil Brown <neilb@suse.de>
      Link: http://lkml.kernel.org/r/20200403083543.11552-3-mhocko@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      29fd1897
    • Qiujun Huang's avatar
      mm/vmalloc: fix a typo in comment · d8cc323d
      Qiujun Huang authored
      There is a typo in comment, fix it.
      "exeeds" -> "exceeds"
      Signed-off-by: default avatarQiujun Huang <hqjagain@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200404060136.10838-1-hqjagain@gmail.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d8cc323d
    • Anshuman Khandual's avatar
      mm/vma: append unlikely() while testing VMA access permissions · 5093c587
      Anshuman Khandual authored
      It is unlikely that an inaccessible VMA without the required permission
      flags will get a page fault.  Hence let's just append the unlikely()
      directive to such checks in order to improve performance while also
      standardizing them across various platforms.
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Link: http://lkml.kernel.org/r/1582525304-32113-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5093c587
    • Anshuman Khandual's avatar
      mm/vma: replace all remaining open encodings with vma_is_anonymous() · a0137f16
      Anshuman Khandual authored
      This replaces all remaining open encodings with vma_is_anonymous().
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/1582520593-30704-5-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a0137f16
    • Anshuman Khandual's avatar
      mm/vma: replace all remaining open encodings with is_vm_hugetlb_page() · 03911132
      Anshuman Khandual authored
      This replaces all remaining open encodings with is_vm_hugetlb_page().
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/1582520593-30704-4-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03911132
    • Anshuman Khandual's avatar
      mm/vma: make vma_is_accessible() available for general use · 3122e80e
      Anshuman Khandual authored
      Let's move the vma_is_accessible() helper to include/linux/mm.h, which
      makes it available for general use.  While here, this replaces all
      remaining open encodings for the VMA access check with
      vma_is_accessible().
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: default avatarGuo Ren <guoren@kernel.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3122e80e
    • Anshuman Khandual's avatar
      mm/vma: add missing VMA flag readable name for VM_SYNC · 7e96fb57
      Anshuman Khandual authored
      Patch series "mm/vma: Use all available wrappers when possible", v2.
      
      Apart from adding a VMA flag readable name for tracing purposes, this
      series does some open-encoding replacements with the available
      VMA-specific wrappers.  It skips the VM_HUGETLB check in vma_migratable()
      as that is already being done by another patch
      (https://patchwork.kernel.org/patch/11347831/) which is yet to be merged.
      
      This patch (of 4):
      
      This just adds the missing readable name for VM_SYNC.
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/1582520593-30704-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7e96fb57
    • Li Xinhai's avatar
      mm: set vm_next and vm_prev to NULL in vm_area_dup() · e39a4b33
      Li Xinhai authored
      Set ->vm_next and ->vm_prev to NULL to prevent potential misuse of the
      newly duplicated vma.
      
      Currently, the fork path is the only place where such misuse occurs, in
      anon_vma handling.  No other bugs have been revealed with this patch
      applied.
      Signed-off-by: default avatarLi Xinhai <lixinhai.lxh@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1581150928-3214-4-git-send-email-lixinhai.lxh@gmail.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e39a4b33
    • Li Xinhai's avatar
      Revert "mm/rmap.c: reuse mergeable anon_vma as parent when fork" · 23ab76bf
      Li Xinhai authored
      This reverts commit 4e4a9eb9 ("mm/rmap.c: reuse mergeable
      anon_vma as parent when fork").
      
      In dup_mmap(), anon_vma_fork() is called to attach the anon_vma, and the
      parameter 'tmp' (i.e., the new vma of the child) has the same ->vm_next
      and ->vm_prev as its parent vma.  That causes the anon_vma used by the
      parent to be mistakenly shared with the child (in anon_vma_clone(), the
      code added by that commit performs this reuse).
      
      Besides this issue, the design of reusing an anon_vma from a vma which
      has gone through fork should be avoided ([1]).  So, this patch reverts
      that commit and maintains the consistent logic of reusing anon_vma for
      fork/split/merge of vmas.
      
      Reusing an anon_vma within the process is fine.  But if a vma has gone
      through fork(), then that vma's anon_vma should not be shared with its
      neighbor vma.  As explained in [1], when a vma has gone through fork(),
      the check for list_is_singular(vma->anon_vma_chain) will be false, so
      the anon_vma is not shared.
      
      With the current issue, one example can clarify things.  The parent
      process does the two steps below:
      
      1. p_vma_1 is created and p_anon_vma_1 is prepared;
      
      2. p_vma_2 is created and shares p_anon_vma_1 (this is allowed,
         because p_vma_1 didn't go through fork()).
      
      Then the parent process does fork():
      
      3. c_vma_1 is dup'ed from p_vma_1 and has its own c_anon_vma_1
         prepared; at this point, c_vma_1->anon_vma_chain has two items, one
         for p_anon_vma_1 and one for c_anon_vma_1;
      
      4. c_vma_2 is dup'ed from p_vma_2; it is not allowed to share
         c_anon_vma_1, because c_vma_1->anon_vma_chain has two items.
      [1] commit d0e9fe17 ("Simplify and comment on anon_vma re-use for
          anon_vma_prepare()") explains the test of "list_is_singular()".
      
      Fixes: 4e4a9eb9 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork")
      Signed-off-by: default avatarLi Xinhai <lixinhai.lxh@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1581150928-3214-3-git-send-email-lixinhai.lxh@gmail.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      23ab76bf
    • Li Xinhai's avatar
      mm: don't prepare anon_vma if vma has VM_WIPEONFORK · 93949bb2
      Li Xinhai authored
      Patch series "mm: Fix misuse of parent anon_vma in dup_mmap path".
      
      This patchset fixes the misuse of the parent's anon_vma, mainly caused by
      the child vma's vm_next and vm_prev being left the same as its parent's
      after duplicating the vma.  As a result, code reached the parent vma's
      neighbors through pointers of the child vma and executed the wrong logic.
      
      The first two patches fix the relevant issues, and the third patch sets
      vm_next and vm_prev to NULL when duplicating a vma to prevent potential
      misuse in the future.
      
      The effect of the first bug is that rmap code checks both the parent's
      and the child's page tables, although a page couldn't be mapped by both
      parent and child: the child vma has WIPEONFORK, so all pages mapped by
      the child are 'new' and not relevant to the parent.
      
      The effect of the second bug is that the anon_vma relationships of parent
      and child become totally convoluted.  It would cause 'son', 'grandson',
      ..., etc. to share the 'parent' anon_vma, which disobeys the design rule
      of reusing anon_vma (the rule is that reuse should be among vmas of the
      same process, and the vma should not have gone through fork).
      
      So, both issues cause unnecessary rmap walking and unexpected complexity.
      
      These two issues are not directly visible; I used debugging code to check
      the anon_vma pointers of parent and child when inspecting the suspicious
      implementation of issue #2, and then found the problem.
      
      This patch (of 3):
      
      In dup_mmap(), anon_vma_prepare() is called for a vma that has
      VM_WIPEONFORK, and the parameter 'tmp' (i.e., the new vma of the child)
      has the same ->vm_next and ->vm_prev as its parent vma.  That allows the
      anon_vma used by the parent to be mistakenly shared with the child
      (find_mergeable_anon_vma() will do this reuse work).
      
      Besides this issue, calling anon_vma_prepare() should be avoided because
      we don't copy pages for this vma; preparing the anon_vma will be handled
      during fault.
      
      Fixes: d2cd9ede ("mm,fork: introduce MADV_WIPEONFORK")
      Signed-off-by: default avatarLi Xinhai <lixinhai.lxh@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/1581150928-3214-2-git-send-email-lixinhai.lxh@gmail.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      93949bb2