  1. 20 Dec, 2023 1 commit
    • mm: thp: introduce multi-size THP sysfs interface · 3485b883
      Ryan Roberts authored
      In preparation for adding support for anonymous multi-size THP, introduce
      new sysfs structure that will be used to control the new behaviours.  A
      new directory is added under transparent_hugepage for each supported THP
      size, and contains an `enabled` file, which can be set to "inherit" (to
      inherit the global setting), "always", "madvise" or "never".  For now, the
      kernel still only supports PMD-sized anonymous THP, so only 1 directory is
      populated.
      
      The first half of the change converts transhuge_vma_suitable() and
      hugepage_vma_check() so that they take a bitfield of orders for which the
      user wants to determine support, and the functions filter out all the
      orders that can't be supported, given the current sysfs configuration and
      the VMA dimensions.  The resulting functions are renamed to
      thp_vma_suitable_orders() and thp_vma_allowable_orders() respectively. 
      Convenience functions that take a single, unencoded order and return a
      boolean are also defined as thp_vma_suitable_order() and
      thp_vma_allowable_order().
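
      For illustration, a single-order convenience wrapper can be expressed in
      terms of the bitfield variant roughly like this (the argument list shown
      here is illustrative, not necessarily the exact signature):

          /* Accept one order and delegate via a one-bit mask. */
          #define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, \
                                          enforce_sysfs, order) \
              thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, \
                                       enforce_sysfs, BIT(order))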
      
      The second half of the change implements the new sysfs interface.  It has
      been done so that each supported THP size has a `struct thpsize`, which
      describes the relevant metadata and is itself a kobject.  This is pretty
      minimal for now, but should make it easy to add new per-thpsize files to
      the interface if needed in future (e.g.  per-size defrag).  Rather than
      keep the `enabled` state directly in the struct thpsize, I've elected to
      directly encode it into huge_anon_orders_[always|madvise|inherit]
      bitfields since this reduces the amount of work required in
      thp_vma_allowable_orders() which is called for every page fault.
      
      See Documentation/admin-guide/mm/transhuge.rst, as modified by this
      commit, for details of how the new sysfs interface works.
      
      [ryan.roberts@arm.com: fix build warning when CONFIG_SYSFS is disabled]
        Link: https://lkml.kernel.org/r/20231211125320.3997543-1-ryan.roberts@arm.com
      Link: https://lkml.kernel.org/r/20231207161211.2374093-4-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Barry Song <v-songbaohua@oppo.com>
      Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Itaru Kitayama <itaru.kitayama@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 11 Dec, 2023 3 commits
    • mm/memory: use kmap_local_page() in __wp_page_copy_user() · 24d2613a
      Fabio M. De Francesco authored
      kmap_atomic() has been deprecated in favor of kmap_local_{folio,page}.
      
      Therefore, replace kmap_atomic() with kmap_local_page() in
      __wp_page_copy_user().
      
      kmap_atomic() disables preemption in !PREEMPT_RT kernels and also
      unconditionally disables page faults.  My limited knowledge of the
      implementation of __wp_page_copy_user() makes me think that the latter
      side effect is still needed here, but kmap_local_page() does not disable
      page faults.
      
      So, in addition to the conversion to local mapping, add explicit
      pagefault_disable() / pagefault_enable() between mapping and un-mapping.
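
      The resulting mapping pattern is therefore roughly (sketch only; dst is
      the destination page, and the surrounding copy / fault-in logic is
      unchanged):

          kaddr = kmap_local_page(dst);
          pagefault_disable();
          /* ... copy-from-user / fault-in handling operating on kaddr ... */
          pagefault_enable();
          kunmap_local(kaddr);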
      
      Link: https://lkml.kernel.org/r/20231120142418.6977-1-fmdefrancesco@gmail.com
      Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: convert __do_fault() to use a folio · 01d1e0e6
      Matthew Wilcox (Oracle) authored
      Convert vmf->page to a folio as soon as we're going to use it.  This fixes
      a bug if the fault handler returns a tail page with hardware poison; tail
      pages have an invalid page->index, so we would fail to unmap the page from
      the page tables.  We actually have to unmap the entire folio (or
      mapping_evict_folio() will fail), so use unmap_mapping_folio() instead.
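
      Simplified sketch of the intended shape (the real code also keeps the
      VM_FAULT_LOCKED handling intact; ret stands in for the existing poison
      return value):

          struct folio *folio = page_folio(vmf->page);

          if (unlikely(PageHWPoison(vmf->page))) {
                  /* Unmap the whole folio, not just the (possibly tail)
                   * page, so that mapping_evict_folio() can succeed. */
                  if (page_mapped(vmf->page))
                          unmap_mapping_folio(folio);
                  folio_unlock(folio);
                  folio_put(folio);
                  return ret;
          }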
      
      This also saves various calls to compound_head() hidden in lock_page(),
      put_page(), etc.
      
      Link: https://lkml.kernel.org/r/20231108182809.602073-3-willy@infradead.org
      Fixes: 793917d9 ("mm/readahead: Add large folio readahead")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fork: use __mt_dup() to duplicate maple tree in dup_mmap() · d2406291
      Peng Zhang authored
      In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
      directly replacing the entries of VMAs in the new maple tree can result in
      better performance.  __mt_dup() uses DFS pre-order to duplicate the maple
      tree, so it is efficient.
      
      The average time complexity of __mt_dup() is O(n), where n is the number
      of VMAs.  The proof of the time complexity is provided in the commit log
      that introduces __mt_dup().  After duplicating the maple tree, each
      element is traversed and replaced (ignoring the cases of deletion, which
      are rare).  Since it is only a replacement operation for each element,
      this process is also O(n).
      
      Analyzing the exact time complexity of the previous algorithm is
      challenging because each insertion can involve appending to a node,
      pushing data to adjacent nodes, or even splitting nodes.  The frequency of
      each action is difficult to calculate.  The worst-case scenario for a
      single insertion is when the tree undergoes splitting at every level.  If
      we consider each insertion as the worst-case scenario, we can determine
      that the upper bound of the time complexity is O(n*log(n)), although this
      is a loose upper bound.  However, based on the test data, it appears that
      the actual time complexity is likely to be O(n).
      
      As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
      fails, there will be a portion of VMAs that have not been duplicated in
      the maple tree.  To handle this, we mark the failure point with
      XA_ZERO_ENTRY.  In exit_mmap(), if this marker is encountered, stop
      releasing VMAs that have not been duplicated after this point.
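
      The resulting dup_mmap() flow is roughly as follows (sketch only: vmi is
      a VMA iterator over the newly duplicated tree, and copy_one_vma() is an
      illustrative stand-in for the existing open-coded VMA duplication):

          retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
          if (retval)
                  goto out;

          for_each_vma(vmi, mpnt) {
                  struct vm_area_struct *tmp = copy_one_vma(mpnt);

                  if (!tmp) {
                          /* Mark the failure point; exit_mmap() stops
                           * releasing VMAs once it sees this marker. */
                          mas_store(&vmi.mas, XA_ZERO_ENTRY);
                          break;
                  }
                  /* Replace the old entry with the new copy in place. */
                  vma_iter_store(&vmi, tmp);
          }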
      
      byte-unixbench [1] includes a "spawn" test, which can be used to measure
      the performance of fork().  I modified it slightly to make it work with
      different numbers of VMAs.
      
      Below are the test results.  The first row shows the number of VMAs.  The
      second and third rows show the number of fork() calls per ten seconds,
      corresponding to next-20231006 and this patchset, respectively.  The
      test results were obtained with CPU binding to avoid scheduler load
      balancing that could cause unstable results.  There are still some
      fluctuations in the test results, but at least they are better than the
      original performance.
      
      VMAs           21     121   221    421    821    1621   3221   6421   12821  25621  51221
      next-20231006  112100 76261 54227  34035  20195  11112  6017   3161   1606   802    393
      this patchset  114558 83067 65008  45824  28751  16072  8922   4747   2436   1233   599
      improvement    2.19%  8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
      
      [1] https://github.com/kdlucas/byte-unixbench/tree/master
      
      Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Mike Christie <michael.christie@oracle.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 18 Oct, 2023 12 commits
    • mm/gup: adapt get_user_page_vma_remote() to never return NULL · 6a1960b8
      Lorenzo Stoakes authored
      get_user_pages_remote() will never return 0 except in the case of
      FOLL_NOWAIT being specified, which we explicitly disallow.
      
      This simplifies error handling for the caller and avoids the awkwardness
      of dealing with both errors and failing to pin.  Failing to pin here is an
      error.
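
      The calling convention this leaves us with looks roughly like (sketch):

          struct vm_area_struct *vma;
          struct page *page;

          page = get_user_page_vma_remote(mm, addr, gup_flags, &vma);
          /* Either a valid page or an ERR_PTR() -- never NULL, so no
           * IS_ERR_OR_NULL() dance is needed. */
          if (IS_ERR(page))
                  return PTR_ERR(page);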
      
      Link: https://lkml.kernel.org/r/00319ce292d27b3aae76a0eb220ce3f528187508.1696288092.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: make __access_remote_vm() static · c43cfa42
      Lorenzo Stoakes authored
      Patch series "various improvements to the GUP interface", v2.
      
      A series of fixes to simplify and improve the GUP interface, with an eye
      to providing groundwork for future improvements:
      
      * __access_remote_vm() and access_remote_vm() are functionally identical,
        so make the former static such that in future we can potentially change
        the external-facing implementation details of this function.
      
      * Extend is_valid_gup_args() to cover the missing FOLL_TOUCH case, and
        simplify things by defining INTERNAL_GUP_FLAGS to check against.
      
      * Adjust __get_user_pages_locked() to explicitly treat a failure to pin any
        pages as an error in all circumstances other than FOLL_NOWAIT being
        specified, bringing it in line with the nommu implementation of this
        function.
      
      * (With many thanks to Arnd who suggested this in the first instance)
        Update get_user_page_vma_remote() to explicitly only return a page or an
        error, simplifying the interface and avoiding the questionable
        IS_ERR_OR_NULL() pattern.
      
      
      This patch (of 4):
      
      access_remote_vm() passes through parameters to __access_remote_vm()
      directly, so remove the __access_remote_vm() function from mm.h and use
      access_remote_vm() in the one caller that needs it (ptrace_access_vm()).
      
      This allows future adjustments to the GUP-internal __access_remote_vm()
      function while keeping the access_remote_vm() function stable.
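
      For reference, the externally visible wrapper is then just a pass-through
      (sketch):

          int access_remote_vm(struct mm_struct *mm, unsigned long addr,
                               void *buf, int len, unsigned int gup_flags)
          {
                  return __access_remote_vm(mm, addr, buf, len, gup_flags);
          }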
      
      Link: https://lkml.kernel.org/r/cover.1696288092.git.lstoakes@gmail.com
      Link: https://lkml.kernel.org/r/f7877c5039ce1c202a514a8aeeefc5cdd5e32d19.1696288092.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memory: move exclusivity detection in do_wp_page() into wp_can_reuse_anon_folio() · dec078cc
      David Hildenbrand authored
      Let's clean up do_wp_page() a bit, removing two labels and making it
      easier to read.
      
      wp_can_reuse_anon_folio() now only operates on the whole folio.  Move the
      SetPageAnonExclusive() out into do_wp_page().  No need to do this under
      page lock -- the page table lock is sufficient.
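
      The resulting shape in do_wp_page() is roughly (sketch; the exact
      argument order of the new helper is assumed):

          if (folio && folio_test_anon(folio) &&
              wp_can_reuse_anon_folio(folio, vma)) {
                  /* Safe under the page table lock; no folio lock needed. */
                  if (!PageAnonExclusive(vmf->page))
                          SetPageAnonExclusive(vmf->page);
                  /* ... proceed to reuse the page for this write fault ... */
          }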
      
      Link: https://lkml.kernel.org/r/20231002142949.235104-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap() · 06968625
      David Hildenbrand authored
      Let's convert it to consume a folio.
      
      [akpm@linux-foundation.org: fix kerneldoc]
      Link: https://lkml.kernel.org/r/20231002142949.235104-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: move SetPageAnonExclusive() out of page_move_anon_rmap() · 5ca43289
      David Hildenbrand authored
      Patch series "mm/rmap: convert page_move_anon_rmap() to
      folio_move_anon_rmap()".
      
      Convert page_move_anon_rmap() to folio_move_anon_rmap(), letting the
      callers handle PageAnonExclusive.  I'm including cleanup patch #3 because
      it fits into the picture and can be done cleaner by the conversion.
      
      
      This patch (of 3):
      
      Let's move it into the caller: there is a difference between whether an
      anon folio can only be mapped by one process (e.g., into one VMA), and
      whether it is truly exclusive (e.g., no references -- including GUP --
      from other processes).
      
      Further, for large folios the page might not actually be pointing at the
      head page of the folio, so it better be handled in the caller.  This is a
      preparation for converting page_move_anon_rmap() to consume a folio.
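
      The resulting caller pattern is roughly (sketch):

          /* Re-target the whole folio to the new anon_vma ... */
          folio_move_anon_rmap(folio, vma);
          /* ... and mark only the faulting (possibly tail) page exclusive. */
          SetPageAnonExclusive(page);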
      
      Link: https://lkml.kernel.org/r/20231002142949.235104-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20231002142949.235104-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: handle write faults to RO pages under the VMA lock · 4a68fef1
      Matthew Wilcox (Oracle) authored
      I think this is a pretty rare occurrence, but for consistency handle
      faults with the VMA lock held the same way that we handle other faults
      with the VMA lock held.
      
      Link: https://lkml.kernel.org/r/20231006195318.4087158-7-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: handle read faults under the VMA lock · 12214eba
      Matthew Wilcox (Oracle) authored
      Most file-backed faults are already handled through ->map_pages(), but if
      we need to do I/O we'll come this way.  Since filemap_fault() is now safe
      to be called under the VMA lock, we can handle these faults under the VMA
      lock now.
      
      Link: https://lkml.kernel.org/r/20231006195318.4087158-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: handle COW faults under the VMA lock · 4de8c93a
      Matthew Wilcox (Oracle) authored
      If the page is not currently present in the page tables, we need to call
      the page fault handler to find out which page we're supposed to COW, so we
      need to both check that there is already an anon_vma and that the fault
      handler doesn't need the mmap_lock.
      
      Link: https://lkml.kernel.org/r/20231006195318.4087158-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: handle shared faults under the VMA lock · 4ed43798
      Matthew Wilcox (Oracle) authored
      There are many implementations of ->fault and some of them depend on
      mmap_lock being held.  All vm_ops that implement ->map_pages() end up
      calling filemap_fault(), which I have audited to be sure it does not rely
      on mmap_lock.  So (for now) key off ->map_pages existing as a flag to
      indicate that it's safe to call ->fault while only holding the vma lock.
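
      In other words, the fallback check before calling ->fault reduces to
      something like this (sketch):

          if ((vmf->flags & FAULT_FLAG_VMA_LOCK) && !vma->vm_ops->map_pages) {
                  /* Not audited to be safe under the VMA lock; retry
                   * with the mmap_lock held instead. */
                  vma_end_read(vma);
                  return VM_FAULT_RETRY;
          }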
      
      Link: https://lkml.kernel.org/r/20231006195318.4087158-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: call wp_page_copy() under the VMA lock · 164b06f2
      Matthew Wilcox (Oracle) authored
      It is usually safe to call wp_page_copy() under the VMA lock.  The only
      unsafe situation is when no anon_vma has been allocated for this VMA, and
      we have to look at adjacent VMAs to determine if their anon_vma can be
      shared.  Since this happens only for the first COW of a page in this VMA,
      the majority of calls to wp_page_copy() do not need to fall back to the
      mmap_sem.
      
      Add vmf_anon_prepare() as an alternative to anon_vma_prepare() which will
      return RETRY if we currently hold the VMA lock and need to allocate an
      anon_vma.  This lets us drop the check in do_wp_page().
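
      A sketch of what vmf_anon_prepare() amounts to (simplified):

          static vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
          {
                  struct vm_area_struct *vma = vmf->vma;

                  if (likely(vma->anon_vma))
                          return 0;
                  if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
                          /* Looking at adjacent VMAs needs the mmap_lock. */
                          vma_end_read(vma);
                          return VM_FAULT_RETRY;
                  }
                  if (__anon_vma_prepare(vma))
                          return VM_FAULT_OOM;
                  return 0;
          }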
      
      Link: https://lkml.kernel.org/r/20231006195318.4087158-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: UFFD_FEATURE_WP_ASYNC · d61ea1cb
      Peter Xu authored
      Patch series "Implement IOCTL to get and optionally clear info about
      PTEs", v33.
      
      *Motivation*
      The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the
      Windows GetWriteWatch() and ResetWriteWatch() syscalls [1].
      GetWriteWatch() retrieves the addresses of the pages that have been
      written to in a region of virtual memory.
      
      This syscall is used in Windows applications and games, and is currently
      emulated in a pretty slow manner in userspace.  Our purpose is to enhance
      the kernel so that it can be translated efficiently.  Some out-of-tree
      hack patches are already used to emulate it efficiently in some kernels;
      we intend to replace those with these patches, so gaming on Linux as a
      whole can benefit.  That means there would be tons of users of this code.
      
      CRIU use case [2] was mentioned by Andrei and Danylo:
      > Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
      > MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
      > shadow memory [4]. Being able to migrate such binaries allows to highly
      > reduce the amount of work needed to identify and fix post-migration
      > crashes, which happen constantly.
      
      Andrei defines the following uses of this code:
      * it is more granular and allows us to track changed pages more
        effectively. The current interface can clear dirty bits for the entire
        process only. In addition, reading info about pages is a separate
        operation. It means we must freeze the process to read information
        about all its pages, reset dirty bits, only then we can start dumping
        pages. The information about pages becomes more and more outdated,
        while we are processing pages. The new interface solves both these
        downsides. First, it allows us to read pte bits and clear the
        soft-dirty bit atomically. It means that CRIU will not need to freeze
        processes to pre-dump their memory. Second, it clears soft-dirty bits
        for a specified region of memory. It means CRIU will have actual info
        about pages to the moment of dumping them.
      * The new interface has to be much faster because basic page filtering
        is happening in the kernel. With the old interface, we have to read
        pagemap for each page.
      
      *Implementation Evolution (Short Summary)*
      From the definition of GetWriteWatch(), we feel that the kernel's
      soft-dirty feature can be used under the hood, with some additions like:
      * reset soft-dirty flag for only a specific region of memory instead of
      clearing the flag for the entire process
      * get and clear soft-dirty flag for a specific region atomically
      
      So we decided to use an ioctl on the pagemap file to read and/or reset the
      soft-dirty flag. But using the soft-dirty flag, we sometimes get extra
      pages which weren't even written; they had become soft-dirty because of
      VMA merging and the VM_SOFTDIRTY flag. This breaks the definition of
      GetWriteWatch(). We were able to bypass this shortcoming by ignoring
      VM_SOFTDIRTY, until David reported that mprotect etc. messes up the
      soft-dirty flag while VM_SOFTDIRTY is ignored [5]. This wasn't happening
      until [6] was introduced. We discussed whether we could revert those
      patches, but could not reach any conclusion. At this point, I made a
      couple of attempts to solve this whole VM_SOFTDIRTY issue by correcting
      the soft-dirty implementation:
      * [7] Correct the bug that was fixed wrongly back in 2014. It had the
      potential to cause regressions, so we left it behind.
      * [8] Keep a list of the soft-dirty parts of a VMA across splits and
      merges. The reply was not to increase the size of the VMA by 8 bytes.
      
      At this point, we left soft-dirty behind, considering it too delicate, and
      userfaultfd [9] seemed like the only way forward. From there onward, we
      have been basing the soft-dirty emulation on the userfaultfd write-protect
      feature, where the kernel resolves the faults itself when the WP_ASYNC
      feature is used. It was straightforward to add the WP_ASYNC feature to
      userfaultfd. Now only those pages which have really been written are
      reported as dirty or written-to. (PS: another userfaultfd feature,
      WP_UNPOPULATED, is required to avoid pre-faulting memory before
      write-protecting [9].)
      
      All the different masks were added at the request of the CRIU developers,
      to make the interface more generic and useful.
      
      [1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch
      [2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
      [3] https://github.com/google/sanitizers
      [4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
      [5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
      [6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
      [7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
      [8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
      [9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
      [10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
      
      
      This patch (of 6):
      
      Add a new userfaultfd-wp feature, UFFD_FEATURE_WP_ASYNC, which allows
      userfaultfd write-protect faults to be resolved by the kernel directly.
      
      It can be used like a high-accuracy version of soft-dirty, without vma
      modifications during tracking, and also with ranged support by default
      rather than for a whole mm when resetting the protections, thanks to the
      existence of ioctl(UFFDIO_WRITEPROTECT).
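
      A rough userspace usage sketch (error handling omitted; uffd, area and
      size are assumed to have been set up elsewhere via userfaultfd(2) and
      mmap()):

          struct uffdio_api api = {
                  .api = UFFD_API,
                  .features = UFFD_FEATURE_WP_ASYNC | UFFD_FEATURE_WP_UNPOPULATED,
          };
          ioctl(uffd, UFFDIO_API, &api);

          struct uffdio_register reg = {
                  .range = { .start = (unsigned long)area, .len = size },
                  .mode  = UFFDIO_REGISTER_MODE_WP,
          };
          ioctl(uffd, UFFDIO_REGISTER, &reg);

          /* Write-protect the range to start (or reset) tracking; later
           * writes are resolved by the kernel and clear the uffd-wp bit,
           * which is the "page was written" signal. */
          struct uffdio_writeprotect wp = {
                  .range = { .start = (unsigned long)area, .len = size },
                  .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
          };
          ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);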
      
      Several goals of such a dirty tracking interface:
      
      1. All types of memory should be supported and traceable. This is natural
         for soft-dirty but worth mentioning when the context is userfaultfd,
         because it used to only support anon/shmem/hugetlb. The problem is that
         for dirty tracking purposes these three types may not be enough, and it's
         legal to track anything, e.g. any page cache writes from mmap.
      
      2. Protections can be applied to part of a memory range, without vma
         split/merge fuss.  The hope is that the tracking itself should not
         affect any vma layout change.  It also helps when a reset happens because
         the reset will not need the mmap write lock, which can block the tracee.
      
      3. Accuracy needs to be maintained.  This means we need pte markers to work
         on any type of VMA.
      
      One could argue that the whole concept of async dirty tracking is not
      really close to what userfaultfd fundamentally used to be: it's not "a
      fault to be serviced by userspace" anymore. However, using userfaultfd-wp
      here as a framework is convenient for us in at least the following ways:
      
      1. The VM_UFFD_WP vma flag has a very good name to suit something like
         this, so we don't need VM_YET_ANOTHER_SOFT_DIRTY. Just use a new
         feature bit to distinguish it from a sync version of uffd-wp
         registration.
      
      2. The PTE markers logic can be leveraged across the whole kernel to
         maintain the uffd-wp bit as long as the arch supports it. This also
         applies to this case, where the uffd-wp bit will be a hint to dirty
         information and will not easily get lost (e.g. when some page cache
         ptes get zapped).
      
      3. The ioctl(UFFDIO_WRITEPROTECT) interface can be reused for either
         starting or resetting tracking on a range of memory, whereas there is
         no counterpart in the old soft-dirty world, so a new design would
         otherwise need a new interface.
      
      That commonality is understandable, because uffd-wp was fundamentally a
      similar idea of write-protecting pages, just like soft-dirty.
      
      This implementation allows WP_ASYNC to imply WP_UNPOPULATED, because so
      far WP_ASYNC does not seem to be usable without WP_UNPOPULATED.  This also
      gives us the chance to modify the implementation of WP_ASYNC in case it no
      longer depends on WP_UNPOPULATED in future kernels.  It's also fine to
      imply that because both features rely on the PTE_MARKER_UFFD_WP config
      option, so they'll show up together (or both be missing) in an UFFDIO_API
      probe.
      
      vma_can_userfault() now allows any VMA if the userfaultfd registration is
      only about async uffd-wp.  So we can track dirty for all kinds of memory
      including generic file systems (like XFS, EXT4 or BTRFS).
      
      One trick worth mentioning in do_wp_page() is that we need to manually update
      vmf->orig_pte here because it can be used later with a pte_same() check -
      this path always has FAULT_FLAG_ORIG_PTE_VALID set in the flags.
      
      The major defect of this approach to dirty tracking is that we need to
      populate the page tables when tracking starts.  Soft-dirty doesn't work
      like that.  This is undesirable when the range of memory to track is huge
      and unpopulated (e.g., tracking updates on a 10G file with mmap() on top,
      without having any page cache installed yet).  One way to improve this is
      to allow pte markers to exist at levels above PTE (PMD and larger).  That
      would not change the interface if implemented, so we can leave it for
      later.
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-1-usama.anjum@collabora.com
      Link: https://lkml.kernel.org/r/20230821141518.870589-2-usama.anjum@collabora.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Co-developed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlbfs: close race between MADV_DONTNEED and page fault · 2820b0f0
      Rik van Riel authored
      Malloc libraries, like jemalloc and tcmalloc, take decisions on when to
      call madvise independently from the code in the main application.
      
      This sometimes results in the application page faulting on an address,
      right after the malloc library has shot down the backing memory with
      MADV_DONTNEED.
      
      Usually this is harmless, because we always have some 4kB pages sitting
      around to satisfy a page fault.  However, with hugetlbfs, systems often
      allocate only the exact number of huge pages that the application wants.
      
      Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
      any lock taken on the page fault path, which can open up the following
      race condition:
      
             CPU 1                            CPU 2
      
             MADV_DONTNEED
             unmap page
             shoot down TLB entry
                                             page fault
                                             fail to allocate a huge page
                                             killed with SIGBUS
             free page
      
      Fix that race by pulling the locking from __unmap_hugepage_range_final()
      into helper functions called from zap_page_range_single().  This ensures
      page faults stay locked out of the MADV_DONTNEED VMA until the huge pages
      have actually been freed.
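
      The intent, roughly (a sketch of the ordering, not the literal diff):

          /* Hold the hugetlb VMA lock (which the fault path also takes)
           * across the entire zap, including the TLB flush and the actual
           * freeing, so a concurrent fault cannot observe the window
           * between "PTE cleared" and "huge page freed". */
          hugetlb_vma_lock_write(vma);
          i_mmap_lock_write(vma->vm_file->f_mapping);

          /* ... unmap the range, flush TLBs, free the huge pages ... */

          i_mmap_unlock_write(vma->vm_file->f_mapping);
          hugetlb_vma_unlock_write(vma);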
      
      Link: https://lkml.kernel.org/r/20231006040020.3677377-4-riel@surriel.com
      Fixes: 04ada095 ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>