1. 13 May, 2022 11 commits
    • SeongJae Park's avatar
      mm/damon/vaddr: register a damon_operations for fixed virtual address ranges monitoring · de6d0154
      SeongJae Park authored
      Patch series "support fixed virtual address ranges monitoring".
      
      The monitoring operations set for virtual address spaces automatically
      updates the monitoring target regions to cover entire mappings of the
      virtual address spaces as much as possible.  Some users could have more
      information about their programs than kernel and therefore have interest
      in not entire regions but only specific regions.  For such cases, the
      automatic monitoring target regions updates are only unnecessary overhead
      or distractions.
      
      This patchset adds supports for the use case on DAMON's kernel API
      (DAMON_OPS_FVADDR) and sysfs interface ('fvaddr' keyword for 'operations'
      sysfs file).
      
      
      This patch (of 3):
      
      The monitoring operations set for virtual address spaces automatically
      updates the monitoring target regions to cover entire mappings of the
      virtual address spaces as much as possible.  Some users could have more
      information about their programs than kernel and therefore have interest
      in not entire regions but only specific regions.  For such cases, the
      automatic monitoring target regions updates are only unnecessary overheads
      or distractions.
      
      For such cases, DAMON's API users can simply set the '->init()' and
      '->update()' of the DAMON context's '->ops' NULL, and set the target
      monitoring regions when creating the context.  But, that would be a dirty
      hack.  Worse yet, the hack is unavailable for DAMON user space interface
      users.
      
      To support the use case in a clean way that can easily exported to the
      user space, this commit adds another monitoring operations set called
      'fvaddr', which is same to 'vaddr' but does not automatically update the
      monitoring regions.  Instead, it will only respect the virtual address
      regions which have explicitly passed at the initial context creation.
      
      Note that this commit leave sysfs interface not supporting the feature
      yet.  The support will be made in a following commit.
      
      Link: https://lkml.kernel.org/r/20220426231750.48822-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20220426231750.48822-2-sj@kernel.orgSigned-off-by: default avatarSeongJae Park <sj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      de6d0154
    • SeongJae Park's avatar
      Docs/{ABI,admin-guide}/damon: document 'avail_operations' sysfs file · 2fe60ec9
      SeongJae Park authored
      This commit updates the DAMON ABI and usage documents for the new sysfs
      file, 'avail_operations'.
      
      Link: https://lkml.kernel.org/r/20220426203843.45238-5-sj@kernel.orgSigned-off-by: default avatarSeongJae Park <sj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2fe60ec9
    • SeongJae Park's avatar
      selftets/damon/sysfs: test existence and permission of avail_operations · f893abbd
      SeongJae Park authored
      This commit adds a selftest test case for ensuring the existence and the
      permission (read-only) of the 'avail_oprations' DAMON sysfs file.
      
      Link: https://lkml.kernel.org/r/20220426203843.45238-4-sj@kernel.orgSigned-off-by: default avatarSeongJae Park <sj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f893abbd
    • SeongJae Park's avatar
      mm/damon/sysfs: add a file for listing available monitoring ops · 0f2cb588
      SeongJae Park authored
      DAMON programming interface users can know if specific monitoring ops set
      is registered or not using 'damon_is_registered_ops()', but there is no
      such method for the user space.  To help the case, this commit adds a new
      DAMON sysfs file called 'avail_operations' under each context directory
      for listing available monitoring ops.  Reading the file will list each
      registered monitoring ops on each line.
      
      Link: https://lkml.kernel.org/r/20220426203843.45238-3-sj@kernel.orgSigned-off-by: default avatarSeongJae Park <sj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0f2cb588
    • SeongJae Park's avatar
      mm/damon/core: add a function for damon_operations registration checks · 152e5617
      SeongJae Park authored
      Patch series "mm/damon: allow users know which monitoring ops are available".
      
      DAMON users can configure it for vaious address spaces including virtual
      address spaces and the physical address space by setting its monitoring
      operations set with appropriate one for their purpose.  However, there is
      no celan and simple way to know exactly which monitoring operations sets
      are available on the currently running kernel.
      
      This patchset adds functions for the purpose on DAMON's kernel API
      ('damon_is_registered_ops()') and sysfs interface ('avail_operations' file
      under each context directory).
      
      
      This patch (of 4):
      
      To know if a specific 'damon_operations' is registered, users need to
      check the kernel config or try 'damon_select_ops()' with the ops of the
      question, and then see if it successes.  In the latter case, the user
      should also revert the change.  To make the process simple and convenient,
      this commit adds a function for checking if a specific 'damon_operations'
      is registered or not.
      
      Link: https://lkml.kernel.org/r/20220426203843.45238-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20220426203843.45238-2-sj@kernel.orgSigned-off-by: default avatarSeongJae Park <sj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      152e5617
    • Fabio M. De Francesco's avatar
      mm/highmem: VM_BUG_ON() if offset + len > PAGE_SIZE · f38adfef
      Fabio M. De Francesco authored
      Add VM_BUG_ON() bounds checking to make sure that, if "offset + len>
      PAGE_SIZE", memset() does not corrupt data in adjacent pages.
      
      Mainly to match all the similar functions in highmem.h.
      
      Link: https://lkml.kernel.org/r/20220426193020.8710-1-fmdefrancesco@gmail.comSigned-off-by: default avatarFabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Collingbourne <pcc@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f38adfef
    • huangshaobo's avatar
      kfence: enable check kfence canary on panic via boot param · 3c81b3bb
      huangshaobo authored
      Out-of-bounds accesses that aren't caught by a guard page will result in
      corruption of canary memory.  In pathological cases, where an object has
      certain alignment requirements, an out-of-bounds access might never be
      caught by the guard page.  Such corruptions, however, are only detected on
      kfree() normally.  If the bug causes the kernel to panic before kfree(),
      KFENCE has no opportunity to report the issue.  Such corruptions may also
      indicate failing memory or other faults.
      
      To provide some more information in such cases, add the option to check
      canary bytes on panic.  This might help narrow the search for the panic
      cause; but, due to only having the allocation stack trace, such reports
      are difficult to use to diagnose an issue alone.  In most cases, such
      reports are inactionable, and is therefore an opt-in feature (disabled by
      default).
      
      [akpm@linux-foundation.org: add __read_mostly, per Marco]
      Link: https://lkml.kernel.org/r/20220425022456.44300-1-huangshaobo6@huawei.comSigned-off-by: default avatarhuangshaobo <huangshaobo6@huawei.com>
      Suggested-by: default avatarchenzefeng <chenzefeng2@huawei.com>
      Reviewed-by: default avatarMarco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Xiaoming Ni <nixiaoming@huawei.com>
      Cc: Wangbing <wangbing6@huawei.com>
      Cc: Jubin Zhong <zhongjubin@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3c81b3bb
    • Mina Almasry's avatar
      hugetlbfs: fix hugetlbfs_statfs() locking · 4b25f030
      Mina Almasry authored
      After commit db71ef79 ("hugetlb: make free_huge_page irq safe"), the
      subpool lock should be locked with spin_lock_irq() and all call sites was
      modified as such, except for the ones in hugetlbfs_statfs().
      
      Link: https://lkml.kernel.org/r/20220429202207.3045-1-almasrymina@google.com
      Fixes: db71ef79 ("hugetlb: make free_huge_page irq safe")
      Signed-off-by: default avatarMina Almasry <almasrymina@google.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4b25f030
    • Nadav Amit's avatar
      mm: avoid unnecessary flush on change_huge_pmd() · 4f831457
      Nadav Amit authored
      Calls to change_protection_range() on THP can trigger, at least on x86,
      two TLB flushes for one page: one immediately, when pmdp_invalidate() is
      called by change_huge_pmd(), and then another one later (that can be
      batched) when change_protection_range() finishes.
      
      The first TLB flush is only necessary to prevent the dirty bit (and with a
      lesser importance the access bit) from changing while the PTE is modified.
      However, this is not necessary as the x86 CPUs set the dirty-bit
      atomically with an additional check that the PTE is (still) present.  One
      caveat is Intel's Knights Landing that has a bug and does not do so.
      
      Leverage this behavior to eliminate the unnecessary TLB flush in
      change_huge_pmd().  Introduce a new arch specific pmdp_invalidate_ad()
      that only invalidates the access and dirty bit from further changes.
      
      Link: https://lkml.kernel.org/r/20220401180821.1986781-4-namit@vmware.comSigned-off-by: default avatarNadav Amit <namit@vmware.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4f831457
    • Nadav Amit's avatar
      mm/mprotect: do not flush when not required architecturally · c9fe6656
      Nadav Amit authored
      Currently, using mprotect() to unprotect a memory region or uffd to
      unprotect a memory region causes a TLB flush.  However, in such cases the
      PTE is often not modified (i.e., remain RO) and therefore not TLB flush is
      needed.
      
      Add an arch-specific pte_needs_flush() which tells whether a TLB flush is
      needed based on the old PTE and the new one.  Implement an x86
      pte_needs_flush().
      
      Always flush the TLB when it is architecturally needed even when skipping
      a TLB flush might only result in a spurious page-faults by skipping the
      flush.
      
      Even with such conservative manner, we can in the future further refine
      the checks to test whether a PTE is present by only considering the
      architectural _PAGE_PRESENT flag instead of {pte|pmd}_preesnt().  For not
      be careful and use the latter.
      
      Link: https://lkml.kernel.org/r/20220401180821.1986781-3-namit@vmware.comSigned-off-by: default avatarNadav Amit <namit@vmware.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c9fe6656
    • Nadav Amit's avatar
      mm/mprotect: use mmu_gather · 4a18419f
      Nadav Amit authored
      Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.
      
      This patchset is intended to remove unnecessary TLB flushes during
      mprotect() syscalls.  Once this patch-set make it through, similar and
      further optimizations for MADV_COLD and userfaultfd would be possible.
      
      Basically, there are 3 optimizations in this patch-set:
      
      1. Use TLB batching infrastructure to batch flushes across VMAs and do
         better/fewer flushes.  This would also be handy for later userfaultfd
         enhancements.
      
      2. Avoid unnecessary TLB flushes.  This optimization is the one that
         provides most of the performance benefits.  Unlike previous versions,
         we now only avoid flushes that would not result in spurious
         page-faults.
      
      3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
         prevent the A/D bits from changing.
      
      Andrew asked for some benchmark numbers.  I do not have an easy
      determinate macrobenchmark in which it is easy to show benefit.  I
      therefore ran a microbenchmark: a loop that does the following on
      anonymous memory, just as a sanity check to see that time is saved by
      avoiding TLB flushes.  The loop goes:
      
      	mprotect(p, PAGE_SIZE, PROT_READ)
      	mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
      	*p = 0; // make the page writable
      
      The test was run in KVM guest with 1 or 2 threads (the second thread was
      busy-looping).  I measured the time (cycles) of each operation:
      
      		1 thread		2 threads
      		mmots	+patch		mmots	+patch
      PROT_READ	3494	2725 (-22%)	8630	7788 (-10%)
      PROT_READ|WRITE	3952	2724 (-31%)	9075	2865 (-68%)
      
      [ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]
      
      The exact numbers are really meaningless, but the benefit is clear.  There
      are 2 interesting results though.  
      
      (1) PROT_READ is cheaper, while one can expect it not to be affected. 
      This is presumably due to TLB miss that is saved
      
      (2) Without memory access (*p = 0), the speedup of the patch is even
      greater.  In that scenario mprotect(PROT_READ) also avoids the TLB flush. 
      As a result both operations on the patched kernel take roughly ~1500
      cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
      high as presented in the table.
      
      
      This patch (of 3):
      
      change_pXX_range() currently does not use mmu_gather, but instead
      implements its own deferred TLB flushes scheme.  This both complicates the
      code, as developers need to be aware of different invalidation schemes,
      and prevents opportunities to avoid TLB flushes or perform them in finer
      granularity.
      
      The use of mmu_gather for modified PTEs has benefits in various scenarios
      even if pages are not released.  For instance, if only a single page needs
      to be flushed out of a range of many pages, only that page would be
      flushed.  If a THP page is flushed, on x86 a single TLB invlpg instruction
      can be used instead of 512 instructions (or a full TLB flush, which would
      Linux would actually use by default).  mprotect() over multiple VMAs
      requires a single flush.
      
      Use mmu_gather in change_pXX_range().  As the pages are not released, only
      record the flushed range using tlb_flush_pXX_range().
      
      Handle THP similarly and get rid of flush_cache_range() which becomes
      redundant since tlb_start_vma() calls it when needed.
      
      Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
      Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.comSigned-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4a18419f
  2. 10 May, 2022 29 commits
    • NeilBrown's avatar
      VFS: add FMODE_CAN_ODIRECT file flag · a2ad63da
      NeilBrown authored
      Currently various places test if direct IO is possible on a file by
      checking for the existence of the direct_IO address space operation.
      This is a poor choice, as the direct_IO operation may not be used - it is
      only used if the generic_file_*_iter functions are called for direct IO
      and some filesystems - particularly NFS - don't do this.
      
      Instead, introduce a new f_mode flag: FMODE_CAN_ODIRECT and change the
      various places to check this (avoiding pointer dereferences).
      do_dentry_open() will set this flag if ->direct_IO is present, so
      filesystems do not need to be changed.
      
      NFS *is* changed, to set the flag explicitly and discard the direct_IO
      entry in the address_space_operations for files.
      
      Other filesystems which currently use noop_direct_IO could usefully be
      changed to set this flag instead.
      
      Link: https://lkml.kernel.org/r/164859778128.29473.15189737957277399416.stgit@noble.brownReviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a2ad63da
    • NeilBrown's avatar
      MM: handle THP in swap_*page_fs() - count_vm_events() · 6341a446
      NeilBrown authored
      We need to use count_swpout_vm_event() for sio_write_complete() to get
      correct counting.
      
      Note that THP swap in (if it ever happens) is current accounted 1 for each
      page, whether HUGE or normal.  This is different from swap-out accounting.
      
      This patch should be squashed into
          MM: handle THP in swap_*page_fs()
      
      Link: https://lkml.kernel.org/r/165146948934.24404.5909750610552745025@noble.neil.brown.nameSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reported-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6341a446
    • NeilBrown's avatar
      mm: handle THP in swap_*page_fs() · a1a0dfd5
      NeilBrown authored
      Pages passed to swap_readpage()/swap_writepage() are not necessarily all
      the same size - there may be transparent-huge-pages involves.
      
      The BIO paths of swap_*page() handle this correctly, but the SWP_FS_OPS
      path does not.
      
      So we need to use thp_size() to find the size, not just assume PAGE_SIZE,
      and we need to track the total length of the request, not just assume it
      is "page * PAGE_SIZE".
      
      Link: https://lkml.kernel.org/r/165119301488.15698.9457662928942765453.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reported-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a1a0dfd5
    • NeilBrown's avatar
      mm: submit multipage write for SWP_FS_OPS swap-space · 2282679f
      NeilBrown authored
      swap_writepage() is given one page at a time, but may be called repeatedly
      in succession.
      
      For block-device swapspace, the blk_plug functionality allows the multiple
      pages to be combined together at lower layers.  That cannot be used for
      SWP_FS_OPS as blk_plug may not exist - it is only active when
      CONFIG_BLOCK=y.  Consequently all swap reads over NFS are single page
      reads.
      
      With this patch we pass a pointer-to-pointer via the wbc.  swap_writepage
      can store state between calls - much like the pointer passed explicitly to
      swap_readpage.  After calling swap_writepage() some number of times, the
      state will be passed to swap_write_unplug() which can submit the combined
      request.
      
      Link: https://lkml.kernel.org/r/164859778128.29473.5191868522654408537.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2282679f
    • NeilBrown's avatar
      mm: submit multipage reads for SWP_FS_OPS swap-space · 5169b844
      NeilBrown authored
      swap_readpage() is given one page at a time, but may be called repeatedly
      in succession.
      
      For block-device swap-space, the blk_plug functionality allows the
      multiple pages to be combined together at lower layers.  That cannot be
      used for SWP_FS_OPS as blk_plug may not exist - it is only active when
      CONFIG_BLOCK=y.  Consequently all swap reads over NFS are single page
      reads.
      
      With this patch we pass in a pointer-to-pointer when swap_readpage can
      store state between calls - much like the effect of blk_plug.  After
      calling swap_readpage() some number of times, the state will be passed to
      swap_read_unplug() which can submit the combined request.
      
      Link: https://lkml.kernel.org/r/164859778127.29473.14059420492644907783.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5169b844
    • NeilBrown's avatar
      doc: update documentation for swap_activate and swap_rw · cba738f6
      NeilBrown authored
      This documentation for ->swap_activate() has been out-of-date for a long
      time.  This patch updates it to match recent changes, and adds
      documentation for the associated ->swap_rw()
      
      Link: https://lkml.kernel.org/r/164859778126.29473.6778751233552859461.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      cba738f6
    • NeilBrown's avatar
      mm: perform async writes to SWP_FS_OPS swap-space using ->swap_rw · 7eadabc0
      NeilBrown authored
      This patch switches swap-out to SWP_FS_OPS swap-spaces to use ->swap_rw
      and makes the writes asynchronous, like they are for other swap spaces.
      
      To make it async we need to allocate the kiocb struct from a mempool. 
      This may block, but won't block as long as waiting for the write to
      complete.  At most it will wait for some previous swap IO to complete.
      
      Link: https://lkml.kernel.org/r/164859778126.29473.12399585304843922231.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7eadabc0
    • NeilBrown's avatar
      nfs: rename nfs_direct_IO and use as ->swap_rw · eb79f3af
      NeilBrown authored
      The nfs_direct_IO() exists to support SWAP IO, but hasn't worked for a
      while.  We now need a ->swap_rw function which behaves slightly
      differently, returning zero for success rather than a byte count.
      
      So modify nfs_direct_IO accordingly, rename it, and use it as the
      ->swap_rw function.
      
      Link: https://lkml.kernel.org/r/165119301493.15698.7491285551903597618.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> (on Renesas RSK+RZA1 with 32 MiB of SDRAM)
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      eb79f3af
    • NeilBrown's avatar
      mm: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space · e1209d3a
      NeilBrown authored
      swap currently uses ->readpage to read swap pages.  This can only request
      one page at a time from the filesystem, which is not most efficient.
      
      swap uses ->direct_IO for writes which while this is adequate is an
      inappropriate over-loading.  ->direct_IO may need to had handle allocate
      space for holes or other details that are not relevant for swap.
      
      So this patch introduces a new address_space operation: ->swap_rw.  In
      this patch it is used for reads, and a subsequent patch will switch writes
      to use it.
      
      No filesystem yet supports ->swap_rw, but that is not a problem because
      no filesystem actually works with filesystem-based swap.
      Only two filesystems set SWP_FS_OPS:
      - cifs sets the flag, but ->direct_IO always fails so swap cannot work.
      - nfs sets the flag, but ->direct_IO calls generic_write_checks()
        which has failed on swap files for several releases.
      
      To ensure that a NULL ->swap_rw isn't called, ->activate_swap() for both
      NFS and cifs are changed to fail if ->swap_rw is not set.  This can be
      removed if/when the function is added.
      
      Future patches will restore swap-over-NFS functionality.
      
      To submit an async read with ->swap_rw() we need to allocate a structure
      to hold the kiocb and other details.  swap_readpage() cannot handle
      transient failure, so we create a mempool to provide the structures.
      
      Link: https://lkml.kernel.org/r/164859778125.29473.13430559328221330589.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e1209d3a
    • NeilBrown's avatar
      mm: reclaim mustn't enter FS for SWP_FS_OPS swap-space · d791ea67
      NeilBrown authored
      If swap-out is using filesystem operations (SWP_FS_OPS), then it is not
      safe to enter the FS for reclaim.  So only down-grade the requirement for
      swap pages to __GFP_IO after checking that SWP_FS_OPS are not being used.
      
      This makes the calculation of "may_enter_fs" slightly more complex, so
      move it into a separate function.  with that done, there is little value
      in maintaining the bool variable any more.  So replace the may_enter_fs
      variable with a may_enter_fs() function.  This removes any risk for the
      variable becoming out-of-date.
      
      Link: https://lkml.kernel.org/r/164859778124.29473.16176717935781721855.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d791ea67
    • NeilBrown's avatar
      mm: move responsibility for setting SWP_FS_OPS to ->swap_activate · 4b60c0ff
      NeilBrown authored
      If a filesystem wishes to handle all swap IO itself (via ->direct_IO and
      ->readpage), rather than just providing devices addresses for
      submit_bio(), SWP_FS_OPS must be set.
      
      Currently the protocol for setting this it to have ->swap_activate return
      zero.  In that case SWP_FS_OPS is set, and add_swap_extent() is called for
      the entire file.
      
      This is a little clumsy as different return values for ->swap_activate
      have quite different meanings, and it makes it hard to search for which
      filesystems require SWP_FS_OPS to be set.
      
      So remove the special meaning of a zero return, and require the filesystem
      to set SWP_FS_OPS if it so desires, and to always call add_swap_extent()
      as required.
      
      Currently only NFS and CIFS return zero for add_swap_extent().
      
      Link: https://lkml.kernel.org/r/164859778123.29473.17908205846599043598.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4b60c0ff
    • NeilBrown's avatar
      mm: drop swap_dirty_folio · 4c4a7634
      NeilBrown authored
      folios that are written to swap are owned by the MM subsystem - not any
      filesystem.
      
      When such a folio is passed to a filesystem to be written out to a
      swap-file, the filesystem handles the data, but the folio itself does not
      belong to the filesystem.  So calling the filesystem's ->dirty_folio()
      address_space operation makes no sense.  This is for folios in the given
      address space, and a folio to be written to swap does not exist in the
      given address space.
      
      So drop swap_dirty_folio() which calls the address-space's
      ->dirty_folio(), and always use noop_dirty_folio(), which is appropriate
      for folios being swapped out.
      
      Link: https://lkml.kernel.org/r/164859778123.29473.6900942583784889976.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4c4a7634
    • NeilBrown's avatar
      mm: create new mm/swap.h header file · 014bb1de
      NeilBrown authored
      Patch series "MM changes to improve swap-over-NFS support".
      
      Assorted improvements for swap-via-filesystem.
      
      This is a resend of these patches, rebased on current HEAD.  The only
      substantial changes is that swap_dirty_folio has replaced
      swap_set_page_dirty.
      
      Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem.  It
      has previously worked for NFS but that broke a few releases back.  This
      series changes to use a new ->swap_rw rather than ->readpage and
      ->direct_IO.  It also makes other improvements.
      
      There is a companion series already in linux-next which fixes various
      issues with NFS.  Once both series land, a final patch is needed which
      changes NFS over to use ->swap_rw.
      
      
      This patch (of 10):
      
      Many functions declared in include/linux/swap.h are only used within mm/
      
      Create a new "mm/swap.h" and move some of these declarations there.
      Remove the redundant 'extern' from the function declarations.
      
      [akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
      Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
      Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      014bb1de
    • Joel Savitz's avatar
      selftests: clarify common error when running gup_test · 17de1e55
      Joel Savitz authored
      The gup_test binary will fail showing only the output of perror("open") in
      the case that /sys/kernel/debug/gup_test is not found. This will almost
      always be due to CONFIG_GUP_TEST not being set, which enables
      compilation of a kernel that provides this file.
      
      Add a short error message to clarify this failure and point the user to
      the solution.
      
      Link: https://lkml.kernel.org/r/20220502224942.995427-1-jsavitz@redhat.comSigned-off-by: default avatarJoel Savitz <jsavitz@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Nico Pache <npache@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      17de1e55
    • Yury Norov's avatar
      mm/gup: fix comments to pin_user_pages_*() · 0768c8de
      Yury Norov authored
      pin_user_pages API forces FOLL_PIN in gup_flags, which means that the API
      requires struct page **pages to be provided (not NULL).  However, the
      comment to pin_user_pages() clearly allows for passing in a NULL @pages
      argument.
      
      Remove the incorrect comments, and add WARN_ON_ONCE(!pages) calls to
      enforce the API.
      
      It has been independently spotted by Minchan Kim and confirmed with
      John Hubbard:
      
      https://lore.kernel.org/all/YgWA0ghrrzHONehH@google.com/
      
      Link: https://lkml.kernel.org/r/20220422015839.1274328-1-yury.norov@gmail.comSigned-off-by: default avatarYury Norov (NVIDIA) <yury.norov@gmail.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0768c8de
    • David Hildenbrand's avatar
      powerpc/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE for book3s · bff9beaa
      David Hildenbrand authored
      Right now, the last 5 bits (0x1f) of the swap entry are used for the type
      and the bit before that (0x20) is used for _PAGE_SWP_SOFT_DIRTY.  We
      cannot use 0x40, as that collides with _RPAGE_RSV1 -- contained in
      _PAGE_HPTEFLAGS.  The next candidate would be _RPAGE_SW3 (0x200) -- which
      is used for _PAGE_SOFT_DIRTY for !swp ptes.
      
      So let's just use _PAGE_SOFT_DIRTY for _PAGE_SWP_SOFT_DIRTY (to make it
      easier to grasp) and use 0x20 now for _PAGE_SWP_EXCLUSIVE.
      
      Link: https://lkml.kernel.org/r/20220329164329.208407-9-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      bff9beaa
    • David Hildenbrand's avatar
      powerpc/pgtable: remove _PAGE_BIT_SWAP_TYPE for book3s · 03ac1b71
      David Hildenbrand authored
      The swap type is simply stored in bits 0x1f of the swap pte.  Let's
      simplify by just getting rid of _PAGE_BIT_SWAP_TYPE.  It's not like that
      we can simply change it: _PAGE_SWP_SOFT_DIRTY would suddenly fall into
      _RPAGE_RSV1, which isn't possible and would make the
      BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_SOFT_DIRTY) angry.
      
      While at it, make it clearer which bit we're actually using for
      _PAGE_SWP_SOFT_DIRTY by just using the proper define and introduce and use
      SWP_TYPE_MASK.
      
      Link: https://lkml.kernel.org/r/20220329164329.208407-8-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      03ac1b71
    • David Hildenbrand's avatar
      s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE · 92cd58bd
      David Hildenbrand authored
      Let's use bit 52, which is unused.
      
      Link: https://lkml.kernel.org/r/20220329164329.208407-7-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      92cd58bd
    • David Hildenbrand's avatar
      s390/pgtable: cleanup description of swp pte layout · 8043d26c
      David Hildenbrand authored
      Bit 52 and bit 55 don't have to be zero: they only trigger a
      translation-specifiation exception if the PTE is marked as valid, which is
      not the case for swap ptes.
      
      Document which bits are used for what, and which ones are unused.
      
      Link: https://lkml.kernel.org/r/20220329164329.208407-6-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      8043d26c
    • David Hildenbrand's avatar
      arm64/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE · 570ef363
      David Hildenbrand authored
      Let's use one of the type bits: core-mm only supports 5, so there is no
      need to consume 6.
      
      Note that we might be able to reuse bit 1, but reusing bit 1 turned out
      problematic in the past for PROT_NONE handling; so let's play safe and use
      another bit.
      
      Link: https://lkml.kernel.org/r/20220329164329.208407-5-david@redhat.comReviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      570ef363
    • David Hildenbrand's avatar
      x86/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE · 3e20889c
      David Hildenbrand authored
      Let's use bit 3 to remember PG_anon_exclusive in swap ptes.
      
      [david@redhat.com: fix 32-bit swap layout]
        Link: https://lkml.kernel.org/r/d875c292-46b3-f281-65ae-71d0b0c6f592@redhat.com
      Link: https://lkml.kernel.org/r/20220329164329.208407-4-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3e20889c
    • David Hildenbrand's avatar
      mm/debug_vm_pgtable: add tests for __HAVE_ARCH_PTE_SWP_EXCLUSIVE · 210d1e8a
      David Hildenbrand authored
      Let's test that __HAVE_ARCH_PTE_SWP_EXCLUSIVE works as expected.
      
      Link: https://lkml.kernel.org/r/20220329164329.208407-3-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      210d1e8a
    • David Hildenbrand's avatar
      mm/swap: remember PG_anon_exclusive via a swp pte bit · 1493a191
      David Hildenbrand authored
      Patch series "mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages", v2.
      
      This series fixes memory corruptions when a GUP R/W reference (FOLL_WRITE
      | FOLL_GET) was taken on an anonymous page and COW logic fails to detect
      exclusivity of the page to then replacing the anonymous page by a copy in
      the page table: The GUP reference lost synchronicity with the pages mapped
      into the page tables.  This series focuses on x86, arm64, s390x and
      ppc64/book3s -- other architectures are fairly easy to support by
      implementing __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
      
      This primarily fixes the O_DIRECT memory corruptions that can happen on
      concurrent swapout, whereby we lose DMA reads to a page (modifying the
      user page by writing to it).
      
      O_DIRECT currently uses FOLL_GET for short-term (!FOLL_LONGTERM) DMA
      from/to a user page.  In the long run, we want to convert it to properly
      use FOLL_PIN, and John is working on it, but that might take a while and
      might not be easy to backport.  In the meantime, let's restore what used
      to work before we started modifying our COW logic: make R/W FOLL_GET
      references reliable as long as there is no fork() after GUP involved.
      
      This is just the natural follow-up of part 2, that will also further
      reduce "wrong COW" on the swapin path, for example, when we cannot remove
      a page from the swapcache due to concurrent writeback, or if we have two
      threads faulting on the same swapped-out page.  Fixing O_DIRECT is just a
      nice side-product
      
      This issue, including other related COW issues, has been summarized in [3]
      under 2):
      "
        2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET)
      
        It was discovered that we can create a memory corruption by reading a
        file via O_DIRECT to a part (e.g., first 512 bytes) of a page,
        concurrently writing to an unrelated part (e.g., last byte) of the same
        page, and concurrently write-protecting the page via clear_refs
        SOFTDIRTY tracking [6].
      
        For the reproducer, the issue is that O_DIRECT grabs a reference of the
        target page (via FOLL_GET) and clear_refs write-protects the relevant
        page table entry. On successive write access to the page from the
        process itself, we wrongly COW the page when resolving the write fault,
        resulting in a loss of synchronicity and consequently a memory corruption.
      
        While some people might think that using clear_refs in this combination
        is a corner cases, it turns out to be a more generic problem unfortunately.
      
        For example, it was just recently discovered that we can similarly
        create a memory corruption without clear_refs, simply by concurrently
        swapping out the buffer pages [7]. Note that we nowadays even use the
        swap infrastructure in Linux without an actual swap disk/partition: the
        prime example is zram which is enabled as default under Fedora [10].
      
        The root issue is that a write-fault on a page that has additional
        references results in a COW and thereby a loss of synchronicity
        and consequently a memory corruption if two parties believe they are
        referencing the same page.
      "
      
      We don't particularly care about R/O FOLL_GET references: they were never
      reliable and O_DIRECT doesn't expect to observe modifications from a page
      after DMA was started.
      
      Note that:
      * this only fixes the issue on x86, arm64, s390x and ppc64/book3s
        ("enterprise architectures"). Other architectures have to implement
        __HAVE_ARCH_PTE_SWP_EXCLUSIVE to achieve the same.
      * this does *not * consider any kind of fork() after taking the reference:
        fork() after GUP never worked reliably with FOLL_GET.
      * Not losing PG_anon_exclusive during swapout was the last remaining
        piece. KSM already makes sure that there are no other references on
        a page before considering it for sharing. Page migration maintains
        PG_anon_exclusive and simply fails when there are additional references
        (freezing the refcount fails). Only swapout code dropped the
        PG_anon_exclusive flag because it requires more work to remember +
        restore it.
      
      With this series in place, most COW issues of [3] are fixed on said
      architectures. Other architectures can implement
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE fairly easily.
      
      [1] https://lkml.kernel.org/r/20220329160440.193848-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      
      
      This patch (of 8):
      
      Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
      it.  We do this, to keep fork() logic on swap entries easy and efficient:
      for example, if we wouldn't clear it when unmapping, we'd have to lookup
      the page in the swapcache for each and every swap entry during fork() and
      clear PG_anon_exclusive if set.
      
      Instead, we want to store that information directly in the swap pte,
      protected by the page table lock, similarly to how we handle
      SWP_MIGRATION_READ_EXCLUSIVE for migration entries.  However, for actual
      swap entries, we don't want to mess with the swap type (e.g., still one
      bit) because it overcomplicates swap code.
      
      In try_to_unmap(), we already reject to unmap in case the page might be
      pinned, because we must not lose PG_anon_exclusive on pinned pages ever. 
      Checking if there are other unexpected references reliably *before*
      completely unmapping a page is unfortunately not really possible: THP
      heavily overcomplicate the situation.  Once fully unmapped it's easier --
      we, for example, make sure that there are no unexpected references *after*
      unmapping a page before starting writeback on that page.
      
      So, we currently might end up unmapping a page and clearing
      PG_anon_exclusive if that page has additional references, for example, due
      to a FOLL_GET.
      
      do_swap_page() has to re-determine if a page is exclusive, which will
      easily fail if there are other references on a page, most prominently GUP
      references via FOLL_GET.  This can currently result in memory corruptions
      when taking a FOLL_GET | FOLL_WRITE reference on a page even when fork()
      is never involved: try_to_unmap() will succeed, and when refaulting the
      page, it cannot be marked exclusive and will get replaced by a copy in the
      page tables on the next write access, resulting in writes via the GUP
      reference to the page being lost.
      
      In an ideal world, everybody that uses GUP and wants to modify page
      content, such as O_DIRECT, would properly use FOLL_PIN.  However, that
      conversion will take a while.  It's easier to fix what used to work in the
      past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive.  In addition,
      by remembering PG_anon_exclusive we can further reduce unnecessary COW in
      some cases, so it's the natural thing to do.
      
      So let's transfer the PG_anon_exclusive information to the swap pte and
      store it via an architecture-dependant pte bit; use that information when
      restoring the swap pte in do_swap_page() and unuse_pte().  During fork(),
      we simply have to clear the pte bit and are done.
      
      Of course, there is one corner case to handle: swap backends that don't
      support concurrent page modifications while the page is under writeback. 
      Special case these, and drop the exclusive marker.  Add a comment why that
      is just fine (also, reuse_swap_page() would have done the same in the
      past).
      
      In the future, we'll hopefully have all architectures support
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty stubs
      and the define completely.  Then, we can also convert
      SWP_MIGRATION_READ_EXCLUSIVE.  For architectures it's fairly easy to
      support: either simply use a yet unused pte bit that can be used for swap
      entries, steal one from the arch type bits if they exceed 5, or steal one
      from the offset bits.
      
      Note: R/O FOLL_GET references were never really reliable, especially when
      taking one on a shared page and then writing to the page (e.g., GUP after
      fork()).  FOLL_GET, including R/W references, were never really reliable
      once fork was involved (e.g., GUP before fork(), GUP during fork()).  KSM
      steps back in case it stumbles over unexpected references and is,
      therefore, fine.
      
      [david@redhat.com: fix SWP_STABLE_WRITES test]
        Link: https://lkml.kernel.org/r/ac725bcb-313a-4fff-250a-68ba9a8f85fb@redhat.comLink: https://lkml.kernel.org/r/20220329164329.208407-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220329164329.208407-2-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1493a191
    • David Hildenbrand's avatar
      mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning · b6a2619c
      David Hildenbrand authored
      Let's verify when (un)pinning anonymous pages that we always deal with
      exclusive anonymous pages, which guarantees that we'll have a reliable
      PIN, meaning that we cannot end up with the GUP pin being inconsistent
      with he pages mapped into the page tables due to a COW triggered by a
      write fault.
      
      When pinning pages, after conditionally triggering GUP unsharing of
      possibly shared anonymous pages, we should always only see exclusive
      anonymous pages.  Note that anonymous pages that are mapped writable must
      be marked exclusive, otherwise we'd have a BUG.
      
      When pinning during ordinary GUP, simply add a check after our conditional
      GUP-triggered unsharing checks.  As we know exactly how the page is
      mapped, we know exactly in which page we have to check for
      PageAnonExclusive().
      
      When pinning via GUP-fast we have to be careful, because we can race with
      fork(): verify only after we made sure via the seqcount that we didn't
      race with concurrent fork() that we didn't end up pinning a possibly
      shared anonymous page.
      
      Similarly, when unpinning, verify that the pages are still marked as
      exclusive: otherwise something turned the pages possibly shared, which can
      result in random memory corruptions, which we really want to catch.
      
      With only the pinned pages at hand and not the actual page table entries
      we have to be a bit careful: hugetlb pages are always mapped via a single
      logical page table entry referencing the head page and PG_anon_exclusive
      of the head page applies.  Anon THP are a bit more complicated, because we
      might have obtained the page reference either via a PMD or a PTE --
      depending on the mapping type we either have to check PageAnonExclusive of
      the head page (PMD-mapped THP) or the tail page (PTE-mapped THP) applies:
      as we don't know and to make our life easier, check that either is set.
      
      Take care to not verify in case we're unpinning during GUP-fast because we
      detected concurrent fork(): we might stumble over an anonymous page that
      is now shared.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-18-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b6a2619c
    • David Hildenbrand's avatar
      mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page · a7f22660
      David Hildenbrand authored
      Whenever GUP currently ends up taking a R/O pin on an anonymous page that
      might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
      on the page table entry will end up replacing the mapped anonymous page
      due to COW, resulting in the GUP pin no longer being consistent with the
      page actually mapped into the page table.
      
      The possible ways to deal with this situation are:
       (1) Ignore and pin -- what we do right now.
       (2) Fail to pin -- which would be rather surprising to callers and
           could break user space.
       (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
           pins.
      
      Let's implement 3) because it provides the clearest semantics and allows
      for checking in unpin_user_pages() and friends for possible BUGs: when
      trying to unpin a page that's no longer exclusive, clearly something went
      very wrong and might result in memory corruptions that might be hard to
      debug.  So we better have a nice way to spot such issues.
      
      This change implies that whenever user space *wrote* to a private mapping
      (IOW, we have an anonymous page mapped), that GUP pins will always remain
      consistent: reliable R/O GUP pins of anonymous pages.
      
      As a side note, this commit fixes the COW security issue for hugetlb with
      FOLL_PIN as documented in:
        https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      The vmsplice reproducer still applies, because vmsplice uses FOLL_GET
      instead of FOLL_PIN.
      
      Note that follow_huge_pmd() doesn't apply because we cannot end up in
      there with FOLL_PIN.
      
      This commit is heavily based on prototype patches by Andrea.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-17-david@redhat.comSigned-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Co-developed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a7f22660
    • David Hildenbrand's avatar
      mm: support GUP-triggered unsharing of anonymous pages · c89357e2
      David Hildenbrand authored
      Whenever GUP currently ends up taking a R/O pin on an anonymous page that
      might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
      on the page table entry will end up replacing the mapped anonymous page
      due to COW, resulting in the GUP pin no longer being consistent with the
      page actually mapped into the page table.
      
      The possible ways to deal with this situation are:
       (1) Ignore and pin -- what we do right now.
       (2) Fail to pin -- which would be rather surprising to callers and
           could break user space.
       (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
           pins.
      
      We want to implement 3) because it provides the clearest semantics and
      allows for checking in unpin_user_pages() and friends for possible BUGs:
      when trying to unpin a page that's no longer exclusive, clearly something
      went very wrong and might result in memory corruptions that might be hard
      to debug.  So we better have a nice way to spot such issues.
      
      To implement 3), we need a way for GUP to trigger unsharing:
      FAULT_FLAG_UNSHARE.  FAULT_FLAG_UNSHARE is only applicable to R/O mapped
      anonymous pages and resembles COW logic during a write fault.  However, in
      contrast to a write fault, GUP-triggered unsharing will, for example,
      still maintain the write protection.
      
      Let's implement FAULT_FLAG_UNSHARE by hooking into the existing write
      fault handlers for all applicable anonymous page types: ordinary pages,
      THP and hugetlb.
      
      * If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that has been
        marked exclusive in the meantime by someone else, there is nothing to do.
      * If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that's not
        marked exclusive, it will try detecting if the process is the exclusive
        owner. If exclusive, it can be set exclusive similar to reuse logic
        during write faults via page_move_anon_rmap() and there is nothing
        else to do; otherwise, we either have to copy and map a fresh,
        anonymous exclusive page R/O (ordinary pages, hugetlb), or split the
        THP.
      
      This commit is heavily based on patches by Andrea.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-16-david@redhat.comSigned-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Co-developed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c89357e2
    • David Hildenbrand's avatar
      mm/gup: disallow follow_page(FOLL_PIN) · 8909691b
      David Hildenbrand authored
      We want to change the way we handle R/O pins on anonymous pages that might
      be shared: if we detect a possibly shared anonymous page -- mapped R/O and
      not !PageAnonExclusive() -- we want to trigger unsharing via a page fault,
      resulting in an exclusive anonymous page that can be pinned reliably
      without getting replaced via COW on the next write fault.
      
      However, the required page fault will be problematic for follow_page(): in
      contrast to ordinary GUP, follow_page() doesn't trigger faults internally.
      So we would have to end up failing a R/O pin via follow_page(), although
      there is something mapped R/O into the page table, which might be rather
      surprising.
      
      We don't seem to have follow_page(FOLL_PIN) users, and it's a purely
      internal MM function.  Let's just make our life easier and the semantics
      of follow_page() clearer by just disallowing FOLL_PIN for follow_page()
      completely.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-15-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      8909691b
    • David Hildenbrand's avatar
      mm/rmap: fail try_to_migrate() early when setting a PMD migration entry fails · 7f5abe60
      David Hildenbrand authored
      Let's fail right away in case we cannot clear PG_anon_exclusive because
      the anon THP may be pinned.  Right now, we continue trying to install
      migration entries and the caller of try_to_migrate() will realize that the
      page is still mapped and has to restore the migration entries.  Let's just
      fail fast just like for PTE migration entries.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-14-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Suggested-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7f5abe60
    • David Hildenbrand's avatar
      mm: remember exclusively mapped anonymous pages with PG_anon_exclusive · 6c287605
      David Hildenbrand authored
      Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
      exclusive, and use that information to make GUP pins reliable and stay
      consistent with the page mapped into the page table even if the page table
      entry gets write-protected.
      
      With that information at hand, we can extend our COW logic to always reuse
      anonymous pages that are exclusive.  For anonymous pages that might be
      shared, the existing logic applies.
      
      As already documented, PG_anon_exclusive is usually only expressive in
      combination with a page table entry.  Especially PTE vs.  PMD-mapped
      anonymous pages require more thought, some examples: due to mremap() we
      can easily have a single compound page PTE-mapped into multiple page
      tables exclusively in a single process -- multiple page table locks apply.
      Further, due to MADV_WIPEONFORK we might not necessarily write-protect
      all PTEs, and only some subpages might be pinned.  Long story short: once
      PTE-mapped, we have to track information about exclusivity per sub-page,
      but until then, we can just track it for the compound page in the head
      page and not having to update a whole bunch of subpages all of the time
      for a simple PMD mapping of a THP.
      
      For simplicity, this commit mostly talks about "anonymous pages", while
      it's for THP actually "the part of an anonymous folio referenced via a
      page table entry".
      
      To not spill PG_anon_exclusive code all over the mm code-base, we let the
      anon rmap code to handle all PG_anon_exclusive logic it can easily handle.
      
      If a writable, present page table entry points at an anonymous (sub)page,
      that (sub)page must be PG_anon_exclusive.  If GUP wants to take a reliably
      pin (FOLL_PIN) on an anonymous page references via a present page table
      entry, it must only pin if PG_anon_exclusive is set for the mapped
      (sub)page.
      
      This commit doesn't adjust GUP, so this is only implicitly handled for
      FOLL_WRITE, follow-up commits will teach GUP to also respect it for
      FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
      reliable.
      
      Whenever an anonymous page is to be shared (fork(), KSM), or when
      temporarily unmapping an anonymous page (swap, migration), the relevant
      PG_anon_exclusive bit has to be cleared to mark the anonymous page
      possibly shared.  Clearing will fail if there are GUP pins on the page:
      
      * For fork(), this means having to copy the page and not being able to
        share it.  fork() protects against concurrent GUP using the PT lock and
        the src_mm->write_protect_seq.
      
      * For KSM, this means sharing will fail.  For swap this means, unmapping
        will fail, For migration this means, migration will fail early.  All
        three cases protect against concurrent GUP using the PT lock and a
        proper clear/invalidate+flush of the relevant page table entry.
      
      This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
      pinned page gets mapped R/O and the successive write fault ends up
      replacing the page instead of reusing it.  It improves the situation for
      O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
      fork() is *not* involved, however swapout and fork() are still
      problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
      users will fix the issue for them.
      
      I. Details about basic handling
      
      I.1. Fresh anonymous pages
      
      page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
      given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
      the mechanism fresh anonymous pages come into life (besides migration code
      where we copy the page->mapping), all fresh anonymous pages will start out
      as exclusive.
      
      I.2. COW reuse handling of anonymous pages
      
      When a COW handler stumbles over a (sub)page that's marked exclusive, it
      simply reuses it.  Otherwise, the handler tries harder under page lock to
      detect if the (sub)page is exclusive and can be reused.  If exclusive,
      page_move_anon_rmap() will mark the given (sub)page exclusive.
      
      Note that hugetlb code does not yet check for PageAnonExclusive(), as it
      still uses the old COW logic that is prone to the COW security issue
      because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
      pages are a scarce resource.
      
      I.3. Migration handling
      
      try_to_migrate() has to try marking an exclusive anonymous page shared via
      page_try_share_anon_rmap().  If it fails because there are GUP pins on the
      page, unmap fails.  migrate_vma_collect_pmd() and
      __split_huge_pmd_locked() are handled similarly.
      
      Writable migration entries implicitly point at shared anonymous pages. 
      For readable migration entries that information is stored via a new
      "readable-exclusive" migration entry, specific to anonymous pages.
      
      When restoring a migration entry in remove_migration_pte(), information
      about exlusivity is detected via the migration entry type, and
      RMAP_EXCLUSIVE is set accordingly for
      page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.
      
      I.4. Swapout handling
      
      try_to_unmap() has to try marking the mapped page possibly shared via
      page_try_share_anon_rmap().  If it fails because there are GUP pins on the
      page, unmap fails.  For now, information about exclusivity is lost.  In
      the future, we might want to remember that information in the swap entry
      in some cases, however, it requires more thought, care, and a way to store
      that information in swap entries.
      
      I.5. Swapin handling
      
      do_swap_page() will never stumble over exclusive anonymous pages in the
      swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
      to detect manually if an anonymous page is exclusive and has to set
      RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
      
      I.6. THP handling
      
      __split_huge_pmd_locked() has to move the information about exclusivity
      from the PMD to the PTEs.
      
      a) In case we have a readable-exclusive PMD migration entry, simply
         insert readable-exclusive PTE migration entries.
      
      b) In case we have a present PMD entry and we don't want to freeze
         ("convert to migration entries"), simply forward PG_anon_exclusive to
         all sub-pages, no need to temporarily clear the bit.
      
      c) In case we have a present PMD entry and want to freeze, handle it
         similar to try_to_migrate(): try marking the page shared first.  In
         case we fail, we ignore the "freeze" instruction and simply split
         ordinarily.  try_to_migrate() will properly fail because the THP is
         still mapped via PTEs.
      
      When splitting a compound anonymous folio (THP), the information about
      exclusivity is implicitly handled via the migration entries: no need to
      replicate PG_anon_exclusive manually.
      
      I.7.  fork() handling fork() handling is relatively easy, because
      PG_anon_exclusive is only expressive for some page table entry types.
      
      a) Present anonymous pages
      
      page_try_dup_anon_rmap() will mark the given subpage shared -- which will
      fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
      PMD to handle it on the PTE level).
      
      Note that device exclusive entries are just a pointer at a PageAnon()
      page.  fork() will first convert a device exclusive entry to a present
      page table and handle it just like present anonymous pages.
      
      b) Device private entry
      
      Device private entries point at PageAnon() pages that cannot be mapped
      directly and, therefore, cannot get pinned.
      
      page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
      fail because they cannot get pinned.
      
      c) HW poison entries
      
      PG_anon_exclusive will remain untouched and is stale -- the page table
      entry is just a placeholder after all.
      
      d) Migration entries
      
      Writable and readable-exclusive entries are converted to readable entries:
      possibly shared.
      
      I.8. mprotect() handling
      
      mprotect() only has to properly handle the new readable-exclusive
      migration entry:
      
      When write-protecting a migration entry that points at an anonymous page,
      remember the information about exclusivity via the "readable-exclusive"
      migration entry type.
      
      II. Migration and GUP-fast
      
      Whenever replacing a present page table entry that maps an exclusive
      anonymous page by a migration entry, we have to mark the page possibly
      shared and synchronize against GUP-fast by a proper clear/invalidate+flush
      to make the following scenario impossible:
      
      1. try_to_migrate() places a migration entry after checking for GUP pins
         and marks the page possibly shared.
      
      2. GUP-fast pins the page due to lack of synchronization
      
      3. fork() converts the "writable/readable-exclusive" migration entry into a
         readable migration entry
      
      4. Migration fails due to the GUP pin (failing to freeze the refcount)
      
      5. Migration entries are restored. PG_anon_exclusive is lost
      
      -> We have a pinned page that is not marked exclusive anymore.
      
      Note that we move information about exclusivity from the page to the
      migration entry as it otherwise highly overcomplicates fork() and
      PTE-mapping a THP.
      
      III. Swapout and GUP-fast
      
      Whenever replacing a present page table entry that maps an exclusive
      anonymous page by a swap entry, we have to mark the page possibly shared
      and synchronize against GUP-fast by a proper clear/invalidate+flush to
      make the following scenario impossible:
      
      1. try_to_unmap() places a swap entry after checking for GUP pins and
         clears exclusivity information on the page.
      
      2. GUP-fast pins the page due to lack of synchronization.
      
      -> We have a pinned page that is not marked exclusive anymore.
      
      If we'd ever store information about exclusivity in the swap entry,
      similar to migration handling, the same considerations as in II would
      apply.  This is future work.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6c287605