1. 13 May, 2022 15 commits
  2. 10 May, 2022 25 commits
    • NeilBrown's avatar
      VFS: add FMODE_CAN_ODIRECT file flag · a2ad63da
      NeilBrown authored
      Currently various places test if direct IO is possible on a file by
      checking for the existence of the direct_IO address space operation.
      This is a poor choice, as the direct_IO operation may not be used - it is
      only used if the generic_file_*_iter functions are called for direct IO
      and some filesystems - particularly NFS - don't do this.
      
      Instead, introduce a new f_mode flag: FMODE_CAN_ODIRECT and change the
      various places to check this (avoiding pointer dereferences).
      do_dentry_open() will set this flag if ->direct_IO is present, so
      filesystems do not need to be changed.
      
      NFS *is* changed, to set the flag explicitly and discard the direct_IO
      entry in the address_space_operations for files.
      
      Other filesystems which currently use noop_direct_IO could usefully be
      changed to set this flag instead.
      
      Link: https://lkml.kernel.org/r/164859778128.29473.15189737957277399416.stgit@noble.brownReviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a2ad63da
    • NeilBrown's avatar
      MM: handle THP in swap_*page_fs() - count_vm_events() · 6341a446
      NeilBrown authored
      We need to use count_swpout_vm_event() for sio_write_complete() to get
      correct counting.
      
      Note that THP swap in (if it ever happens) is current accounted 1 for each
      page, whether HUGE or normal.  This is different from swap-out accounting.
      
      This patch should be squashed into
          MM: handle THP in swap_*page_fs()
      
      Link: https://lkml.kernel.org/r/165146948934.24404.5909750610552745025@noble.neil.brown.nameSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reported-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6341a446
    • NeilBrown's avatar
      mm: handle THP in swap_*page_fs() · a1a0dfd5
      NeilBrown authored
      Pages passed to swap_readpage()/swap_writepage() are not necessarily all
      the same size - there may be transparent-huge-pages involves.
      
      The BIO paths of swap_*page() handle this correctly, but the SWP_FS_OPS
      path does not.
      
      So we need to use thp_size() to find the size, not just assume PAGE_SIZE,
      and we need to track the total length of the request, not just assume it
      is "page * PAGE_SIZE".
      
      Link: https://lkml.kernel.org/r/165119301488.15698.9457662928942765453.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reported-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a1a0dfd5
    • NeilBrown's avatar
      mm: submit multipage write for SWP_FS_OPS swap-space · 2282679f
      NeilBrown authored
      swap_writepage() is given one page at a time, but may be called repeatedly
      in succession.
      
      For block-device swapspace, the blk_plug functionality allows the multiple
      pages to be combined together at lower layers.  That cannot be used for
      SWP_FS_OPS as blk_plug may not exist - it is only active when
      CONFIG_BLOCK=y.  Consequently all swap reads over NFS are single page
      reads.
      
      With this patch we pass a pointer-to-pointer via the wbc.  swap_writepage
      can store state between calls - much like the pointer passed explicitly to
      swap_readpage.  After calling swap_writepage() some number of times, the
      state will be passed to swap_write_unplug() which can submit the combined
      request.
      
      Link: https://lkml.kernel.org/r/164859778128.29473.5191868522654408537.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2282679f
    • NeilBrown's avatar
      mm: submit multipage reads for SWP_FS_OPS swap-space · 5169b844
      NeilBrown authored
      swap_readpage() is given one page at a time, but may be called repeatedly
      in succession.
      
      For block-device swap-space, the blk_plug functionality allows the
      multiple pages to be combined together at lower layers.  That cannot be
      used for SWP_FS_OPS as blk_plug may not exist - it is only active when
      CONFIG_BLOCK=y.  Consequently all swap reads over NFS are single page
      reads.
      
      With this patch we pass in a pointer-to-pointer when swap_readpage can
      store state between calls - much like the effect of blk_plug.  After
      calling swap_readpage() some number of times, the state will be passed to
      swap_read_unplug() which can submit the combined request.
      
      Link: https://lkml.kernel.org/r/164859778127.29473.14059420492644907783.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5169b844
    • NeilBrown's avatar
      doc: update documentation for swap_activate and swap_rw · cba738f6
      NeilBrown authored
      This documentation for ->swap_activate() has been out-of-date for a long
      time.  This patch updates it to match recent changes, and adds
      documentation for the associated ->swap_rw()
      
      Link: https://lkml.kernel.org/r/164859778126.29473.6778751233552859461.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      cba738f6
    • NeilBrown's avatar
      mm: perform async writes to SWP_FS_OPS swap-space using ->swap_rw · 7eadabc0
      NeilBrown authored
      This patch switches swap-out to SWP_FS_OPS swap-spaces to use ->swap_rw
      and makes the writes asynchronous, like they are for other swap spaces.
      
      To make it async we need to allocate the kiocb struct from a mempool. 
      This may block, but won't block as long as waiting for the write to
      complete.  At most it will wait for some previous swap IO to complete.
      
      Link: https://lkml.kernel.org/r/164859778126.29473.12399585304843922231.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7eadabc0
    • NeilBrown's avatar
      nfs: rename nfs_direct_IO and use as ->swap_rw · eb79f3af
      NeilBrown authored
      The nfs_direct_IO() exists to support SWAP IO, but hasn't worked for a
      while.  We now need a ->swap_rw function which behaves slightly
      differently, returning zero for success rather than a byte count.
      
      So modify nfs_direct_IO accordingly, rename it, and use it as the
      ->swap_rw function.
      
      Link: https://lkml.kernel.org/r/165119301493.15698.7491285551903597618.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> (on Renesas RSK+RZA1 with 32 MiB of SDRAM)
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      eb79f3af
    • NeilBrown's avatar
      mm: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space · e1209d3a
      NeilBrown authored
      swap currently uses ->readpage to read swap pages.  This can only request
      one page at a time from the filesystem, which is not most efficient.
      
      swap uses ->direct_IO for writes which while this is adequate is an
      inappropriate over-loading.  ->direct_IO may need to had handle allocate
      space for holes or other details that are not relevant for swap.
      
      So this patch introduces a new address_space operation: ->swap_rw.  In
      this patch it is used for reads, and a subsequent patch will switch writes
      to use it.
      
      No filesystem yet supports ->swap_rw, but that is not a problem because
      no filesystem actually works with filesystem-based swap.
      Only two filesystems set SWP_FS_OPS:
      - cifs sets the flag, but ->direct_IO always fails so swap cannot work.
      - nfs sets the flag, but ->direct_IO calls generic_write_checks()
        which has failed on swap files for several releases.
      
      To ensure that a NULL ->swap_rw isn't called, ->activate_swap() for both
      NFS and cifs are changed to fail if ->swap_rw is not set.  This can be
      removed if/when the function is added.
      
      Future patches will restore swap-over-NFS functionality.
      
      To submit an async read with ->swap_rw() we need to allocate a structure
      to hold the kiocb and other details.  swap_readpage() cannot handle
      transient failure, so we create a mempool to provide the structures.
      
      Link: https://lkml.kernel.org/r/164859778125.29473.13430559328221330589.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e1209d3a
    • NeilBrown's avatar
      mm: reclaim mustn't enter FS for SWP_FS_OPS swap-space · d791ea67
      NeilBrown authored
      If swap-out is using filesystem operations (SWP_FS_OPS), then it is not
      safe to enter the FS for reclaim.  So only down-grade the requirement for
      swap pages to __GFP_IO after checking that SWP_FS_OPS are not being used.
      
      This makes the calculation of "may_enter_fs" slightly more complex, so
      move it into a separate function.  with that done, there is little value
      in maintaining the bool variable any more.  So replace the may_enter_fs
      variable with a may_enter_fs() function.  This removes any risk for the
      variable becoming out-of-date.
      
      Link: https://lkml.kernel.org/r/164859778124.29473.16176717935781721855.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d791ea67
    • NeilBrown's avatar
      mm: move responsibility for setting SWP_FS_OPS to ->swap_activate · 4b60c0ff
      NeilBrown authored
      If a filesystem wishes to handle all swap IO itself (via ->direct_IO and
      ->readpage), rather than just providing devices addresses for
      submit_bio(), SWP_FS_OPS must be set.
      
      Currently the protocol for setting this it to have ->swap_activate return
      zero.  In that case SWP_FS_OPS is set, and add_swap_extent() is called for
      the entire file.
      
      This is a little clumsy as different return values for ->swap_activate
      have quite different meanings, and it makes it hard to search for which
      filesystems require SWP_FS_OPS to be set.
      
      So remove the special meaning of a zero return, and require the filesystem
      to set SWP_FS_OPS if it so desires, and to always call add_swap_extent()
      as required.
      
      Currently only NFS and CIFS return zero for add_swap_extent().
      
      Link: https://lkml.kernel.org/r/164859778123.29473.17908205846599043598.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4b60c0ff
    • NeilBrown's avatar
      mm: drop swap_dirty_folio · 4c4a7634
      NeilBrown authored
      folios that are written to swap are owned by the MM subsystem - not any
      filesystem.
      
      When such a folio is passed to a filesystem to be written out to a
      swap-file, the filesystem handles the data, but the folio itself does not
      belong to the filesystem.  So calling the filesystem's ->dirty_folio()
      address_space operation makes no sense.  This is for folios in the given
      address space, and a folio to be written to swap does not exist in the
      given address space.
      
      So drop swap_dirty_folio() which calls the address-space's
      ->dirty_folio(), and always use noop_dirty_folio(), which is appropriate
      for folios being swapped out.
      
      Link: https://lkml.kernel.org/r/164859778123.29473.6900942583784889976.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4c4a7634
    • NeilBrown's avatar
      mm: create new mm/swap.h header file · 014bb1de
      NeilBrown authored
      Patch series "MM changes to improve swap-over-NFS support".
      
      Assorted improvements for swap-via-filesystem.
      
      This is a resend of these patches, rebased on current HEAD.  The only
      substantial changes is that swap_dirty_folio has replaced
      swap_set_page_dirty.
      
      Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem.  It
      has previously worked for NFS but that broke a few releases back.  This
      series changes to use a new ->swap_rw rather than ->readpage and
      ->direct_IO.  It also makes other improvements.
      
      There is a companion series already in linux-next which fixes various
      issues with NFS.  Once both series land, a final patch is needed which
      changes NFS over to use ->swap_rw.
      
      
      This patch (of 10):
      
      Many functions declared in include/linux/swap.h are only used within mm/
      
      Create a new "mm/swap.h" and move some of these declarations there.
      Remove the redundant 'extern' from the function declarations.
      
      [akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
      Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
      Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brownSigned-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      014bb1de
    • Joel Savitz's avatar
      selftests: clarify common error when running gup_test · 17de1e55
      Joel Savitz authored
      The gup_test binary will fail showing only the output of perror("open") in
      the case that /sys/kernel/debug/gup_test is not found. This will almost
      always be due to CONFIG_GUP_TEST not being set, which enables
      compilation of a kernel that provides this file.
      
      Add a short error message to clarify this failure and point the user to
      the solution.
      
      Link: https://lkml.kernel.org/r/20220502224942.995427-1-jsavitz@redhat.comSigned-off-by: default avatarJoel Savitz <jsavitz@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Nico Pache <npache@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      17de1e55
    • Yury Norov's avatar
      mm/gup: fix comments to pin_user_pages_*() · 0768c8de
      Yury Norov authored
      pin_user_pages API forces FOLL_PIN in gup_flags, which means that the API
      requires struct page **pages to be provided (not NULL).  However, the
      comment to pin_user_pages() clearly allows for passing in a NULL @pages
      argument.
      
      Remove the incorrect comments, and add WARN_ON_ONCE(!pages) calls to
      enforce the API.
      
      It has been independently spotted by Minchan Kim and confirmed with
      John Hubbard:
      
      https://lore.kernel.org/all/YgWA0ghrrzHONehH@google.com/
      
      Link: https://lkml.kernel.org/r/20220422015839.1274328-1-yury.norov@gmail.comSigned-off-by: default avatarYury Norov (NVIDIA) <yury.norov@gmail.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0768c8de
    • David Hildenbrand's avatar
      powerpc/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE for book3s · bff9beaa
      David Hildenbrand authored
      Right now, the last 5 bits (0x1f) of the swap entry are used for the type
      and the bit before that (0x20) is used for _PAGE_SWP_SOFT_DIRTY.  We
      cannot use 0x40, as that collides with _RPAGE_RSV1 -- contained in
      _PAGE_HPTEFLAGS.  The next candidate would be _RPAGE_SW3 (0x200) -- which
      is used for _PAGE_SOFT_DIRTY for !swp ptes.
      
      So let's just use _PAGE_SOFT_DIRTY for _PAGE_SWP_SOFT_DIRTY (to make it
      easier to grasp) and use 0x20 now for _PAGE_SWP_EXCLUSIVE.
      
      Link: https://lkml.kernel.org/r/20220329164329.208407-9-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      bff9beaa
    • David Hildenbrand's avatar
      powerpc/pgtable: remove _PAGE_BIT_SWAP_TYPE for book3s · 03ac1b71
      David Hildenbrand authored
      The swap type is simply stored in bits 0x1f of the swap pte.  Let's
      simplify by just getting rid of _PAGE_BIT_SWAP_TYPE.  It's not like that
      we can simply change it: _PAGE_SWP_SOFT_DIRTY would suddenly fall into
      _RPAGE_RSV1, which isn't possible and would make the
      BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_SOFT_DIRTY) angry.
      
      While at it, make it clearer which bit we're actually using for
      _PAGE_SWP_SOFT_DIRTY by just using the proper define and introduce and use
      SWP_TYPE_MASK.
      
      Link: https://lkml.kernel.org/r/20220329164329.208407-8-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      03ac1b71
    • David Hildenbrand's avatar
      s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE · 92cd58bd
      David Hildenbrand authored
      Let's use bit 52, which is unused.
      
      Link: https://lkml.kernel.org/r/20220329164329.208407-7-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      92cd58bd
    • David Hildenbrand's avatar
      s390/pgtable: cleanup description of swp pte layout · 8043d26c
      David Hildenbrand authored
      Bit 52 and bit 55 don't have to be zero: they only trigger a
      translation-specifiation exception if the PTE is marked as valid, which is
      not the case for swap ptes.
      
      Document which bits are used for what, and which ones are unused.
      
      Link: https://lkml.kernel.org/r/20220329164329.208407-6-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      8043d26c
    • David Hildenbrand's avatar
      arm64/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE · 570ef363
      David Hildenbrand authored
      Let's use one of the type bits: core-mm only supports 5, so there is no
      need to consume 6.
      
      Note that we might be able to reuse bit 1, but reusing bit 1 turned out
      problematic in the past for PROT_NONE handling; so let's play safe and use
      another bit.
      
      Link: https://lkml.kernel.org/r/20220329164329.208407-5-david@redhat.comReviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      570ef363
    • David Hildenbrand's avatar
      x86/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE · 3e20889c
      David Hildenbrand authored
      Let's use bit 3 to remember PG_anon_exclusive in swap ptes.
      
      [david@redhat.com: fix 32-bit swap layout]
        Link: https://lkml.kernel.org/r/d875c292-46b3-f281-65ae-71d0b0c6f592@redhat.com
      Link: https://lkml.kernel.org/r/20220329164329.208407-4-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3e20889c
    • David Hildenbrand's avatar
      mm/debug_vm_pgtable: add tests for __HAVE_ARCH_PTE_SWP_EXCLUSIVE · 210d1e8a
      David Hildenbrand authored
      Let's test that __HAVE_ARCH_PTE_SWP_EXCLUSIVE works as expected.
      
      Link: https://lkml.kernel.org/r/20220329164329.208407-3-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      210d1e8a
    • David Hildenbrand's avatar
      mm/swap: remember PG_anon_exclusive via a swp pte bit · 1493a191
      David Hildenbrand authored
      Patch series "mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages", v2.
      
      This series fixes memory corruptions when a GUP R/W reference (FOLL_WRITE
      | FOLL_GET) was taken on an anonymous page and COW logic fails to detect
      exclusivity of the page to then replacing the anonymous page by a copy in
      the page table: The GUP reference lost synchronicity with the pages mapped
      into the page tables.  This series focuses on x86, arm64, s390x and
      ppc64/book3s -- other architectures are fairly easy to support by
      implementing __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
      
      This primarily fixes the O_DIRECT memory corruptions that can happen on
      concurrent swapout, whereby we lose DMA reads to a page (modifying the
      user page by writing to it).
      
      O_DIRECT currently uses FOLL_GET for short-term (!FOLL_LONGTERM) DMA
      from/to a user page.  In the long run, we want to convert it to properly
      use FOLL_PIN, and John is working on it, but that might take a while and
      might not be easy to backport.  In the meantime, let's restore what used
      to work before we started modifying our COW logic: make R/W FOLL_GET
      references reliable as long as there is no fork() after GUP involved.
      
      This is just the natural follow-up of part 2, that will also further
      reduce "wrong COW" on the swapin path, for example, when we cannot remove
      a page from the swapcache due to concurrent writeback, or if we have two
      threads faulting on the same swapped-out page.  Fixing O_DIRECT is just a
      nice side-product
      
      This issue, including other related COW issues, has been summarized in [3]
      under 2):
      "
        2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET)
      
        It was discovered that we can create a memory corruption by reading a
        file via O_DIRECT to a part (e.g., first 512 bytes) of a page,
        concurrently writing to an unrelated part (e.g., last byte) of the same
        page, and concurrently write-protecting the page via clear_refs
        SOFTDIRTY tracking [6].
      
        For the reproducer, the issue is that O_DIRECT grabs a reference of the
        target page (via FOLL_GET) and clear_refs write-protects the relevant
        page table entry. On successive write access to the page from the
        process itself, we wrongly COW the page when resolving the write fault,
        resulting in a loss of synchronicity and consequently a memory corruption.
      
        While some people might think that using clear_refs in this combination
        is a corner cases, it turns out to be a more generic problem unfortunately.
      
        For example, it was just recently discovered that we can similarly
        create a memory corruption without clear_refs, simply by concurrently
        swapping out the buffer pages [7]. Note that we nowadays even use the
        swap infrastructure in Linux without an actual swap disk/partition: the
        prime example is zram which is enabled as default under Fedora [10].
      
        The root issue is that a write-fault on a page that has additional
        references results in a COW and thereby a loss of synchronicity
        and consequently a memory corruption if two parties believe they are
        referencing the same page.
      "
      
      We don't particularly care about R/O FOLL_GET references: they were never
      reliable and O_DIRECT doesn't expect to observe modifications from a page
      after DMA was started.
      
      Note that:
      * this only fixes the issue on x86, arm64, s390x and ppc64/book3s
        ("enterprise architectures"). Other architectures have to implement
        __HAVE_ARCH_PTE_SWP_EXCLUSIVE to achieve the same.
      * this does *not * consider any kind of fork() after taking the reference:
        fork() after GUP never worked reliably with FOLL_GET.
      * Not losing PG_anon_exclusive during swapout was the last remaining
        piece. KSM already makes sure that there are no other references on
        a page before considering it for sharing. Page migration maintains
        PG_anon_exclusive and simply fails when there are additional references
        (freezing the refcount fails). Only swapout code dropped the
        PG_anon_exclusive flag because it requires more work to remember +
        restore it.
      
      With this series in place, most COW issues of [3] are fixed on said
      architectures. Other architectures can implement
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE fairly easily.
      
      [1] https://lkml.kernel.org/r/20220329160440.193848-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      
      
      This patch (of 8):
      
      Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
      it.  We do this, to keep fork() logic on swap entries easy and efficient:
      for example, if we wouldn't clear it when unmapping, we'd have to lookup
      the page in the swapcache for each and every swap entry during fork() and
      clear PG_anon_exclusive if set.
      
      Instead, we want to store that information directly in the swap pte,
      protected by the page table lock, similarly to how we handle
      SWP_MIGRATION_READ_EXCLUSIVE for migration entries.  However, for actual
      swap entries, we don't want to mess with the swap type (e.g., still one
      bit) because it overcomplicates swap code.
      
      In try_to_unmap(), we already reject to unmap in case the page might be
      pinned, because we must not lose PG_anon_exclusive on pinned pages ever. 
      Checking if there are other unexpected references reliably *before*
      completely unmapping a page is unfortunately not really possible: THP
      heavily overcomplicate the situation.  Once fully unmapped it's easier --
      we, for example, make sure that there are no unexpected references *after*
      unmapping a page before starting writeback on that page.
      
      So, we currently might end up unmapping a page and clearing
      PG_anon_exclusive if that page has additional references, for example, due
      to a FOLL_GET.
      
      do_swap_page() has to re-determine if a page is exclusive, which will
      easily fail if there are other references on a page, most prominently GUP
      references via FOLL_GET.  This can currently result in memory corruptions
      when taking a FOLL_GET | FOLL_WRITE reference on a page even when fork()
      is never involved: try_to_unmap() will succeed, and when refaulting the
      page, it cannot be marked exclusive and will get replaced by a copy in the
      page tables on the next write access, resulting in writes via the GUP
      reference to the page being lost.
      
      In an ideal world, everybody that uses GUP and wants to modify page
      content, such as O_DIRECT, would properly use FOLL_PIN.  However, that
      conversion will take a while.  It's easier to fix what used to work in the
      past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive.  In addition,
      by remembering PG_anon_exclusive we can further reduce unnecessary COW in
      some cases, so it's the natural thing to do.
      
      So let's transfer the PG_anon_exclusive information to the swap pte and
      store it via an architecture-dependant pte bit; use that information when
      restoring the swap pte in do_swap_page() and unuse_pte().  During fork(),
      we simply have to clear the pte bit and are done.
      
      Of course, there is one corner case to handle: swap backends that don't
      support concurrent page modifications while the page is under writeback. 
      Special case these, and drop the exclusive marker.  Add a comment why that
      is just fine (also, reuse_swap_page() would have done the same in the
      past).
      
      In the future, we'll hopefully have all architectures support
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty stubs
      and the define completely.  Then, we can also convert
      SWP_MIGRATION_READ_EXCLUSIVE.  For architectures it's fairly easy to
      support: either simply use a yet unused pte bit that can be used for swap
      entries, steal one from the arch type bits if they exceed 5, or steal one
      from the offset bits.
      
      Note: R/O FOLL_GET references were never really reliable, especially when
      taking one on a shared page and then writing to the page (e.g., GUP after
      fork()).  FOLL_GET, including R/W references, were never really reliable
      once fork was involved (e.g., GUP before fork(), GUP during fork()).  KSM
      steps back in case it stumbles over unexpected references and is,
      therefore, fine.
      
      [david@redhat.com: fix SWP_STABLE_WRITES test]
        Link: https://lkml.kernel.org/r/ac725bcb-313a-4fff-250a-68ba9a8f85fb@redhat.comLink: https://lkml.kernel.org/r/20220329164329.208407-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220329164329.208407-2-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1493a191
    • David Hildenbrand's avatar
      mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning · b6a2619c
      David Hildenbrand authored
      Let's verify when (un)pinning anonymous pages that we always deal with
      exclusive anonymous pages, which guarantees that we'll have a reliable
      PIN, meaning that we cannot end up with the GUP pin being inconsistent
      with he pages mapped into the page tables due to a COW triggered by a
      write fault.
      
      When pinning pages, after conditionally triggering GUP unsharing of
      possibly shared anonymous pages, we should always only see exclusive
      anonymous pages.  Note that anonymous pages that are mapped writable must
      be marked exclusive, otherwise we'd have a BUG.
      
      When pinning during ordinary GUP, simply add a check after our conditional
      GUP-triggered unsharing checks.  As we know exactly how the page is
      mapped, we know exactly in which page we have to check for
      PageAnonExclusive().
      
      When pinning via GUP-fast we have to be careful, because we can race with
      fork(): verify only after we made sure via the seqcount that we didn't
      race with concurrent fork() that we didn't end up pinning a possibly
      shared anonymous page.
      
      Similarly, when unpinning, verify that the pages are still marked as
      exclusive: otherwise something turned the pages possibly shared, which can
      result in random memory corruptions, which we really want to catch.
      
      With only the pinned pages at hand and not the actual page table entries
      we have to be a bit careful: hugetlb pages are always mapped via a single
      logical page table entry referencing the head page and PG_anon_exclusive
      of the head page applies.  Anon THP are a bit more complicated, because we
      might have obtained the page reference either via a PMD or a PTE --
      depending on the mapping type we either have to check PageAnonExclusive of
      the head page (PMD-mapped THP) or the tail page (PTE-mapped THP) applies:
      as we don't know and to make our life easier, check that either is set.
      
      Take care to not verify in case we're unpinning during GUP-fast because we
      detected concurrent fork(): we might stumble over an anonymous page that
      is now shared.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-18-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b6a2619c
    • David Hildenbrand's avatar
      mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page · a7f22660
      David Hildenbrand authored
      Whenever GUP currently ends up taking a R/O pin on an anonymous page that
      might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
      on the page table entry will end up replacing the mapped anonymous page
      due to COW, resulting in the GUP pin no longer being consistent with the
      page actually mapped into the page table.
      
      The possible ways to deal with this situation are:
       (1) Ignore and pin -- what we do right now.
       (2) Fail to pin -- which would be rather surprising to callers and
           could break user space.
       (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
           pins.
      
      Let's implement 3) because it provides the clearest semantics and allows
      for checking in unpin_user_pages() and friends for possible BUGs: when
      trying to unpin a page that's no longer exclusive, clearly something went
      very wrong and might result in memory corruptions that might be hard to
      debug.  So we better have a nice way to spot such issues.
      
      This change implies that whenever user space *wrote* to a private mapping
      (IOW, we have an anonymous page mapped), that GUP pins will always remain
      consistent: reliable R/O GUP pins of anonymous pages.
      
      As a side note, this commit fixes the COW security issue for hugetlb with
      FOLL_PIN as documented in:
        https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      The vmsplice reproducer still applies, because vmsplice uses FOLL_GET
      instead of FOLL_PIN.
      
      Note that follow_huge_pmd() doesn't apply because we cannot end up in
      there with FOLL_PIN.
      
      This commit is heavily based on prototype patches by Andrea.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-17-david@redhat.comSigned-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Co-developed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a7f22660