1. 12 Jul, 2024 13 commits
    • fs/procfs: extract logic for getting VMA name constituents · acd4b2ec
      Andrii Nakryiko authored
      Patch series "ioctl()-based API to query VMAs from /proc/<pid>/maps", v6.
      
      Implement a binary ioctl()-based interface to the /proc/<pid>/maps file that
      allows applications to query VMA information more efficiently than reading
      *all* VMAs nonselectively through the text-based /proc/<pid>/maps interface.
      
      Patch #2 goes into a lot of details and background on some common patterns
      of using /proc/<pid>/maps in the area of performance profiling and
      subsequent symbolization of captured stack traces.  As mentioned in that
      patch, patterns of VMA querying can differ depending on specific use case,
      but can generally be grouped into two main categories: the need to query a
      small subset of VMAs covering a given batch of addresses, or
      reading/storing/caching all (typically, executable) VMAs upfront for later
      processing.
      
      The new PROCMAP_QUERY ioctl() API added in this patch set was motivated by
      the former pattern of usage.  Earlier revisions had a patch adding a tool
      that faithfully reproduces an efficient VMA matching pass of a symbolizer,
      collecting a subset of covering VMAs for a given set of addresses as
      efficiently as possible.  This tool served both as a testing ground, as
      well as a benchmarking tool.  It implements everything both for currently
      existing text-based /proc/<pid>/maps interface, as well as for newly-added
      PROCMAP_QUERY ioctl().  This revision dropped the tool from the patch set
      and, once the API lands upstream, this tool might be added separately on
      Github as an example.
      
      Based on discussion on earlier revisions of this patch set, it turned out
      that this ioctl() API is competitive with the highly-optimized text-based
      pre-processing pattern that the perf tool uses.  Based on perf discussion,
      this revision adds more flexibility in specifying a subset of VMAs that
      are of interest.  Now it's possible to specify desired permissions of VMAs
      (e.g., request only executable ones) and/or restrict to only a subset of
      VMAs that have file backing.  This further improves the efficiency when
      using this new API thanks to more selective (executable VMAs only)
      querying.
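      For illustration, here is a minimal userspace sketch of the intended usage,
      assuming the struct and flag names used in this revision of the series
      (struct procmap_query and the PROCMAP_QUERY_* flags from <linux/fs.h>); the
      final UAPI may differ in details:

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>   /* struct procmap_query, PROCMAP_QUERY */

        /* Find the executable, file-backed VMA covering addr (or the next one). */
        static int query_vma(int maps_fd, __u64 addr)
        {
                struct procmap_query q;
                char name[256];

                memset(&q, 0, sizeof(q));
                q.size = sizeof(q);             /* for UAPI extensibility checks */
                q.query_addr = addr;
                q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA |
                                PROCMAP_QUERY_VMA_EXECUTABLE |
                                PROCMAP_QUERY_FILE_BACKED_VMA;
                q.vma_name_addr = (__u64)(unsigned long)name;
                q.vma_name_size = sizeof(name);

                if (ioctl(maps_fd, PROCMAP_QUERY, &q))
                        return -1;              /* e.g. errno == ENOENT: no match */

                printf("%llx-%llx %s\n", (unsigned long long)q.vma_start,
                       (unsigned long long)q.vma_end, name);
                return 0;
        }

        /* maps_fd = open("/proc/self/maps", O_RDONLY), possibly handed over to a
         * separate profiling agent process. */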
      
      In addition to a custom benchmarking tool, and experimental perf
      integration (available at [0]), Daniel Mueller has since also implemented
      an experimental integration into blazesym (see [1]), a library used for
      stack trace symbolization by our server fleet-wide profiler and another
      on-device profiler agent that runs on weaker ARM devices.  The latter
      ARM-based device profiler is especially sensitive to performance, and so
      we benchmarked and compared text-based /proc/<pid>/maps solution to the
      equivalent one using PROCMAP_QUERY ioctl().
      
      Results are very encouraging, giving us 5x improvement for end-to-end
      so-called "address normalization" pass, which is the part of the
      symbolization process that happens locally on ARM device, before being
      sent out for further heavier-weight processing on more powerful remote
      server.  Note that this is not an artificial microbenchmark.  It's a full
      end-to-end API call being measured with real-world data on real-world
      device.
      
        TEXT-BASED
        ==========
        Benchmarking main/normalize_process_no_build_ids_uncached_maps
        main/normalize_process_no_build_ids_uncached_maps
      	  time:   [49.777 µs 49.982 µs 50.250 µs]
      
        IOCTL-BASED
        ===========
        Benchmarking main/normalize_process_no_build_ids_uncached_maps
        main/normalize_process_no_build_ids_uncached_maps
      	  time:   [10.328 µs 10.391 µs 10.457 µs]
      	  change: [−79.453% −79.304% −79.166%] (p = 0.00 < 0.02)
      	  Performance has improved.
      
      You can see above the drop from 50µs down to 10µs for exactly the same
      amount of work, with the same data and target process.

      With the aforementioned custom tool, we see about a 40x improvement (it
      might vary a bit, depending on the specific captured set of addresses).  And
      even for the perf-based benchmark it's on par or slightly ahead when using
      permission-based filtering (fetching only executable VMAs).
      
      Earlier revisions attempted to use per-VMA locking, if the kernel was
      compiled with CONFIG_PER_VMA_LOCK=y, but it turned out that anon_vma_name() is not
      yet compatible with per-VMA locking and assumes mmap_lock to be taken,
      which makes the use of per-VMA locking for this API premature.  It was
      agreed ([2]) to continue for now with just mmap_lock, but the code
      structure is such that it should be easy to add per-VMA locking support
      once all the pieces are ready.
      
      One thing that did not change is that this new API is an ioctl() command on
      the /proc/<pid>/maps file.  An ioctl()-based API on top of pidfd was
      considered, but it has its own downsides.  Implementing ioctl() directly on
      pidfd would cause access permission checks on every single ioctl(), which
      leads to performance concerns and potential spam of capable() audit
      messages.  It also prevents a nice pattern, possible with /proc/<pid>/maps,
      in which an application opens the /proc/self/maps FD (requiring no
      additional capabilities) and passes this FD to a profiling agent for
      querying.  To achieve a similar pattern, a new file would have to be created
      from the pidfd just for VMA querying, which is considered inferior to just
      querying the /proc/<pid>/maps FD as proposed in the current approach.  These
      aspects were discussed in the hallway track at the recent LSF/MM/BPF 2024,
      and sticking to a procfs ioctl() was the final agreement we arrived at.
      
        [0] https://github.com/anakryiko/linux/commits/procfs-proc-maps-ioctl-v2/
        [1] https://github.com/libbpf/blazesym/pull/675
        [2] https://lore.kernel.org/bpf/7rm3izyq2vjp5evdjc7c6z4crdd3oerpiknumdnmmemwyiwx7t@hleldw7iozi3/
      
      
      This patch (of 6):
      
      Extract the generic logic that fetches the relevant pieces of data used to
      describe a VMA's name.  This could be just some string (either a special
      constant or user-provided), a string with some formatted wrapping text
      (e.g., "[anon_shmem:<something>]"), or, commonly, a file path.  The
      seq_file-based logic has different methods to handle all three cases, but
      they are currently mixed in with extracting the underlying sources of data.
      
      This patch splits this into data fetching and data formatting, so that
      data fetching can be reused later on.
      
      There should be no functional changes.
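      Roughly, the resulting split might look like the following sketch (the
      helper name and signature here are illustrative of the approach, not
      necessarily the exact code):

        /* Data fetching: decide which of the three sources applies. */
        static void get_vma_name(struct vm_area_struct *vma,
                                 const struct path **path,
                                 const char **name,
                                 const char **name_fmt)
        {
                *path = NULL;
                *name = NULL;
                *name_fmt = NULL;

                if (vma->vm_file) {
                        *path = &vma->vm_file->f_path; /* plain file path */
                        return;
                }
                /* ... special constants ("[heap]", "[stack]"), user-provided
                 * anon_vma_name(), possibly with a wrapping format such as
                 * "[anon_shmem:%s]" ... */
        }

        /* Data formatting stays with the seq_file code, which picks the right
         * output method based on what was fetched: seq_path() for a path,
         * seq_printf() for a format + name, seq_puts() for a plain name. */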
      
      Link: https://lkml.kernel.org/r/20240627170900.1672542-1-andrii@kernel.org
      Link: https://lkml.kernel.org/r/20240627170900.1672542-2-andrii@kernel.org
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/udmabuf: add tests to verify data after page migration · 8d42e2a9
      Vivek Kasireddy authored
      Since the memfd pages associated with a udmabuf may be migrated as part of
      udmabuf create, we need to verify the data coherency after successful
      migration.  The new tests added in this patch try to do just that using 4k
      sized pages and also 2 MB sized huge pages for the memfd.
      
      Successful completion of the tests would mean that there is no disconnect
      between the memfd pages and the ones associated with a udmabuf.  And,
      these tests can also be augmented in the future to test newer udmabuf
      features (such as handling memfd hole punch).
      
      The idea for these tests comes from a patch by Mike Kravetz here:
      https://lists.freedesktop.org/archives/dri-devel/2023-June/410623.html
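      The coherency check itself might look roughly like the sketch below,
      assuming the existing UDMABUF_CREATE uAPI from <linux/udmabuf.h>; the
      function name and structure here are illustrative, not the literal
      selftest code:

        #include <string.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <linux/udmabuf.h>      /* struct udmabuf_create, UDMABUF_CREATE */

        /* memfd: a page-aligned, F_SEAL_SHRINK-sealed memfd already filled with
         * a known pattern; devfd: an open fd for /dev/udmabuf.  Returns 0 if the
         * memfd mapping and the dmabuf mapping still agree after udmabuf
         * creation (which may have migrated the backing pages). */
        static int verify_coherency(int devfd, int memfd, size_t size)
        {
                struct udmabuf_create create = {
                        .memfd = memfd, .offset = 0, .size = size,
                };
                char *mem, *buf;
                int dmabuf, ret;

                mem = mmap(NULL, size, PROT_READ, MAP_SHARED, memfd, 0);
                if (mem == MAP_FAILED)
                        return -1;

                dmabuf = ioctl(devfd, UDMABUF_CREATE, &create);
                if (dmabuf < 0)
                        return -1;

                buf = mmap(NULL, size, PROT_READ, MAP_SHARED, dmabuf, 0);
                if (buf == MAP_FAILED)
                        return -1;

                ret = memcmp(mem, buf, size) ? -1 : 0;  /* 0 == data coherent */
                munmap(buf, size);
                munmap(mem, size);
                return ret;
        }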
      
      v1->v2: (suggestions from Shuah)
      - Use ksft_* functions to print and capture results of tests
      - Use appropriate KSFT_* status codes for exit()
      - Add Mike Kravetz's suggested-by tag
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-10-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • udmabuf: pin the pages using memfd_pin_folios() API · c6a3194c
      Vivek Kasireddy authored
      Using memfd_pin_folios() will ensure that the pages are pinned
      correctly using FOLL_PIN. And, this also ensures that we don't
      accidentally break features such as memory hotunplug as it would
      not allow pinning pages in the movable zone.
      
      Using this new API also simplifies the code as we no longer have
      to deal with extracting individual pages from their mappings or
      handle shmem and hugetlb cases separately.
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-9-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • udmabuf: convert udmabuf driver to use folios · 5e72b2b4
      Vivek Kasireddy authored
      This is mainly a preparatory patch to use memfd_pin_folios() API for
      pinning folios.  Using folios instead of pages makes sense as the udmabuf
      driver needs to handle both shmem and hugetlb cases.  And, using the
      memfd_pin_folios() API makes this easier as we no longer need to
      separately handle shmem vs hugetlb cases in the udmabuf driver.
      
      Note that the function vmap_udmabuf() still needs a list of pages, so we
      collect all the head pages into a local array in this case.
      
      Other changes in this patch include the addition of helpers for checking
      the memfd seals and exporting dmabuf.  Moving code from udmabuf_create()
      into these helpers improves readability given that udmabuf_create() is a
      bit long.
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-8-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • udmabuf: add back support for mapping hugetlb pages · 0c8b91ef
      Vivek Kasireddy authored
      A user or admin can configure a VMM (Qemu) Guest's memory to be backed by
      hugetlb pages for various reasons.  However, a Guest OS would still
      allocate (and pin) buffers that are backed by regular 4k sized pages.  In
      order to map these buffers and create dma-bufs for them on the Host, we
      first need to find the hugetlb pages where the buffer allocations are
      located and then determine the offsets of individual chunks (within those
      pages) and use this information to eventually populate a scatterlist.
      
      Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options
      were passed to the Host kernel and Qemu was launched with these
      relevant options: qemu-system-x86_64 -m 4096m....
      -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
      -display gtk,gl=on
      -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
      -machine memory-backend=mem1
      
      Replacing -display gtk,gl=on with -display gtk,gl=off above would
      exercise the mmap handler.
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-7-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com> (v2)
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • udmabuf: use vmf_insert_pfn and VM_PFNMAP for handling mmap · 7d79cd78
      Vivek Kasireddy authored
      Add VM_PFNMAP to vm_flags in the mmap handler to ensure that the mappings
      would be managed without using struct page.
      
      And, in the vm_fault handler, use vmf_insert_pfn to share the page's pfn
      to userspace instead of directly sharing the page (via struct page *).
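      A sketch of the resulting pattern (struct fields and function names are
      illustrative; at this point in the series the driver still tracks struct
      pages rather than folios):

        static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
        {
                struct vm_area_struct *vma = vmf->vma;
                struct udmabuf *ubuf = vma->vm_private_data;
                pgoff_t pgoff = vmf->pgoff;

                if (pgoff >= ubuf->pagecount)
                        return VM_FAULT_SIGBUS;

                /* Share the pfn, not the struct page itself. */
                return vmf_insert_pfn(vma, vmf->address,
                                      page_to_pfn(ubuf->pages[pgoff]));
        }

        static const struct vm_operations_struct udmabuf_vm_ops = {
                .fault = udmabuf_vm_fault,
        };

        static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
        {
                struct udmabuf *ubuf = buf->priv;

                if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
                        return -EINVAL;

                vma->vm_ops = &udmabuf_vm_ops;
                vma->vm_private_data = ubuf;
                /* Mappings are managed without struct page. */
                vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
                return 0;
        }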
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-6-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • udmabuf: add CONFIG_MMU dependency · 725553d2
      Arnd Bergmann authored
      There is no !CONFIG_MMU version of vmf_insert_pfn():
      
      arm-linux-gnueabi-ld: drivers/dma-buf/udmabuf.o: in function `udmabuf_vm_fault':
      udmabuf.c:(.text+0xaa): undefined reference to `vmf_insert_pfn'
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-5-vivek.kasireddy@intel.com
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: introduce memfd_pin_folios() for pinning memfd folios · 89c1905d
      Vivek Kasireddy authored
      For drivers that would like to longterm-pin the folios associated with a
      memfd, the memfd_pin_folios() API provides an option to not only pin the
      folios via FOLL_PIN but also to check and migrate them if they reside in
      movable zone or CMA block.  This API currently works with memfds but it
      should work with any files that belong to either shmemfs or hugetlbfs. 
      Files belonging to other filesystems are rejected for now.
      
      The folios need to be located first before pinning them via FOLL_PIN.  If
      they are found in the page cache, they can be immediately pinned. 
      Otherwise, they need to be allocated using the filesystem specific APIs
      and then pinned.
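      A hedged sketch of the API shape and a typical caller (the exact prototype
      and header location may differ from what is shown here):

        /* Pin the folios backing memfd in the inclusive byte range [start, end],
         * storing up to max_folios folio pointers and the page offset of the
         * first pinned folio.  Returns the number of folios pinned, or a
         * negative errno (e.g. -EINVAL for files that are neither shmem nor
         * hugetlbfs, or for an out-of-range end offset). */
        long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
                              struct folio **folios, unsigned int max_folios,
                              pgoff_t *offset);

        /* Typical driver usage: */
                nr = memfd_pin_folios(memfd, 0, size - 1, folios, max, &pgoff);
                if (nr < 0)
                        return nr;
                /* ... set up longterm DMA using folios[0 .. nr-1] ... */
                unpin_folios(folios, nr);       /* when done */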
      
      [akpm@linux-foundation.org: improve the CONFIG_MMU=n situation, per SeongJae]
      [vivek.kasireddy@intel.com: return -EINVAL if the end offset is greater than the size of memfd]
        Link: https://lkml.kernel.org/r/IA0PR11MB71850525CBC7D541CAB45DF1F8DB2@IA0PR11MB7185.namprd11.prod.outlook.com
      Link: https://lkml.kernel.org/r/20240624063952.1572359-4-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> (v2)
      Reviewed-by: David Hildenbrand <david@redhat.com> (v3)
      Reviewed-by: Christoph Hellwig <hch@lst.de> (v6)
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: introduce check_and_migrate_movable_folios() · 53ba78de
      Vivek Kasireddy authored
      This helper is the folio equivalent of check_and_migrate_movable_pages(). 
      Therefore, all the rules that apply to check_and_migrate_movable_pages()
      also apply to this one as well.  Currently, this helper is only used by
      memfd_pin_folios().
      
      This patch also includes changes to rename and convert the internal
      functions collect_longterm_unpinnable_pages() and
      migrate_longterm_unpinnable_pages() to work on folios.  As a result,
      check_and_migrate_movable_pages() is now a wrapper around
      check_and_migrate_movable_folios().
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-3-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: introduce unpin_folio/unpin_folios helpers · 6cc04054
      Vivek Kasireddy authored
      Patch series "mm/gup: Introduce memfd_pin_folios() for pinning memfd
      folios", v16.
      
      Currently, some drivers (e.g., udmabuf) that want to longterm-pin the
      pages/folios associated with a memfd do so by simply taking a reference
      on them.  This is not desirable because the pages/folios may reside in
      the movable zone or a CMA block.
      
      Therefore, having drivers use memfd_pin_folios() API ensures that the
      folios are appropriately pinned via FOLL_PIN for longterm DMA.
      
      This patchset also introduces a few helpers and converts the Udmabuf
      driver to use folios and memfd_pin_folios() API to longterm-pin the folios
      for DMA.  Two new Udmabuf selftests are also included to test the driver
      and the new API.
      
      
      This patch (of 9):
      
      These helpers are the folio versions of unpin_user_page/unpin_user_pages. 
      They are currently only useful for unpinning folios pinned by
      memfd_pin_folios() or other associated routines.  However, they could find
      new uses in the future, when more and more folio-only helpers are added to
      GUP.
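      For reference, the expected shape of the helpers (a sketch; the caller side
      is shown in the memfd_pin_folios() example above):

        /* Folio counterparts of unpin_user_page()/unpin_user_pages(): release
         * one FOLL_PIN reference per folio (no subpage sanity checks yet, for
         * the reason described below). */
        void unpin_folio(struct folio *folio);
        void unpin_folios(struct folio **folios, unsigned long nfolios);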
      
      We should probably sanity check the folio as part of unpin similar to how
      it is done in unpin_user_page/unpin_user_pages but we cannot cleanly do
      that at the moment without also checking the subpage.  Therefore, sanity
      checking needs to be added to these routines once we have a way to
      determine if any given folio is anon-exclusive (via a per folio
      AnonExclusive flag).
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-1-vivek.kasireddy@intel.com
      Link: https://lkml.kernel.org/r/20240624063952.1572359-2-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/zswap: use only one pool in zswap · 8edc9c4e
      Chengming Zhou authored
      Zswap uses 32 pools to work around the locking scalability problem in zswap
      backends (mainly zsmalloc nowadays), which brings its own problems like
      memory waste and more memory fragmentation.

      Testing results show that we can get nearly the same performance with only
      one pool in zswap after changing zsmalloc to use a per-size_class lock
      instead of the pool spinlock.
      
      Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and
      zswap shrinker enabled with 10GB swapfile on ext4.
      
                                      real    user    sys
      6.10.0-rc3                      138.18  1241.38 1452.73
      6.10.0-rc3-onepool              149.45  1240.45 1844.69
      6.10.0-rc3-onepool-perclass     138.23  1242.37 1469.71
      
      Doing the same testing with zbud shows slightly worse performance, as
      expected, since we don't do any locking optimization for zbud.  I think
      that's acceptable since zsmalloc has become a lot more popular than other
      backends, and we may want to support only zsmalloc in the future.
      
                                      real    user    sys
      6.10.0-rc3-zbud			138.23  1239.58 1430.09
      6.10.0-rc3-onepool-zbud		139.64  1241.37 1516.59
      
      [chengming.zhou@linux.dev: fix error handling in zswap_pool_create(), per Dan Carpenter]
        Link: https://lkml.kernel.org/r/20240621-zsmalloc-lock-mm-everything-v2-2-d30e9cd2b793@linux.dev
      [chengming.zhou@linux.dev: fix error handling again in zswap_pool_create(), per Yosry]
        Link: https://lkml.kernel.org/r/20240625-zsmalloc-lock-mm-everything-v3-2-ad941699cb61@linux.dev
      Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-2-5e5081ea11b3@linux.dev
      Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Chengming Zhou <zhouchengming@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/zsmalloc: change back to per-size_class lock · 64bd0197
      Chengming Zhou authored
      Patch series "mm/zsmalloc: change back to per-size_class lock, v2".
      
      Commit c0547d0b ("zsmalloc: consolidate zs_pool's migrate_lock and
      size_class's locks") changed per-size_class lock to pool spinlock to
      prepare reclaim support in zsmalloc.  Then reclaim support in zsmalloc had
      been dropped in favor of LRU reclaim in zswap, but this locking change had
      been left there.
      
      Obviously, the scalability of the pool spinlock is worse than per-size_class
      locking.  And we have a workaround of using 32 pools in zswap to avoid this
      scalability problem, which brings its own problems like memory waste and
      more memory fragmentation.
      
      So this series changes back to using a per-size_class lock and uses testing
      data from a heavily stressed situation to verify that we can use only one
      pool in zswap.  Note we only test and care about the zsmalloc backend, which
      makes sense now since zsmalloc has become a lot more popular than other
      backends.
      
      Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and
      zswap shrinker enabled with 10GB swapfile on ext4.
      
      				real	user    sys
      6.10.0-rc3			138.18	1241.38 1452.73
      6.10.0-rc3-onepool		149.45	1240.45 1844.69
      6.10.0-rc3-onepool-perclass	138.23	1242.37 1469.71
      
      We can see from "sys" column that per-size_class locking with only one
      pool in zswap can have near performance with the current 32 pools.
      
      
      This patch (of 2):
      
      This patch is almost a revert of commit c0547d0b ("zsmalloc: consolidate
      zs_pool's migrate_lock and size_class's locks"), which changed the code to
      use a global pool->lock instead of the per-size_class lock and
      pool->migrate_lock, in preparation for supporting reclaim in zsmalloc.
      Reclaim in zsmalloc has since been dropped in favor of LRU reclaim in
      zswap.
      
      In theory, per-size_class locking is more fine-grained than the pool->lock,
      since a pool can have many size_classes.  As for the additional
      pool->migrate_lock, only free() and map() need to grab it to access a stable
      handle to get the zspage, and only in read-lock mode.
      
      Link: https://lkml.kernel.org/r/20240625-zsmalloc-lock-mm-everything-v3-0-ad941699cb61@linux.dev
      Link: https://lkml.kernel.org/r/20240621-zsmalloc-lock-mm-everything-v2-0-d30e9cd2b793@linux.dev
      Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-0-5e5081ea11b3@linux.dev
      Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-1-5e5081ea11b3@linux.dev
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb.c: undo errant change · 998d4e2c
      Andrew Morton authored
      During conflict resolution a line was unintentionally removed by a ksm.c
      patch.
      
      Link: https://lkml.kernel.org/r/85b0d694-d1ac-8e7a-2e50-1edc03eee21a@google.com
      Fixes: ac90c56b ("mm/ksm: refactor out try_to_merge_with_zero_page()")
      Reported-by: Hugh Dickins <hughd@google.com>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 10 Jul, 2024 20 commits
  3. 06 Jul, 2024 7 commits
    • mm: migrate: remove folio_migrate_copy() · 3f594937
      Kefeng Wang authored
      folio_migrate_copy() is just a wrapper around folio_copy() and
      folio_migrate_flags(); it is simple and only aio uses it for now, so unfold
      it and remove folio_migrate_copy().
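      For context, the wrapper being removed is essentially this (a sketch of the
      current helper), so the aio caller can simply call folio_copy() and
      folio_migrate_flags() directly:

        int folio_migrate_copy(struct folio *newfolio, struct folio *folio)
        {
                folio_copy(newfolio, folio);            /* copy the data */
                folio_migrate_flags(newfolio, folio);   /* copy flags/state */
                return MIGRATEPAGE_SUCCESS;
        }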
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-7-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs: hugetlbfs: support poisoned recover from hugetlbfs_migrate_folio() · f00b295b
      Kefeng Wang authored
      Similar to __migrate_folio(), use folio_mc_copy() in HugeTLB folio
      migration to avoid a panic when copying from a poisoned folio.
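      Conceptually, the HugeTLB migration path gains the same early,
      machine-check-aware copy (a sketch, not the literal diff):

        /* copy first and propagate a machine-check error instead of panicking */
        rc = folio_mc_copy(dst, src);
        if (unlikely(rc))
                return rc;              /* e.g. -EHWPOISON, the migration fails */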
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-6-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: migrate: support poisoned recover from migrate folio · 06091399
      Kefeng Wang authored
      Folio migration is widely used in the kernel: memory compaction, memory
      hotplug, soft offlining of pages, NUMA balancing, memory demotion/promotion,
      etc.  But once a poisoned source folio is accessed while migrating, the
      kernel will panic.
      
      There is a mechanism in the kernel to recover from uncorrectable memory
      errors, ARCH_HAS_COPY_MC, which is already used in other core-mm paths,
      e.g., CoW, khugepaged, coredump, ksm copy; see copy_mc_to_{user,kernel} and
      copy_mc_{user_}highpage callers.
      
      In order to support recovery from a poisoned folio copy during folio
      migration, we chose to make folio migration tolerant of memory failures and
      return an error instead; since folio migration is not guaranteed to succeed
      anyway, this avoids panics like the one shown below.
      
        CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
        pc : copy_page+0x10/0xc0
        lr : copy_highpage+0x38/0x50
        ...
        Call trace:
         copy_page+0x10/0xc0
         folio_copy+0x78/0x90
         migrate_folio_extra+0x54/0xa0
         move_to_new_folio+0xd8/0x1f0
         migrate_folio_move+0xb8/0x300
         migrate_pages_batch+0x528/0x788
         migrate_pages_sync+0x8c/0x258
         migrate_pages+0x440/0x528
         soft_offline_in_use_page+0x2ec/0x3c0
         soft_offline_page+0x238/0x310
         soft_offline_page_store+0x6c/0xc0
         dev_attr_store+0x20/0x40
         sysfs_kf_write+0x4c/0x68
         kernfs_fop_write_iter+0x130/0x1c8
         new_sync_write+0xa4/0x138
         vfs_write+0x238/0x2d8
         ksys_write+0x74/0x110
      
      Note that the folio copy is moved to the beginning of __migrate_folio(),
      which simplifies the error handling since there is no turning back once
      folio_migrate_mapping() returns success.  The downside is that the folio is
      copied even if folio_migrate_mapping() then fails; an optimization is to
      check that the source folio does not have extra refs before we do the folio
      copy.
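      Put together, __migrate_folio() ends up roughly shaped like the sketch
      below (helper names follow this series; large-folio details are omitted):

        static int __migrate_folio(struct address_space *mapping,
                                   struct folio *dst, struct folio *src,
                                   void *src_private, enum migrate_mode mode)
        {
                int rc, expected_count = folio_expected_refs(mapping, src);

                /* Check the refcount early so we don't copy for nothing. */
                if (folio_ref_count(src) != expected_count)
                        return -EAGAIN;

                /* Copy up front: a poisoned source fails the migration here... */
                rc = folio_mc_copy(dst, src);
                if (unlikely(rc))
                        return rc;

                /* ...because there is no turning back once the mapping moves. */
                rc = __folio_migrate_mapping(mapping, dst, src, expected_count);
                if (rc != MIGRATEPAGE_SUCCESS)
                        return rc;

                if (src_private)
                        folio_attach_private(dst, folio_detach_private(src));

                folio_migrate_flags(dst, src);
                return MIGRATEPAGE_SUCCESS;
        }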
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-5-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: migrate: split folio_migrate_mapping() · 52881539
      Kefeng Wang authored
      The folio refcount check is moved out for both the !mapping and mapping
      cases; also update the comments of folio_migrate_mapping() from page to
      folio.
      
      No functional change intended.
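      A sketch of the resulting shape (names follow this series; the large-folio
      handling of the !mapping case is omitted):

        int folio_migrate_mapping(struct address_space *mapping,
                                  struct folio *newfolio, struct folio *folio,
                                  int extra_count)
        {
                int expected_count = folio_expected_refs(mapping, folio) + extra_count;

                /* The refcount check now happens up front in both cases. */
                if (!mapping) {
                        if (folio_ref_count(folio) != expected_count)
                                return -EAGAIN;
                        return MIGRATEPAGE_SUCCESS;
                }

                if (folio_ref_count(folio) != expected_count)
                        return -EAGAIN;

                /* The mapping-specific work moves into an internal helper. */
                return __folio_migrate_mapping(mapping, newfolio, folio,
                                               expected_count);
        }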
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add folio_mc_copy() · 02f4ee5a
      Kefeng Wang authored
      Add an #MC variant of folio_copy() which uses copy_mc_highpage() so that
      an #MC hit during the folio copy can be handled; it will be used in folio
      migration soon.
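      A sketch of the helper (the merged version may differ in details such as
      cond_resched() placement):

        /* Like folio_copy(), but bail out with -EHWPOISON if a machine check
         * is hit while copying any of the pages. */
        int folio_mc_copy(struct folio *dst, struct folio *src)
        {
                long i, nr = folio_nr_pages(src);

                for (i = 0; i < nr; i++) {
                        if (copy_mc_highpage(folio_page(dst, i),
                                             folio_page(src, i)))
                                return -EHWPOISON;
                        cond_resched();
                }
                return 0;
        }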
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-3-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: move memory_failure_queue() into copy_mc_[user]_highpage() · 28bdacbc
      Kefeng Wang authored
      Patch series "mm: migrate: support poison recover from migrate folio", v5.
      
      Folio migration is widely used in the kernel: memory compaction, memory
      hotplug, soft offlining of pages, NUMA balancing, memory demotion/promotion,
      etc.  But once a poisoned source folio is accessed while migrating, the
      kernel will panic.
      
      There is a mechanism in the kernel to recover from uncorrectable memory
      errors, ARCH_HAS_COPY_MC (e.g., Machine Check Safe Memory Copy on x86),
      which is already used in NVDIMM and core-mm paths (e.g., CoW, khugepaged,
      coredump, ksm copy); see copy_mc_to_{user,kernel} and
      copy_mc_{user_}highpage callers.
      
      This series of patches provides a recovery mechanism for the folio copy in
      the widely used folio migration path.  Please note, because folio migration
      is not guaranteed to succeed anyway, we chose to make it tolerant of memory
      failures by adding folio_mc_copy(), an #MC version of folio_copy(); once a
      poisoned source folio is accessed, we return an error and make the folio
      migration fail, which avoids panics like the one shown below.
      
        CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
        pc : copy_page+0x10/0xc0
        lr : copy_highpage+0x38/0x50
        ...
        Call trace:
         copy_page+0x10/0xc0
         folio_copy+0x78/0x90
         migrate_folio_extra+0x54/0xa0
         move_to_new_folio+0xd8/0x1f0
         migrate_folio_move+0xb8/0x300
         migrate_pages_batch+0x528/0x788
         migrate_pages_sync+0x8c/0x258
         migrate_pages+0x440/0x528
         soft_offline_in_use_page+0x2ec/0x3c0
         soft_offline_page+0x238/0x310
         soft_offline_page_store+0x6c/0xc0
         dev_attr_store+0x20/0x40
         sysfs_kf_write+0x4c/0x68
         kernfs_fop_write_iter+0x130/0x1c8
         new_sync_write+0xa4/0x138
         vfs_write+0x238/0x2d8
         ksys_write+0x74/0x110
      
      
      This patch (of 5):
      
      There is a memory_failure_queue() call after copy_mc_[user]_highpage() in
      its callers (e.g., the CoW/KSM page copy paths); it is used to mark the
      source page as h/w poisoned and unmap it from other tasks.  The upcoming
      poison recovery from migrate folio will do a similar thing, so let's move
      the memory_failure_queue() into copy_mc_[user]_highpage() instead of adding
      it to each user; this should also enhance the handling of poisoned pages in
      khugepaged.
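      After the move, copy_mc_highpage() conceptually looks like the sketch below
      for ARCH_HAS_COPY_MC kernels (the !ARCH_HAS_COPY_MC variant stays a plain
      copy); details may differ from the final code:

        static inline int copy_mc_highpage(struct page *to, struct page *from)
        {
                unsigned long ret;
                char *vfrom, *vto;

                vfrom = kmap_local_page(from);
                vto = kmap_local_page(to);
                ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
                if (ret)
                        /* queue h/w poison handling here, not in every caller */
                        memory_failure_queue(page_to_pfn(from), 0);
                kunmap_local(vto);
                kunmap_local(vfrom);

                return ret;
        }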
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20240626085328.608006-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm: fix · 8ef6fd0e
      Andrew Morton authored
      crashes from deferred split racing folio migration", needed by "mm:
      migrate: split folio_migrate_mapping()".