Commit 932f4a63 authored by Ira Weiny's avatar Ira Weiny Committed by Linus Torvalds

mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM

Pach series "Add FOLL_LONGTERM to GUP fast and use it".

HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages.  These pages can be held for a significant time.  But
get_user_pages_fast() does not protect against mapping FS DAX pages.

Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks.  XDP has also
shown interest in using this functionality.[1]

In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.

[1] https://lkml.org/lkml/2019/3/19/939

"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move.  I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem.  Then I think we can change the flag to a
better name.

Secondly, it depends on how often you are registering memory.  I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance.  I don't have the numbers as the
tests for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

This patch (of 7):

This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast().  Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.

Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.

This patch does not change any functionality.  In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked.  However, callers of get_user_pages_fast()
were not "protected".

FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.

NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages.  This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.

As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.

In review[1] it was asked:

<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?

A couple of points.

First "longterm" is a relative thing and at this point is probably a
misnomer.  This is really flagging a pin which is going to be given to
hardware and can't move.  I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem.  Then I think we can
change the flag to a better name.

Second, It depends on how often you are registering memory.  I have spoken
with some RDMA users who consider MR in the performance path...  For the
overall application performance.  I don't have the numbers as the tests
for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

</quote>

[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

[ira.weiny@intel.com: v3]
  Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.comSigned-off-by: default avatarIra Weiny <ira.weiny@intel.com>
Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
parent a222f341
......@@ -141,8 +141,9 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
for (entry = 0; entry < entries; entry += chunk) {
unsigned long n = min(entries - entry, chunk);
ret = get_user_pages_longterm(ua + (entry << PAGE_SHIFT), n,
FOLL_WRITE, mem->hpages + entry, NULL);
ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
FOLL_WRITE | FOLL_LONGTERM,
mem->hpages + entry, NULL);
if (ret == n) {
pinned += n;
continue;
......
......@@ -295,10 +295,11 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
while (npages) {
down_read(&mm->mmap_sem);
ret = get_user_pages_longterm(cur_base,
ret = get_user_pages(cur_base,
min_t(unsigned long, npages,
PAGE_SIZE / sizeof (struct page *)),
gup_flags, page_list, NULL);
gup_flags | FOLL_LONGTERM,
page_list, NULL);
if (ret < 0) {
up_read(&mm->mmap_sem);
goto umem_release;
......
......@@ -114,10 +114,10 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
down_read(&current->mm->mmap_sem);
for (got = 0; got < num_pages; got += ret) {
ret = get_user_pages_longterm(start_page + got * PAGE_SIZE,
num_pages - got,
FOLL_WRITE | FOLL_FORCE,
p + got, NULL);
ret = get_user_pages(start_page + got * PAGE_SIZE,
num_pages - got,
FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE,
p + got, NULL);
if (ret < 0) {
up_read(&current->mm->mmap_sem);
goto bail_release;
......
......@@ -143,10 +143,11 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
ret = 0;
while (npages) {
ret = get_user_pages_longterm(cur_base,
min_t(unsigned long, npages,
PAGE_SIZE / sizeof(struct page *)),
gup_flags, page_list, NULL);
ret = get_user_pages(cur_base,
min_t(unsigned long, npages,
PAGE_SIZE / sizeof(struct page *)),
gup_flags | FOLL_LONGTERM,
page_list, NULL);
if (ret < 0)
goto out;
......
......@@ -186,12 +186,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages,
flags, dma->pages, NULL);
err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
flags | FOLL_LONGTERM, dma->pages, NULL);
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err,
dprintk(1, "get_user_pages: err=%d [%d]\n", err,
dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
......
......@@ -358,7 +358,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
down_read(&mm->mmap_sem);
if (mm == current->mm) {
ret = get_user_pages_longterm(vaddr, 1, flags, page, vmas);
ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
vmas);
} else {
ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
vmas, NULL);
......
......@@ -2697,8 +2697,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
ret = 0;
down_read(&current->mm->mmap_sem);
pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
pages, vmas);
pret = get_user_pages(ubuf, nr_pages,
FOLL_WRITE | FOLL_LONGTERM,
pages, vmas);
if (pret == nr_pages) {
/* don't support file backed memory */
for (j = 0; j < nr_pages; j++) {
......
......@@ -1505,19 +1505,6 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
struct page **pages, unsigned int gup_flags);
#if defined(CONFIG_FS_DAX) || defined(CONFIG_CMA)
long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas);
#else
static inline long get_user_pages_longterm(unsigned long start,
unsigned long nr_pages, unsigned int gup_flags,
struct page **pages, struct vm_area_struct **vmas)
{
return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
}
#endif /* CONFIG_FS_DAX */
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
......@@ -2583,6 +2570,34 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
#define FOLL_COW 0x4000 /* internal GUP flag */
#define FOLL_ANON 0x8000 /* don't do file mappings */
#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
/*
* NOTE on FOLL_LONGTERM:
*
* FOLL_LONGTERM indicates that the page will be held for an indefinite time
* period _often_ under userspace control. This is contrasted with
* iov_iter_get_pages() where usages which are transient.
*
* FIXME: For pages which are part of a filesystem, mappings are subject to the
* lifetime enforced by the filesystem and we need guarantees that longterm
* users like RDMA and V4L2 only establish mappings which coordinate usage with
* the filesystem. Ideas for this coordination include revoking the longterm
* pin, delaying writeback, bounce buffer page writeback, etc. As FS DAX was
* added after the problem with filesystems was found FS DAX VMAs are
* specifically failed. Filesystem pages are still subject to bugs and use of
* FOLL_LONGTERM should be avoided on those pages.
*
* FIXME: Also NOTE that FOLL_LONGTERM is not supported in every GUP call.
* Currently only get_user_pages() and get_user_pages_fast() support this flag
* and calls to get_user_pages_[un]locked are specifically not allowed. This
* is due to an incompatibility with the FS DAX check and
* FAULT_FLAG_ALLOW_RETRY
*
* In the CMA case: longterm pins in a CMA region would unnecessarily fragment
* that region. And so CMA attempts to migrate the page before pinning when
* FOLL_LONGTERM is specified.
*/
static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
{
......
......@@ -1018,6 +1018,15 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
int *locked)
{
/*
* FIXME: Current FOLL_LONGTERM behavior is incompatible with
* FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
* vmas. As there are no users of this flag in this call we simply
* disallow this option for now.
*/
if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
return -EINVAL;
return __get_user_pages_locked(current, current->mm, start, nr_pages,
pages, NULL, locked,
gup_flags | FOLL_TOUCH);
......@@ -1046,6 +1055,15 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
int locked = 1;
long ret;
/*
* FIXME: Current FOLL_LONGTERM behavior is incompatible with
* FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
* vmas. As there are no users of this flag in this call we simply
* disallow this option for now.
*/
if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
return -EINVAL;
down_read(&mm->mmap_sem);
ret = __get_user_pages_locked(current, mm, start, nr_pages, pages, NULL,
&locked, gup_flags | FOLL_TOUCH);
......@@ -1116,32 +1134,22 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas, int *locked)
{
/*
* FIXME: Current FOLL_LONGTERM behavior is incompatible with
* FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
* vmas. As there are no users of this flag in this call we simply
* disallow this option for now.
*/
if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
return -EINVAL;
return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
locked,
gup_flags | FOLL_TOUCH | FOLL_REMOTE);
}
EXPORT_SYMBOL(get_user_pages_remote);
/*
* This is the same as get_user_pages_remote(), just with a
* less-flexible calling convention where we assume that the task
* and mm being operated on are the current task's and don't allow
* passing of a locked parameter. We also obviously don't pass
* FOLL_REMOTE in here.
*/
long get_user_pages(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas)
{
return __get_user_pages_locked(current, current->mm, start, nr_pages,
pages, vmas, NULL,
gup_flags | FOLL_TOUCH);
}
EXPORT_SYMBOL(get_user_pages);
#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
#ifdef CONFIG_FS_DAX
static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
{
long i;
......@@ -1160,12 +1168,6 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
}
return false;
}
#else
static inline bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
{
return false;
}
#endif
#ifdef CONFIG_CMA
static struct page *new_non_cma_page(struct page *page, unsigned long private)
......@@ -1219,10 +1221,13 @@ static struct page *new_non_cma_page(struct page *page, unsigned long private)
return __alloc_pages_node(nid, gfp_mask, 0);
}
static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
unsigned int gup_flags,
static long check_and_migrate_cma_pages(struct task_struct *tsk,
struct mm_struct *mm,
unsigned long start,
unsigned long nr_pages,
struct page **pages,
struct vm_area_struct **vmas)
struct vm_area_struct **vmas,
unsigned int gup_flags)
{
long i;
bool drain_allow = true;
......@@ -1278,10 +1283,14 @@ static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
putback_movable_pages(&cma_page_list);
}
/*
* We did migrate all the pages, Try to get the page references again
* migrating any new CMA pages which we failed to isolate earlier.
* We did migrate all the pages, Try to get the page references
* again migrating any new CMA pages which we failed to isolate
* earlier.
*/
nr_pages = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
nr_pages = __get_user_pages_locked(tsk, mm, start, nr_pages,
pages, vmas, NULL,
gup_flags);
if ((nr_pages > 0) && migrate_allow) {
drain_allow = true;
goto check_again;
......@@ -1291,66 +1300,101 @@ static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
return nr_pages;
}
#else
static inline long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
unsigned int gup_flags,
struct page **pages,
struct vm_area_struct **vmas)
static long check_and_migrate_cma_pages(struct task_struct *tsk,
struct mm_struct *mm,
unsigned long start,
unsigned long nr_pages,
struct page **pages,
struct vm_area_struct **vmas,
unsigned int gup_flags)
{
return nr_pages;
}
#endif
/*
* This is the same as get_user_pages() in that it assumes we are
* operating on the current task's mm, but it goes further to validate
* that the vmas associated with the address range are suitable for
* longterm elevated page reference counts. For example, filesystem-dax
* mappings are subject to the lifetime enforced by the filesystem and
* we need guarantees that longterm users like RDMA and V4L2 only
* establish mappings that have a kernel enforced revocation mechanism.
*
* "longterm" == userspace controlled elevated page count lifetime.
* Contrast this to iov_iter_get_pages() usages which are transient.
* __gup_longterm_locked() is a wrapper for __get_user_pages_locked which
* allows us to process the FOLL_LONGTERM flag.
*/
long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas_arg)
static long __gup_longterm_locked(struct task_struct *tsk,
struct mm_struct *mm,
unsigned long start,
unsigned long nr_pages,
struct page **pages,
struct vm_area_struct **vmas,
unsigned int gup_flags)
{
struct vm_area_struct **vmas = vmas_arg;
unsigned long flags;
struct vm_area_struct **vmas_tmp = vmas;
unsigned long flags = 0;
long rc, i;
if (!pages)
return -EINVAL;
if (!vmas) {
vmas = kcalloc(nr_pages, sizeof(struct vm_area_struct *),
GFP_KERNEL);
if (!vmas)
return -ENOMEM;
if (gup_flags & FOLL_LONGTERM) {
if (!pages)
return -EINVAL;
if (!vmas_tmp) {
vmas_tmp = kcalloc(nr_pages,
sizeof(struct vm_area_struct *),
GFP_KERNEL);
if (!vmas_tmp)
return -ENOMEM;
}
flags = memalloc_nocma_save();
}
flags = memalloc_nocma_save();
rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
memalloc_nocma_restore(flags);
if (rc < 0)
goto out;
rc = __get_user_pages_locked(tsk, mm, start, nr_pages, pages,
vmas_tmp, NULL, gup_flags);
if (check_dax_vmas(vmas, rc)) {
for (i = 0; i < rc; i++)
put_page(pages[i]);
rc = -EOPNOTSUPP;
goto out;
if (gup_flags & FOLL_LONGTERM) {
memalloc_nocma_restore(flags);
if (rc < 0)
goto out;
if (check_dax_vmas(vmas_tmp, rc)) {
for (i = 0; i < rc; i++)
put_page(pages[i]);
rc = -EOPNOTSUPP;
goto out;
}
rc = check_and_migrate_cma_pages(tsk, mm, start, rc, pages,
vmas_tmp, gup_flags);
}
rc = check_and_migrate_cma_pages(start, rc, gup_flags, pages, vmas);
out:
if (vmas != vmas_arg)
kfree(vmas);
if (vmas_tmp != vmas)
kfree(vmas_tmp);
return rc;
}
EXPORT_SYMBOL(get_user_pages_longterm);
#endif /* CONFIG_FS_DAX */
#else /* !CONFIG_FS_DAX && !CONFIG_CMA */
static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
struct mm_struct *mm,
unsigned long start,
unsigned long nr_pages,
struct page **pages,
struct vm_area_struct **vmas,
unsigned int flags)
{
return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
NULL, flags);
}
#endif /* CONFIG_FS_DAX || CONFIG_CMA */
/*
* This is the same as get_user_pages_remote(), just with a
* less-flexible calling convention where we assume that the task
* and mm being operated on are the current task's and don't allow
* passing of a locked parameter. We also obviously don't pass
* FOLL_REMOTE in here.
*/
long get_user_pages(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas)
{
return __gup_longterm_locked(current, current->mm, start, nr_pages,
pages, vmas, gup_flags | FOLL_TOUCH);
}
EXPORT_SYMBOL(get_user_pages);
/**
* populate_vma_page_range() - populate a range of pages in the vma.
......
......@@ -54,8 +54,9 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
pages + i);
break;
case GUP_LONGTERM_BENCHMARK:
nr = get_user_pages_longterm(addr, nr, gup->flags & 1,
pages + i, NULL);
nr = get_user_pages(addr, nr,
(gup->flags & 1) | FOLL_LONGTERM,
pages + i, NULL);
break;
case GUP_BENCHMARK:
nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
......
......@@ -253,8 +253,8 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
return -ENOMEM;
down_read(&current->mm->mmap_sem);
npgs = get_user_pages_longterm(umem->address, umem->npgs,
gup_flags, &umem->pgs[0], NULL);
npgs = get_user_pages(umem->address, umem->npgs,
gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
up_read(&current->mm->mmap_sem);
if (npgs != umem->npgs) {
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment