Commit f41f2ed4 authored by Muchun Song, committed by Linus Torvalds

mm: hugetlb: free the vmemmap pages associated with each HugeTLB page

Every HugeTLB page is backed by more than one struct page structure.  We
__know__ that only the first 4 (__NR_USED_SUBPAGE) struct page structures
are used to store metadata associated with each HugeTLB page.

There are a lot of struct page structures associated with each HugeTLB
page.  For tail pages, the value of compound_head is the same, so we can
reuse the first page of the tail page structures: we remap the virtual
addresses of the remaining pages of tail page structures to that first
page and then free the page frames that backed them.  Therefore, we only
need to reserve two pages as vmemmap areas.
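For a concrete sense of the savings (assuming 4 KB base pages and a
64-byte struct page, as in a default x86-64 configuration): a 2 MB
HugeTLB page is described by 512 struct page structures, i.e. 32 KB or
8 vmemmap pages; keeping the two reserved pages leaves 6 vmemmap pages
(24 KB) per 2 MB HugeTLB page that can be remapped to the reused page
and returned to the buddy allocator.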

When we allocate a HugeTLB page from the buddy allocator, we can free some
of the vmemmap pages associated with it.  The most appropriate place to do
this is prep_new_huge_page().

free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages
associated with a HugeTLB page can be freed, returns zero for now, which
means the feature is disabled.  We will enable it once all the
infrastructure is in place.

[willy@infradead.org: fix documentation warning]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org

Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent cd39d4e9
include/linux/bootmem_info.h

@@ -2,7 +2,7 @@
 #ifndef __LINUX_BOOTMEM_INFO_H
 #define __LINUX_BOOTMEM_INFO_H
 
-#include <linux/mmzone.h>
+#include <linux/mm.h>
 
 /*
  * Types for free bootmem stored in page->lru.next. These have to be in
@@ -22,6 +22,27 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
 void get_page_bootmem(unsigned long info, struct page *page,
 		      unsigned long type);
 void put_page_bootmem(struct page *page);
+
+/*
+ * Any memory allocated via the memblock allocator and not via the
+ * buddy will be marked reserved already in the memmap. For those
+ * pages, we can call this function to free it to buddy allocator.
+ */
+static inline void free_bootmem_page(struct page *page)
+{
+	unsigned long magic = (unsigned long)page->freelist;
+
+	/*
+	 * The reserve_bootmem_region sets the reserved flag on bootmem
+	 * pages.
+	 */
+	VM_BUG_ON_PAGE(page_ref_count(page) != 2, page);
+
+	if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
+		put_page_bootmem(page);
+	else
+		VM_BUG_ON_PAGE(1, page);
+}
 #else
 static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
 {
@@ -35,6 +56,11 @@ static inline void get_page_bootmem(unsigned long info, struct page *page,
 			   unsigned long type)
 {
 }
+
+static inline void free_bootmem_page(struct page *page)
+{
+	free_reserved_page(page);
+}
 #endif
 
 #endif /* __LINUX_BOOTMEM_INFO_H */
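A note on the VM_BUG_ON_PAGE(page_ref_count(page) != 2, page) check above:
a bootmem-info page is expected to hold exactly two references at this
point, the initial reference a reserved page gets when the memmap is
initialised plus the one taken by get_page_bootmem() when the section
information was registered, so put_page_bootmem() can drop the bootmem
state and return the page frame to the buddy allocator.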
include/linux/mm.h

@@ -3076,6 +3076,9 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
 }
 #endif
 
+void vmemmap_remap_free(unsigned long start, unsigned long end,
+			unsigned long reuse);
+
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
mm/Makefile

@@ -75,6 +75,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
+obj-$(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)	+= hugetlb_vmemmap.o
 obj-$(CONFIG_NUMA)	+= mempolicy.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
mm/hugetlb.c

@@ -41,6 +41,7 @@
 #include <linux/node.h>
 #include <linux/page_owner.h>
 #include "internal.h"
+#include "hugetlb_vmemmap.h"
 
 int hugetlb_max_hstate __read_mostly;
 unsigned int default_hstate_idx;
@@ -1493,8 +1494,9 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid)
 	h->nr_huge_pages_node[nid]++;
 }
 
-static void __prep_new_huge_page(struct page *page)
+static void __prep_new_huge_page(struct hstate *h, struct page *page)
 {
+	free_huge_page_vmemmap(h, page);
 	INIT_LIST_HEAD(&page->lru);
 	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 	hugetlb_set_page_subpool(page, NULL);
@@ -1504,7 +1506,7 @@ static void __prep_new_huge_page(struct page *page)
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
-	__prep_new_huge_page(page);
+	__prep_new_huge_page(h, page);
 	spin_lock_irq(&hugetlb_lock);
 	__prep_account_new_huge_page(h, nid);
 	spin_unlock_irq(&hugetlb_lock);
@@ -2351,14 +2353,15 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 	/*
 	 * Before dissolving the page, we need to allocate a new one for the
-	 * pool to remain stable. Using alloc_buddy_huge_page() allows us to
-	 * not having to deal with prep_new_huge_page() and avoids dealing of any
-	 * counters. This simplifies and let us do the whole thing under the
-	 * lock.
+	 * pool to remain stable. Here, we allocate the page and 'prep' it
+	 * by doing everything but actually updating counters and adding to
+	 * the pool. This simplifies and let us do most of the processing
+	 * under the lock.
 	 */
 	new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
 	if (!new_page)
 		return -ENOMEM;
+	__prep_new_huge_page(h, new_page);
 
 retry:
 	spin_lock_irq(&hugetlb_lock);
@@ -2397,14 +2400,9 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		remove_hugetlb_page(h, old_page, false);
 
 		/*
-		 * new_page needs to be initialized with the standard hugetlb
-		 * state. This is normally done by prep_new_huge_page() but
-		 * that takes hugetlb_lock which is already held so we need to
-		 * open code it here.
-		 *
 		 * Reference count trick is needed because allocator gives us
 		 * referenced page but the pool requires pages with 0 refcount.
 		 */
-		__prep_new_huge_page(new_page);
 		__prep_account_new_huge_page(h, nid);
 		page_ref_dec(new_page);
 		enqueue_huge_page(h, new_page);
@@ -2420,7 +2418,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 
 free_new:
 	spin_unlock_irq(&hugetlb_lock);
-	__free_pages(new_page, huge_page_order(h));
+	update_and_free_page(h, new_page);
 
 	return ret;
 }
mm/hugetlb_vmemmap.c (new file; its diff is collapsed in this view)
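Because the new mm/hugetlb_vmemmap.c is collapsed in this view, here is a
minimal sketch of the helper it is expected to provide, reconstructed from
the commit message and from the declarations added elsewhere in this patch;
the RESERVE_VMEMMAP_* constants and the exact function bodies below are an
approximation, not a quote of the collapsed diff.

/* Sketch of mm/hugetlb_vmemmap.c (simplified; not the verbatim collapsed diff). */
#include <linux/mm.h>
#include "hugetlb_vmemmap.h"

#define RESERVE_VMEMMAP_NR	2U
#define RESERVE_VMEMMAP_SIZE	(RESERVE_VMEMMAP_NR << PAGE_SHIFT)

/*
 * How many vmemmap pages associated with a HugeTLB page can be freed.
 * Returns zero for now, which keeps the feature disabled until the rest
 * of the infrastructure is in place.
 */
static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
{
	return 0;
}

void free_huge_page_vmemmap(struct hstate *h, struct page *head)
{
	unsigned long vmemmap_addr = (unsigned long)head;
	unsigned long vmemmap_end, vmemmap_reuse;

	if (!free_vmemmap_pages_per_hpage(h))
		return;

	/* Skip the two reserved vmemmap pages; reuse the page just before. */
	vmemmap_addr += RESERVE_VMEMMAP_SIZE;
	vmemmap_end = vmemmap_addr +
		      ((unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT);
	vmemmap_reuse = vmemmap_addr - PAGE_SIZE;

	/*
	 * Remap [vmemmap_addr, vmemmap_end) to the page that vmemmap_reuse
	 * is mapped to, then hand the freed vmemmap page frames back.
	 */
	vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse);
}

With free_vmemmap_pages_per_hpage() returning zero, free_huge_page_vmemmap()
returns immediately, so this patch only wires up the mechanism; behaviour
does not change until the feature is enabled.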
mm/hugetlb_vmemmap.h (new file)

// SPDX-License-Identifier: GPL-2.0
/*
* Free some vmemmap pages of HugeTLB
*
* Copyright (c) 2020, Bytedance. All rights reserved.
*
* Author: Muchun Song <songmuchun@bytedance.com>
*/
#ifndef _LINUX_HUGETLB_VMEMMAP_H
#define _LINUX_HUGETLB_VMEMMAP_H
#include <linux/hugetlb.h>
#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
void free_huge_page_vmemmap(struct hstate *h, struct page *head);
#else
static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
{
}
#endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
#endif /* _LINUX_HUGETLB_VMEMMAP_H */
mm/sparse-vmemmap.c

@@ -27,8 +27,202 @@
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
#include <linux/sched.h>
#include <linux/pgtable.h>
#include <linux/bootmem_info.h>

#include <asm/dma.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>

/**
* struct vmemmap_remap_walk - walk vmemmap page table
*
* @remap_pte: called for each lowest-level entry (PTE).
* @reuse_page: the page which is reused for the tail vmemmap pages.
* @reuse_addr: the virtual address of the @reuse_page page.
* @vmemmap_pages: the list head of the vmemmap pages that can be freed.
*/
struct vmemmap_remap_walk {
void (*remap_pte)(pte_t *pte, unsigned long addr,
struct vmemmap_remap_walk *walk);
struct page *reuse_page;
unsigned long reuse_addr;
struct list_head *vmemmap_pages;
};
static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end,
struct vmemmap_remap_walk *walk)
{
pte_t *pte = pte_offset_kernel(pmd, addr);
/*
* The reuse_page is found 'first' in table walk before we start
* remapping (which is calling @walk->remap_pte).
*/
if (!walk->reuse_page) {
walk->reuse_page = pte_page(*pte);
/*
* Because the reuse address is part of the range that we are
* walking, skip the reuse address range.
*/
addr += PAGE_SIZE;
pte++;
}
for (; addr != end; addr += PAGE_SIZE, pte++)
walk->remap_pte(pte, addr, walk);
}
static void vmemmap_pmd_range(pud_t *pud, unsigned long addr,
unsigned long end,
struct vmemmap_remap_walk *walk)
{
pmd_t *pmd;
unsigned long next;
pmd = pmd_offset(pud, addr);
do {
BUG_ON(pmd_leaf(*pmd));
next = pmd_addr_end(addr, end);
vmemmap_pte_range(pmd, addr, next, walk);
} while (pmd++, addr = next, addr != end);
}
static void vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
unsigned long end,
struct vmemmap_remap_walk *walk)
{
pud_t *pud;
unsigned long next;
pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
vmemmap_pmd_range(pud, addr, next, walk);
} while (pud++, addr = next, addr != end);
}
static void vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
unsigned long end,
struct vmemmap_remap_walk *walk)
{
p4d_t *p4d;
unsigned long next;
p4d = p4d_offset(pgd, addr);
do {
next = p4d_addr_end(addr, end);
vmemmap_pud_range(p4d, addr, next, walk);
} while (p4d++, addr = next, addr != end);
}
static void vmemmap_remap_range(unsigned long start, unsigned long end,
struct vmemmap_remap_walk *walk)
{
unsigned long addr = start;
unsigned long next;
pgd_t *pgd;
VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
vmemmap_p4d_range(pgd, addr, next, walk);
} while (pgd++, addr = next, addr != end);
/*
* We only change the mapping of the vmemmap virtual address range
* [@start + PAGE_SIZE, end), so we only need to flush the TLB which
* belongs to the range.
*/
flush_tlb_kernel_range(start + PAGE_SIZE, end);
}
/*
* Free a vmemmap page. A vmemmap page can be allocated from the memblock
* allocator or buddy allocator. If the PG_reserved flag is set, it means
* that it allocated from the memblock allocator, just free it via the
* free_bootmem_page(). Otherwise, use __free_page().
*/
static inline void free_vmemmap_page(struct page *page)
{
if (PageReserved(page))
free_bootmem_page(page);
else
__free_page(page);
}
/* Free a list of the vmemmap pages */
static void free_vmemmap_page_list(struct list_head *list)
{
struct page *page, *next;
list_for_each_entry_safe(page, next, list, lru) {
list_del(&page->lru);
free_vmemmap_page(page);
}
}
static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
struct vmemmap_remap_walk *walk)
{
/*
* Remap the tail pages as read-only to catch illegal write operation
* to the tail pages.
*/
pgprot_t pgprot = PAGE_KERNEL_RO;
pte_t entry = mk_pte(walk->reuse_page, pgprot);
struct page *page = pte_page(*pte);
list_add(&page->lru, walk->vmemmap_pages);
set_pte_at(&init_mm, addr, pte, entry);
}
/**
* vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
* to the page which @reuse is mapped to, then free vmemmap
* which the range are mapped to.
* @start: start address of the vmemmap virtual address range that we want
* to remap.
* @end: end address of the vmemmap virtual address range that we want to
* remap.
* @reuse: reuse address.
*
* Note: This function depends on vmemmap being base page mapped. Please make
* sure that we disable PMD mapping of vmemmap pages when calling this function.
*/
void vmemmap_remap_free(unsigned long start, unsigned long end,
unsigned long reuse)
{
LIST_HEAD(vmemmap_pages);
struct vmemmap_remap_walk walk = {
.remap_pte = vmemmap_remap_pte,
.reuse_addr = reuse,
.vmemmap_pages = &vmemmap_pages,
};
/*
* In order to make remapping routine most efficient for the huge pages,
* the routine of vmemmap page table walking has the following rules
* (see more details from the vmemmap_pte_range()):
*
* - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
* should be continuous.
* - The @reuse address is part of the range [@reuse, @end) that we are
* walking which is passed to vmemmap_remap_range().
* - The @reuse address is the first in the complete range.
*
* So we need to make sure that @start and @reuse meet the above rules.
*/
BUG_ON(start - reuse != PAGE_SIZE);
vmemmap_remap_range(reuse, end, &walk);
free_vmemmap_page_list(&vmemmap_pages);
}

/*
 * Allocate a block of memory to be used to back the virtual memory map