Commit bcfe06bf authored by Roman Gushchin, committed by Alexei Starovoitov

mm: memcontrol: Use helpers to read page's memcg data

Patch series "mm: allow mapping accounted kernel pages to userspace", v6.

Currently a non-slab kernel page which has been charged to a memory cgroup
can't be mapped to userspace.  The underlying reason is simple: the
PageKmemcg flag is defined as a page type (like buddy, offline, etc.), so
it takes a bit from page->_mapcount.  Pages with a type set can't be mapped to
userspace.
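
A rough user-space model of the conflict described above, for illustration only.  The union, the constants and the demo_* names are hypothetical stand-ins, loosely modeled on a page type and a mapping counter sharing one 32-bit word; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants, assumed for this sketch only. */
#define DEMO_PAGE_TYPE_BASE 0xf0000000u
#define DEMO_PG_KMEMCG      0x00000200u

/* One word, two interpretations: a mapping counter or a set of type bits. */
union demo_mapcount {
	int32_t  mapcount;   /* -1 means "not mapped anywhere" */
	uint32_t page_type;  /* all ones means "no type set" */
};

static inline void demo_init(union demo_mapcount *m)
{
	m->mapcount = -1;
}

/* A type is "set" by clearing its bit in the shared word. */
static inline void demo_set_type(union demo_mapcount *m, uint32_t flag)
{
	m->page_type &= ~flag;
}

static inline int demo_has_type(const union demo_mapcount *m, uint32_t flag)
{
	return (m->page_type & (DEMO_PAGE_TYPE_BASE | flag)) ==
	       DEMO_PAGE_TYPE_BASE;
}

/* Simplification: the word is usable as a counter only while it is >= -1. */
static inline int demo_counter_is_usable(const union demo_mapcount *m)
{
	return m->mapcount >= -1;
}
```

In this model, once demo_set_type() clears the flag bit, the same word no longer reads as a sane counter, which is the reason a typed page cannot also carry a userspace mapping count.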

But in general the kmemcg flag has nothing to do with mapping to
userspace.  It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.

Some bpf maps are mapping the vmalloc-based memory to userspace, and their
memory can't be accounted because of this implementation detail.

This patchset removes this limitation by moving the PageKmemcg flag into
one of the free bits of the page->mem_cgroup pointer.  Also it formalizes
accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
adds several checks and removes a couple of obsolete functions.  As a
result the code became more robust with fewer open-coded bit tricks.
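
The idea of keeping a flag in a free low bit of an aligned pointer can be sketched in user space as follows.  This is a minimal sketch under stated assumptions: struct demo_memcg, DEMO_FLAG_KMEM, DEMO_FLAGS_MASK and the demo_* helpers are hypothetical names chosen for illustration, and the bit assignment is not taken from this series.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a memory cgroup; forced 8-byte alignment frees the low bits. */
struct demo_memcg {
	_Alignas(8) int id;
};

#define DEMO_FLAG_KMEM  ((uintptr_t)0x2)  /* hypothetical flag bit */
#define DEMO_FLAGS_MASK ((uintptr_t)0x3)  /* low bits reserved for flags */

/* Combine a pointer and flag bits into one word. */
static inline uintptr_t demo_pack(struct demo_memcg *memcg, uintptr_t flags)
{
	uintptr_t word = (uintptr_t)memcg;

	/* Alignment guarantees that the flag bits of the pointer are zero. */
	assert((word & DEMO_FLAGS_MASK) == 0);
	return word | (flags & DEMO_FLAGS_MASK);
}

/* Recover the pointer by masking the flag bits off. */
static inline struct demo_memcg *demo_unpack(uintptr_t word)
{
	return (struct demo_memcg *)(word & ~DEMO_FLAGS_MASK);
}

static inline int demo_is_kmem(uintptr_t word)
{
	return (word & DEMO_FLAG_KMEM) != 0;
}
```

Every reader has to go through demo_unpack() for this to be safe, which is why open-coded reads of the raw pointer get in the way of such a scheme.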

This patch (of 4):

Currently there are many open-coded reads of the page->mem_cgroup pointer,
as well as a couple of read helpers, which are barely used.

This creates an obstacle to reusing some bits of the pointer for
storing additional bits of information.  In fact, we already do this for
slab pages, where the last bit indicates that a pointer has an attached
vector of objcg pointers instead of a regular memcg pointer.

This commit uses 2 existing helpers and introduces a new helper to
convert all read sides to calls of these helpers:
  struct mem_cgroup *page_memcg(struct page *page);
  struct mem_cgroup *page_memcg_rcu(struct page *page);
  struct mem_cgroup *page_memcg_check(struct page *page);

page_memcg_check() is intended to be used in cases when the page can be a
slab page whose memcg pointer points at an objcg vector.  It checks the
lowest bit and, if it is set, returns NULL.  page_memcg() contains a
VM_BUG_ON_PAGE() check for the page not being a slab page.
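
The difference between the two helpers can be modeled in user space roughly like this.  It is a sketch, not the kernel code: struct page_model, its is_slab field and the model_* names are assumptions made for the example, assert() plays the role of VM_BUG_ON_PAGE(), and the READ_ONCE() used by the real helper is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mem_cgroup {
	int id;  /* stand-in for the kernel structure */
};

/* Hypothetical model of the two pieces of struct page that matter here. */
struct page_model {
	uintptr_t memcg_data;  /* memcg pointer, or objcg vector with bit 0 set */
	int is_slab;           /* models PageSlab(page) */
};

/* Caller promises the page is not a slab page; the check enforces it. */
static inline struct mem_cgroup *model_page_memcg(const struct page_model *p)
{
	assert(!p->is_slab);
	return (struct mem_cgroup *)p->memcg_data;
}

/* Accepts any page; a set lowest bit means "not a memcg pointer". */
static inline struct mem_cgroup *
model_page_memcg_check(const struct page_model *p)
{
	uintptr_t data = p->memcg_data;

	if (data & 0x1)
		return NULL;
	return (struct mem_cgroup *)data;
}
```

A caller that cannot rule out slab pages would use the _check variant and treat NULL as "no single owning memcg".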

To make sure nobody accesses the field directly, struct page's
mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
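
The effect of the type change on callers can be illustrated with a small sketch.  The names struct page_model, model_page_memcg() and model_count_page_event() are hypothetical, and the nr_events counter is invented for the example; only the shape of the conversion (read once through the helper, test, then use) follows the call sites touched by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mem_cgroup {
	long nr_events;  /* invented counter, for the example only */
};

struct page_model {
	uintptr_t memcg_data;  /* an integer now, no longer a typed pointer */
};

static inline struct mem_cgroup *model_page_memcg(const struct page_model *p)
{
	/* Models the helper's precondition: the word is not tagged. */
	assert((p->memcg_data & 0x1) == 0);
	return (struct mem_cgroup *)p->memcg_data;
}

static inline void model_count_page_event(const struct page_model *p)
{
	struct mem_cgroup *memcg = model_page_memcg(p);

	/*
	 * Writing p->memcg_data->nr_events++ here would not compile,
	 * because memcg_data is an integer; the helper is the only
	 * typed way to reach the cgroup.
	 */
	if (memcg)
		memcg->nr_events++;
}
```

Any leftover direct dereference of the old field therefore shows up as a build failure instead of silently reading a tagged value.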
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
parent 9e83f54f
......@@ -657,7 +657,7 @@ int __set_page_dirty_buffers(struct page *page)
} while (bh != head);
}
/*
-* Lock out page->mem_cgroup migration to keep PageDirty
+* Lock out page's memcg migration to keep PageDirty
* synchronized with per-memcg dirty page counters.
*/
lock_page_memcg(page);
......
......@@ -650,7 +650,7 @@ iomap_set_page_dirty(struct page *page)
return !TestSetPageDirty(page);
/*
-* Lock out page->mem_cgroup migration to keep PageDirty
+* Lock out page's memcg migration to keep PageDirty
* synchronized with per-memcg dirty page counters.
*/
lock_page_memcg(page);
......
......@@ -343,6 +343,79 @@ struct mem_cgroup {
extern struct mem_cgroup *root_mem_cgroup;
+/*
+* page_memcg - get the memory cgroup associated with a page
+* @page: a pointer to the page struct
+*
+* Returns a pointer to the memory cgroup associated with the page,
+* or NULL. This function assumes that the page is known to have a
+* proper memory cgroup pointer. It's not safe to call this function
+* against some type of pages, e.g. slab pages or ex-slab pages.
+*
+* Any of the following ensures page and memcg binding stability:
+* - the page lock
+* - LRU isolation
+* - lock_page_memcg()
+* - exclusive reference
+*/
+static inline struct mem_cgroup *page_memcg(struct page *page)
+{
+VM_BUG_ON_PAGE(PageSlab(page), page);
+return (struct mem_cgroup *)page->memcg_data;
+}
+/*
+* page_memcg_rcu - locklessly get the memory cgroup associated with a page
+* @page: a pointer to the page struct
+*
+* Returns a pointer to the memory cgroup associated with the page,
+* or NULL. This function assumes that the page is known to have a
+* proper memory cgroup pointer. It's not safe to call this function
+* against some type of pages, e.g. slab pages or ex-slab pages.
+*/
+static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
+{
+VM_BUG_ON_PAGE(PageSlab(page), page);
+WARN_ON_ONCE(!rcu_read_lock_held());
+return (struct mem_cgroup *)READ_ONCE(page->memcg_data);
+}
+/*
+* page_memcg_check - get the memory cgroup associated with a page
+* @page: a pointer to the page struct
+*
+* Returns a pointer to the memory cgroup associated with the page,
+* or NULL. This function unlike page_memcg() can take any page
+* as an argument. It has to be used in cases when it's not known if a page
+* has an associated memory cgroup pointer or an object cgroups vector.
+*
+* Any of the following ensures page and memcg binding stability:
+* - the page lock
+* - LRU isolation
+* - lock_page_memcg()
+* - exclusive reference
+*/
+static inline struct mem_cgroup *page_memcg_check(struct page *page)
+{
+/*
+* Because page->memcg_data might be changed asynchronously
+* for slab pages, READ_ONCE() should be used here.
+*/
+unsigned long memcg_data = READ_ONCE(page->memcg_data);
+/*
+* The lowest bit set means that memcg isn't a valid
+* memcg pointer, but a obj_cgroups pointer.
+* In this case the page is shared and doesn't belong
+* to any specific memory cgroup.
+*/
+if (memcg_data & 0x1UL)
+return NULL;
+return (struct mem_cgroup *)memcg_data;
+}
static __always_inline bool memcg_stat_item_in_bytes(int idx)
{
if (idx == MEMCG_PERCPU_B)
......@@ -743,15 +816,19 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
static inline void __mod_memcg_page_state(struct page *page,
int idx, int val)
{
-if (page->mem_cgroup)
-__mod_memcg_state(page->mem_cgroup, idx, val);
+struct mem_cgroup *memcg = page_memcg(page);
+if (memcg)
+__mod_memcg_state(memcg, idx, val);
}
static inline void mod_memcg_page_state(struct page *page,
int idx, int val)
{
-if (page->mem_cgroup)
-mod_memcg_state(page->mem_cgroup, idx, val);
+struct mem_cgroup *memcg = page_memcg(page);
+if (memcg)
+mod_memcg_state(memcg, idx, val);
}
static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
......@@ -834,16 +911,17 @@ static inline void __mod_lruvec_page_state(struct page *page,
enum node_stat_item idx, int val)
{
struct page *head = compound_head(page); /* rmap on tail pages */
+struct mem_cgroup *memcg = page_memcg(head);
pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;
/* Untracked pages have no memcg, no lruvec. Update only the node */
-if (!head->mem_cgroup) {
+if (!memcg) {
__mod_node_page_state(pgdat, idx, val);
return;
}
-lruvec = mem_cgroup_lruvec(head->mem_cgroup, pgdat);
+lruvec = mem_cgroup_lruvec(memcg, pgdat);
__mod_lruvec_state(lruvec, idx, val);
}
......@@ -878,8 +956,10 @@ static inline void count_memcg_events(struct mem_cgroup *memcg,
static inline void count_memcg_page_event(struct page *page,
enum vm_event_item idx)
{
-if (page->mem_cgroup)
-count_memcg_events(page->mem_cgroup, idx, 1);
+struct mem_cgroup *memcg = page_memcg(page);
+if (memcg)
+count_memcg_events(memcg, idx, 1);
}
static inline void count_memcg_event_mm(struct mm_struct *mm,
......@@ -941,6 +1021,22 @@ void mem_cgroup_split_huge_fixup(struct page *head);
struct mem_cgroup;
+static inline struct mem_cgroup *page_memcg(struct page *page)
+{
+return NULL;
+}
+static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
+{
+WARN_ON_ONCE(!rcu_read_lock_held());
+return NULL;
+}
+static inline struct mem_cgroup *page_memcg_check(struct page *page)
+{
+return NULL;
+}
static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
{
return true;
......@@ -1430,7 +1526,7 @@ static inline void mem_cgroup_track_foreign_dirty(struct page *page,
if (mem_cgroup_disabled())
return;
-if (unlikely(&page->mem_cgroup->css != wb->memcg_css))
+if (unlikely(&page_memcg(page)->css != wb->memcg_css))
mem_cgroup_track_foreign_dirty_slowpath(page, wb);
}
......
......@@ -1484,28 +1484,6 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
#endif
}
-#ifdef CONFIG_MEMCG
-static inline struct mem_cgroup *page_memcg(struct page *page)
-{
-return page->mem_cgroup;
-}
-static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
-{
-WARN_ON_ONCE(!rcu_read_lock_held());
-return READ_ONCE(page->mem_cgroup);
-}
-#else
-static inline struct mem_cgroup *page_memcg(struct page *page)
-{
-return NULL;
-}
-static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
-{
-WARN_ON_ONCE(!rcu_read_lock_held());
-return NULL;
-}
-#endif
/*
* Some inline functions in vmstat.h depend on page_zone()
*/
......
......@@ -199,10 +199,7 @@ struct page {
atomic_t _refcount;
#ifdef CONFIG_MEMCG
-union {
-struct mem_cgroup *mem_cgroup;
-struct obj_cgroup **obj_cgroups;
-};
+unsigned long memcg_data;
#endif
/*
......
......@@ -257,7 +257,7 @@ TRACE_EVENT(track_foreign_dirty,
__entry->ino = inode ? inode->i_ino : 0;
__entry->memcg_id = wb->memcg_css->id;
__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
-__entry->page_cgroup_ino = cgroup_ino(page->mem_cgroup->css.cgroup);
+__entry->page_cgroup_ino = cgroup_ino(page_memcg(page)->css.cgroup);
),
TP_printk("bdi %s[%llu]: ino=%lu memcg_id=%u cgroup_ino=%lu page_cgroup_ino=%lu",
......
......@@ -404,9 +404,10 @@ static int memcg_charge_kernel_stack(struct task_struct *tsk)
for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
/*
-* If memcg_kmem_charge_page() fails, page->mem_cgroup
-* pointer is NULL, and memcg_kmem_uncharge_page() in
-* free_thread_stack() will ignore this page.
+* If memcg_kmem_charge_page() fails, page's
+* memory cgroup pointer is NULL, and
+* memcg_kmem_uncharge_page() in free_thread_stack()
+* will ignore this page.
*/
ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL,
0);
......
......@@ -182,8 +182,8 @@ void __dump_page(struct page *page, const char *reason)
pr_warn("page dumped because: %s\n", reason);
#ifdef CONFIG_MEMCG
-if (!page_poisoned && page->mem_cgroup)
-pr_warn("page->mem_cgroup:%px\n", page->mem_cgroup);
+if (!page_poisoned && page->memcg_data)
+pr_warn("pages's memcg:%lx\n", page->memcg_data);
#endif
}
......
......@@ -470,7 +470,7 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
#ifdef CONFIG_MEMCG
static inline struct deferred_split *get_deferred_split_queue(struct page *page)
{
-struct mem_cgroup *memcg = compound_head(page)->mem_cgroup;
+struct mem_cgroup *memcg = page_memcg(compound_head(page));
struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));
if (memcg)
......@@ -2765,7 +2765,7 @@ void deferred_split_huge_page(struct page *page)
{
struct deferred_split *ds_queue = get_deferred_split_queue(page);
#ifdef CONFIG_MEMCG
-struct mem_cgroup *memcg = compound_head(page)->mem_cgroup;
+struct mem_cgroup *memcg = page_memcg(compound_head(page));
#endif
unsigned long flags;
......
This diff is collapsed.
......@@ -1092,7 +1092,7 @@ static inline bool page_expected_state(struct page *page,
if (unlikely((unsigned long)page->mapping |
page_ref_count(page) |
#ifdef CONFIG_MEMCG
-(unsigned long)page->mem_cgroup |
+(unsigned long)page_memcg(page) |
#endif
(page->flags & check_flags)))
return false;
......@@ -1117,7 +1117,7 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set";
}
#ifdef CONFIG_MEMCG
-if (unlikely(page->mem_cgroup))
+if (unlikely(page_memcg(page)))
bad_reason = "page still charged to cgroup";
#endif
return bad_reason;
......
......@@ -291,12 +291,14 @@ static inline void count_swpout_vm_event(struct page *page)
static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
{
struct cgroup_subsys_state *css;
+struct mem_cgroup *memcg;
-if (!page->mem_cgroup)
+memcg = page_memcg(page);
+if (!memcg)
return;
rcu_read_lock();
-css = cgroup_e_css(page->mem_cgroup->css.cgroup, &io_cgrp_subsys);
+css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
bio_associate_blkg_from_css(bio, css);
rcu_read_unlock();
}
......
......@@ -242,18 +242,17 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
{
/*
-* page->mem_cgroup and page->obj_cgroups are sharing the same
+* Page's memory cgroup and obj_cgroups vector are sharing the same
* space. To distinguish between them in case we don't know for sure
* that the page is a slab page (e.g. page_cgroup_ino()), let's
* always set the lowest bit of obj_cgroups.
*/
-return (struct obj_cgroup **)
-((unsigned long)page->obj_cgroups & ~0x1UL);
+return (struct obj_cgroup **)(page->memcg_data & ~0x1UL);
}
static inline bool page_has_obj_cgroups(struct page *page)
{
-return ((unsigned long)page->obj_cgroups & 0x1UL);
+return page->memcg_data & 0x1UL;
}
int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
......@@ -262,7 +261,7 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
static inline void memcg_free_page_obj_cgroups(struct page *page)
{
kfree(page_obj_cgroups(page));
-page->obj_cgroups = NULL;
+page->memcg_data = 0;
}
static inline size_t obj_full_size(struct kmem_cache *s)
......
......@@ -257,7 +257,7 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
struct lruvec *lruvec;
int memcgid;
-/* Page is fully exclusive and pins page->mem_cgroup */
+/* Page is fully exclusive and pins page's memory cgroup pointer */
VM_BUG_ON_PAGE(PageLRU(page), page);
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
......