Commit 70ddf637, authored by Anton Vorontsov, committed by Linus Torvalds

memcg: add memory.pressure_level events

With this patch, userland applications that want to keep interactivity
and memory allocation cost under control can use the pressure level
notifications.  The levels are defined as follows:

The "low" level means that the system is reclaiming memory for new
allocations.  Monitoring this reclaim activity might be useful for
maintaining the cache level.  Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (e.g.
shut down unimportant services early).

The "medium" level means that the system is experiencing medium memory
pressure: it might be swapping, paging out active file caches, etc.
Upon this event, applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from disk.

The "critical" level means that the system is actively thrashing: it is
about to run out of memory (OOM), or the in-kernel OOM killer is even
about to trigger.  Applications should do whatever they can to help the
system.  It might be too late to consult vmstat or any other statistics,
so it is advisable to take immediate action.

The events are propagated upward until the event is handled, i.e. the
events are not pass-through.  For example, suppose you have three
cgroups, A->B->C, you set up an event listener on each of A, B and C,
and group C experiences some pressure.  In this situation only group C
receives the notification; groups A and B do not.  This avoids excessive
"broadcasting" of messages, which disturbs the system and is especially
bad when we are low on memory or thrashing.  So, organize the cgroups
wisely, or propagate the events manually (or ask us to implement
pass-through events, explaining why you would need them).

Performance-wise, the memory pressure notification feature itself is
lightweight and does not require much bookkeeping, in contrast to the
rest of the memcg features.  Unfortunately, in the current memcg
implementation, page accounting is an inseparable part and cannot be
turned off.  The good news is that there are some efforts[1] to improve
the situation; in addition, implementing the same, fully
API-compatible[2] interface for the CONFIG_MEMCG=n case (e.g. embedded)
is also a viable option, so it would not require any changes on the
userland side.

[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROUPS=n warnings]
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent 84d96d89
@@ -40,6 +40,7 @@ Features:
 - soft limit
 - moving (recharging) account at moving a task is selectable.
 - usage threshold notifier
+- memory pressure notifier
 - oom-killer disable knob and oom-notifier
 - Root cgroup has no limit controls.
@@ -65,6 +66,7 @@ Brief summary of control files.
 memory.stat # show various statistics
 memory.use_hierarchy # set/show hierarchical account enabled
 memory.force_empty # trigger forced move charge to parent
+memory.pressure_level # set memory pressure notifications
 memory.swappiness # set/show swappiness parameter of vmscan
                   (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate # set/show controls of moving charges
@@ -762,7 +764,73 @@ At reading, current status of OOM is shown.
 under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
           be stopped.)

-11. TODO
+11. Memory Pressure
+
+The pressure level notifications can be used to monitor the memory
+allocation cost; based on the pressure, applications can implement
+different strategies of managing their memory resources. The pressure
+levels are defined as follows:
+
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring this reclaim activity might be useful for
+maintaining the cache level. Upon notification, the program (typically
+"Activity Manager") might analyze vmstat and act in advance (e.g.
+shut down unimportant services early).
+
+The "medium" level means that the system is experiencing medium memory
+pressure: it might be swapping, paging out active file caches, etc.
+Upon this event, applications may decide to further analyze
+vmstat/zoneinfo/memcg or internal memory usage statistics and free any
+resources that can be easily reconstructed or re-read from disk.
+
+The "critical" level means that the system is actively thrashing: it is
+about to run out of memory (OOM), or the in-kernel OOM killer is even
+about to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult vmstat or any other statistics,
+so it is advisable to take immediate action.
+
+The events are propagated upward until the event is handled, i.e. the
+events are not pass-through. For example, suppose you have three
+cgroups, A->B->C, you set up an event listener on each of A, B and C,
+and group C experiences some pressure. In this situation only group C
+receives the notification; groups A and B do not. This avoids excessive
+"broadcasting" of messages, which disturbs the system and is especially
+bad when we are low on memory or thrashing. So, organize the cgroups
+wisely, or propagate the events manually (or ask us to implement
+pass-through events, explaining why you would need them).
+
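The "propagated upward until handled" rule can be pictured with a small user-space model. This is only a sketch: `struct cg_node`, `has_listener` and `deliver_pressure_event()` are hypothetical names invented for illustration and do not appear in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cgroup in a hierarchy; not a kernel structure. */
struct cg_node {
	const char *name;
	struct cg_node *parent;	/* NULL at the top of the hierarchy */
	bool has_listener;	/* a listener is registered on this group */
};

/*
 * Walk upward from the group that experiences pressure. The first group
 * that has a listener handles the event and the walk stops there, so
 * ancestors above it are not notified. Returns the handling group, or
 * NULL if no group on the path has a listener.
 */
const struct cg_node *deliver_pressure_event(const struct cg_node *origin)
{
	const struct cg_node *cg;

	for (cg = origin; cg != NULL; cg = cg->parent)
		if (cg->has_listener)
			return cg;
	return NULL;
}
```

With listeners on A, B and C and pressure in C, the walk stops at C. If C had no listener, the event would be handled by B instead.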
+The file memory.pressure_level is only used to set up an eventfd. To
+register a notification, an application must:
+
+- create an eventfd using eventfd(2);
+- open memory.pressure_level;
+- write a string like "<event_fd> <fd of memory.pressure_level> <level>"
+  to cgroup.event_control.
+
+The application will be notified through the eventfd when memory
+pressure is at the specified level (or higher). Read/write operations
+on memory.pressure_level are not implemented.
+
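The three registration steps can be sketched in user-space C as follows. This is a minimal sketch, not the cgroup_event_listener tool: the helper names and `struct pressure_listener` are hypothetical, and keeping the memory.pressure_level descriptor open for the lifetime of the listener is a conservative assumption rather than a documented requirement.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical container for the descriptors a listener holds. */
struct pressure_listener {
	int event_fd;		/* read(2) blocks here until an event arrives */
	int pressure_fd;	/* open memory.pressure_level (kept open; assumption) */
};

/* Format "<event_fd> <fd of memory.pressure_level> <level>".
 * Returns the string length, or -1 if it does not fit in buf. */
int format_event_control(char *buf, size_t len, int event_fd,
			 int pressure_fd, const char *level)
{
	int n = snprintf(buf, len, "%d %d %s", event_fd, pressure_fd, level);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Register for `level` ("low", "medium" or "critical") events on the
 * cgroup directory cgroup_dir. Returns 0 on success, -1 on error. */
int register_pressure_listener(const char *cgroup_dir, const char *level,
			       struct pressure_listener *out)
{
	char path[4096], line[64];
	int efd, pfd = -1, cfd = -1, n;

	efd = eventfd(0, 0);			/* step 1: create an eventfd */
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.pressure_level", cgroup_dir);
	pfd = open(path, O_RDONLY);		/* step 2: open the pressure file */
	if (pfd < 0)
		goto fail;

	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	cfd = open(path, O_WRONLY);
	if (cfd < 0)
		goto fail;

	/* step 3: write the registration string to cgroup.event_control */
	n = format_event_control(line, sizeof(line), efd, pfd, level);
	if (n < 0 || write(cfd, line, (size_t)n) != (ssize_t)n)
		goto fail;

	close(cfd);
	out->event_fd = efd;
	out->pressure_fd = pfd;
	return 0;

fail:
	if (cfd >= 0)
		close(cfd);
	if (pfd >= 0)
		close(pfd);
	close(efd);
	return -1;
}

/* Block until one notification arrives. Returns 0 on success, -1 on error. */
int wait_for_pressure_event(int event_fd)
{
	uint64_t count;

	return read(event_fd, &count, sizeof(count)) == (ssize_t)sizeof(count)
		? 0 : -1;
}
```

A caller would pass a cgroup directory (for example the `foo` group created in the test script) and a level, then loop on `wait_for_pressure_event()` and free memory each time it returns.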
+Test:
+
+   Here is a small example script that makes a new cgroup, sets up a
+   memory limit, sets up a notification in the cgroup and then makes the
+   child cgroup experience critical pressure:
+
+   # cd /sys/fs/cgroup/memory/
+   # mkdir foo
+   # cd foo
+   # cgroup_event_listener memory.pressure_level low &
+   # echo 8000000 > memory.limit_in_bytes
+   # echo 8000000 > memory.memsw.limit_in_bytes
+   # echo $$ > tasks
+   # dd if=/dev/zero | read x
+
+   (Expect a bunch of notifications, and eventually, the oom-killer will
+   trigger.)
+
+12. TODO

 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
......
+#ifndef __LINUX_VMPRESSURE_H
+#define __LINUX_VMPRESSURE_H
+
+#include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/gfp.h>
+#include <linux/types.h>
+#include <linux/cgroup.h>
+
+struct vmpressure {
+	unsigned long scanned;
+	unsigned long reclaimed;
+	/* The lock is used to keep the scanned/reclaimed above in sync. */
+	struct mutex sr_lock;
+
+	/* The list of vmpressure_event structs. */
+	struct list_head events;
+	/* Have to grab the lock on events traversal or modifications. */
+	struct mutex events_lock;
+
+	struct work_struct work;
+};
+
+struct mem_cgroup;
+
+#ifdef CONFIG_MEMCG
+extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+		       unsigned long scanned, unsigned long reclaimed);
+extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
+
+extern void vmpressure_init(struct vmpressure *vmpr);
+extern struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg);
+extern struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr);
+extern struct vmpressure *css_to_vmpressure(struct cgroup_subsys_state *css);
+
+extern int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
+				     struct eventfd_ctx *eventfd,
+				     const char *args);
+extern void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
+					struct eventfd_ctx *eventfd);
+#else
+static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+			      unsigned long scanned, unsigned long reclaimed) {}
+static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
+				   int prio) {}
+#endif /* CONFIG_MEMCG */
+
+#endif /* __LINUX_VMPRESSURE_H */
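The diff of the file that consumes these counters is collapsed on this page, so the following is only a sketch of how the `scanned` and `reclaimed` fields of `struct vmpressure` could be mapped to the three levels. The formula, the function names and the 60/95 thresholds are all assumptions chosen for illustration; they are not taken from the code shown in this commit.

```c
/* Hypothetical level names mirroring "low", "medium" and "critical". */
enum pressure_level {
	PRESSURE_LOW,
	PRESSURE_MEDIUM,
	PRESSURE_CRITICAL,
};

/* Assumed thresholds: percent of scanned pages that were NOT reclaimed. */
#define MEDIUM_THRESHOLD	60
#define CRITICAL_THRESHOLD	95

/*
 * Percentage (0..100) of scanned pages that reclaim failed to free.
 * The less reclaim gets back per scanned page, the higher the pressure.
 * Overflow for extremely large counters is ignored in this sketch.
 */
unsigned long pressure_percent(unsigned long scanned, unsigned long reclaimed)
{
	if (scanned == 0 || reclaimed >= scanned)
		return 0;
	return (scanned - reclaimed) * 100 / scanned;
}

/* Map a pressure percentage to a level using the assumed thresholds. */
enum pressure_level pressure_to_level(unsigned long pressure)
{
	if (pressure >= CRITICAL_THRESHOLD)
		return PRESSURE_CRITICAL;
	if (pressure >= MEDIUM_THRESHOLD)
		return PRESSURE_MEDIUM;
	return PRESSURE_LOW;
}
```

Under these assumptions, scanning 100 pages and reclaiming 50 gives a pressure of 50 (low), reclaiming 30 gives 70 (medium), and reclaiming 2 gives 98 (critical).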
@@ -50,7 +50,7 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
-obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
......
@@ -49,6 +49,7 @@
 #include <linux/fs.h>
 #include <linux/seq_file.h>
 #include <linux/vmalloc.h>
+#include <linux/vmpressure.h>
 #include <linux/mm_inline.h>
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
@@ -261,6 +262,9 @@ struct mem_cgroup {
 	 */
 	struct res_counter res;

+	/* vmpressure notifications */
+	struct vmpressure vmpressure;
+
 	union {
 		/*
 		 * the counter to account for mem+swap usage.
@@ -359,6 +363,7 @@ struct mem_cgroup {
 	atomic_t numainfo_events;
 	atomic_t numainfo_updating;
 #endif
+
 	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
@@ -510,6 +515,24 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
 	return container_of(s, struct mem_cgroup, css);
 }

+/* Some nice accessors for the vmpressure. */
+struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
+{
+	if (!memcg)
+		memcg = root_mem_cgroup;
+	return &memcg->vmpressure;
+}
+
+struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
+{
+	return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
+}
+
+struct vmpressure *css_to_vmpressure(struct cgroup_subsys_state *css)
+{
+	return &mem_cgroup_from_css(css)->vmpressure;
+}
+
 static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 {
 	return (memcg == root_mem_cgroup);
@@ -5907,6 +5930,11 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "pressure_level",
+		.register_event = vmpressure_register_event,
+		.unregister_event = vmpressure_unregister_event,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",
@@ -6188,6 +6216,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
 	memcg->move_charge_at_immigrate = 0;
 	mutex_init(&memcg->thresholds_lock);
 	spin_lock_init(&memcg->move_lock);
+	vmpressure_init(&memcg->vmpressure);

 	return &memcg->css;
......
This diff is collapsed.
@@ -19,6 +19,7 @@
 #include <linux/pagemap.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/vmpressure.h>
 #include <linux/vmstat.h>
 #include <linux/file.h>
 #include <linux/writeback.h>
@@ -1982,6 +1983,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 			}
 			memcg = mem_cgroup_iter(root, memcg, &reclaim);
 		} while (memcg);
+
+		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
+			   sc->nr_scanned - nr_scanned,
+			   sc->nr_reclaimed - nr_reclaimed);
+
 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 }
@@ -2167,6 +2173,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		count_vm_event(ALLOCSTALL);

 	do {
+		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
+				sc->priority);
 		sc->nr_scanned = 0;
 		aborted_reclaim = shrink_zones(zonelist, sc);
......