commit 241994ed
Author: Johannes Weiner <hannes@cmpxchg.org>

    mm: memcontrol: default hierarchy interface for memory

    Introduce the basic control files to account, partition, and limit
    memory using cgroups in default hierarchy mode.
    
    This interface versioning allows us to address fundamental design
    issues in the existing memory cgroup interface, further explained
    below.  The old interface will be maintained indefinitely, but a
    clearer model and improved workload performance should encourage
    existing users to switch over to the new one eventually.
    
    The control files are thus:
    
      - memory.current shows the current consumption of the cgroup and its
        descendants, in bytes.
    
      - memory.low configures the lower end of the cgroup's expected
        memory consumption range.  The kernel considers memory below that
        boundary to be a reserve - the minimum that the workload needs in
        order to make forward progress - and generally avoids reclaiming
        it, unless there is an imminent risk of entering an OOM situation.
    
      - memory.high configures the upper end of the cgroup's expected
        memory consumption range.  A cgroup whose consumption grows beyond
        this threshold is forced into direct reclaim, to work off the
        excess and to throttle new allocations heavily, but is generally
        allowed to continue and the OOM killer is not invoked.
    
      - memory.max configures the hard maximum amount of memory that the
        cgroup is allowed to consume before the OOM killer is invoked.
    
      - memory.events shows event counters that indicate how often the
        cgroup was reclaimed while below memory.low, how often it was
        forced to reclaim excess beyond memory.high, how often it hit
        memory.max, and how often it entered OOM due to memory.max.  This
        allows users to identify configuration problems when observing a
        degradation in workload performance.  An overcommitted system will
        have an increased rate of low boundary breaches, whereas increased
        rates of high limit breaches, maximum hits, or even OOM situations
        will indicate internally overcommitted cgroups.
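
    As a rough sketch of how this interface might be driven from
    userspace (illustration only, not part of this patch; the
    /sys/fs/cgroup mount point and the "workload" group name are
    assumptions):

      /* Configure a group's expected memory range and read back its
       * current usage.  Limits are written as plain byte values.
       */
      #include <stdio.h>

      static int write_str(const char *path, const char *val)
      {
              FILE *f = fopen(path, "w");

              if (!f)
                      return -1;
              fputs(val, f);
              return fclose(f);
      }

      int main(void)
      {
              char buf[64];
              FILE *f;

              /* Reserve 512M, reclaim past 1G, OOM-kill past 2G. */
              write_str("/sys/fs/cgroup/workload/memory.low", "536870912");
              write_str("/sys/fs/cgroup/workload/memory.high", "1073741824");
              write_str("/sys/fs/cgroup/workload/memory.max", "2147483648");

              /* Consumption of the group and its descendants, in bytes. */
              f = fopen("/sys/fs/cgroup/workload/memory.current", "r");
              if (f) {
                      if (fgets(buf, sizeof(buf), f))
                              printf("usage: %s", buf);
                      fclose(f);
              }
              return 0;
      }

    memory.events can be read the same way; it exposes one counter
    each for the low, high, max, and oom events described above.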
    
    For existing users of memory cgroups, the following deviations from
    the current interface are worth pointing out and explaining:
    
      - The original lower boundary, the soft limit, is defined as a
        limit that is unset by default.  As a result, the set of
        cgroups that global reclaim prefers is opt-in, rather than
        opt-out.  The costs for optimizing these mostly negative
        lookups are so high that the implementation, despite its
        enormous size, does not even provide the basic desirable
        behavior.  First off, the soft limit has no hierarchical
        meaning.  All configured groups are organized in a global
        rbtree and treated like equal peers, regardless of where they
        are located in the hierarchy.  This makes subtree delegation
        impossible.  Second, the soft limit reclaim pass is so
        aggressive that it not only introduces high allocation
        latencies into the system, but also impacts system performance
        due to overreclaim, to the point where the feature becomes
        self-defeating.
    
        The memory.low boundary, on the other hand, is a top-down
        allocated reserve.  A cgroup enjoys reclaim protection when it
        and all its ancestors are below their low boundaries, which
        makes delegation of subtrees possible.  Second, new cgroups
        have no reserve by default, so in the common case most cgroups
        are eligible for the preferred reclaim pass.  This allows the
        new low boundary to be efficiently implemented with just a
        minor addition to the generic reclaim code, without the need
        for out-of-band data structures and reclaim passes.  Because
        the generic reclaim code considers all cgroups except those
        below their low boundary in the preferred first reclaim pass,
        overreclaim of individual groups is eliminated as well,
        resulting in much better overall workload performance.
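
        In kernel terms, that check reduces to a short walk up the
        hierarchy, roughly along these lines (a simplified sketch of
        the reclaim-side test; the real code also special-cases the
        root cgroup and a disabled controller):

          /* A cgroup is protected from reclaim only while it and all
           * of its ancestors up to the reclaim root sit below their
           * low boundaries.
           */
          static bool mem_cgroup_low(struct mem_cgroup *root,
                                     struct mem_cgroup *memcg)
          {
                  for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
                          if (page_counter_read(&memcg->memory) > memcg->low)
                                  return false;
                  }
                  return true;
          }

        Because the protected set is determined on the fly like this,
        no out-of-band rbtree or extra reclaim pass is needed.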
    
      - The original high boundary, the hard limit, is defined as a
        strict limit that cannot budge, even if the OOM killer has to
        be called.  But this generally goes against the goal of making
        the most out of the available memory.  The memory consumption
        of workloads varies during runtime, and that requires users to
        overcommit.  But doing that with a strict upper limit requires
        either a fairly accurate prediction of the working set size or
        adding slack to the limit.  Since working set size estimation
        is hard and error-prone, and getting it wrong results in OOM
        kills, most users tend to err on the side of a looser limit
        and end up wasting precious resources.
    
        The memory.high boundary, on the other hand, can be set much
        more conservatively.  When hit, it throttles allocations by
        forcing them into direct reclaim to work off the excess, but
        it never invokes the OOM killer.  As a result, a high boundary
        that is chosen too aggressively will not terminate processes,
        but will instead lead to gradual performance degradation.  The
        user can monitor this and make corrections until the minimal
        memory footprint that still gives acceptable performance is
        found.
    
        In extreme cases, with many concurrent allocations and a
        complete breakdown of reclaim progress within the group, the
        high boundary can be exceeded.  But even then it's mostly
        better to satisfy the allocation from the slack available in
        other groups or the rest of the system than to kill the group.
        Otherwise, memory.max is there to limit this type of spillover
        and ultimately contain buggy or even malicious applications.
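
        Schematically, the charge path can enforce these semantics by
        entering direct reclaim without ever escalating to a kill,
        along these lines (a simplified sketch; the helper name is
        illustrative, and the event accounting feeds the memory.events
        counters described above):

          /* After charging, synchronously work off any excess above
           * the high boundary.  Unlike memory.max, this path never
           * invokes the OOM killer.
           */
          static void reclaim_high(struct mem_cgroup *memcg,
                                   unsigned long nr_pages, gfp_t gfp_mask)
          {
                  if (page_counter_read(&memcg->memory) <= memcg->high)
                          return;
                  mem_cgroup_events(memcg, MEMCG_HIGH, 1);
                  try_to_free_mem_cgroup_pages(memcg, nr_pages,
                                               gfp_mask, true);
          }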
    
      - The original control file names are unwieldy and inconsistent
        in many different ways.  For example, the upper boundary hit
        count is exported in the memory.failcnt file, but an OOM event
        count has to be obtained manually by listening for
        memory.oom_control events, and lower boundary / soft limit
        events have to be counted by first setting a threshold for
        that value and then counting crossings of it.  Also, usage and
        limit files encode their units in the filename.  That makes
        the filenames very long, even though this is not information
        that a user needs to be reminded of every time they type out
        those names.
    
        To address these naming issues, as well as to signal clearly
        that the new interface carries a new configuration model, the
        naming conventions in it necessarily differ from those of the
        old interface.
    
      - The original limit files indicate the state of an unset limit
        with a very high number, and a configured limit can be unset
        by echoing -1 into those files.  But that very high number is
        implementation- and architecture-dependent and not very
        descriptive.  And while -1 can be understood as an underflow
        into the highest possible value, -2 or -10M etc. do not work,
        so it's not consistent.
    
        memory.low, memory.high, and memory.max will use the string
        "infinity" to indicate and set the highest possible value.
    
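    For example, a configured hard limit can be lifted again like
    this (a sketch; mount point and group name as assumed above):

      #include <stdio.h>

      int main(void)
      {
              FILE *f = fopen("/sys/fs/cgroup/workload/memory.max", "w");

              if (f) {
                      fputs("infinity", f);  /* back to the unset state */
                      fclose(f);
              }
              return 0;
      }
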
    [akpm@linux-foundation.org: use seq_puts() for basic strings]
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Michal Hocko <mhocko@suse.cz>
    Cc: Vladimir Davydov <vdavydov@parallels.com>
    Cc: Greg Thelen <gthelen@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>