    memcg: do not hang on OOM when killed by userspace OOM access to memory reserves · d8dc595c
    Michal Hocko authored
    Eric has reported that he regularly sees task(s) stuck in the memcg OOM
    handler.  The only way out is to
    
    	echo 0 > $GROUP/memory.oom_control
    
    His use case is:
    
    - Set up a hierarchy with memory and the freezer (disable the kernel OOM
      killer and have a process watch for OOM).
    
    - In that memory cgroup add a process with one thread per cpu.
    
    - In one thread, slowly allocate memory (I think it is 16M of RAM once
      per second), then mlock and dirty it (just to force the pages into RAM
      and keep them there).
    
    - When OOM is reached, loop:
      * attempt to freeze all of the tasks.
      * if frozen, send every task SIGKILL, unfreeze, and remove the
        directory in cgroupfs.
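
    One iteration of the handler loop described above could look roughly
    like the sketch below. It is only an illustration, not Eric's actual
    handler: it assumes a cgroup v1 hierarchy with the memory and freezer
    controllers mounted together, uses the standard freezer.state and
    cgroup.procs control files, and the helper names (cg_write,
    cg_read_line, kill_all_tasks, try_teardown) are hypothetical.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a string to a control file inside the cgroup directory. */
int cg_write(const char *group, const char *file, const char *value)
{
	char path[4096];
	snprintf(path, sizeof(path), "%s/%s", group, file);
	FILE *f = fopen(path, "w");
	if (!f)
		return -1;
	int rc = (fputs(value, f) == EOF) ? -1 : 0;
	if (fclose(f) == EOF)
		rc = -1;
	return rc;
}

/* Read the first line of a control file, without the trailing newline. */
int cg_read_line(const char *group, const char *file, char *buf, size_t len)
{
	char path[4096];
	snprintf(path, sizeof(path), "%s/%s", group, file);
	FILE *f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f))
		buf[0] = '\0';
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Send SIGKILL to every process listed in cgroup.procs.
 * Killing a process takes all of its threads down with it. */
int kill_all_tasks(const char *group)
{
	char path[4096];
	long id;
	snprintf(path, sizeof(path), "%s/cgroup.procs", group);
	FILE *f = fopen(path, "r");
	if (!f)
		return -1;
	while (fscanf(f, "%ld", &id) == 1)
		kill((pid_t)id, SIGKILL);
	fclose(f);
	return 0;
}

/* One loop iteration: returns 0 once the group directory is removed,
 * -1 if the caller should retry (not frozen yet, or rmdir failed
 * because tasks have not finished exiting). */
int try_teardown(const char *group)
{
	char state[32];
	if (cg_write(group, "freezer.state", "FROZEN") < 0)
		return -1;
	if (cg_read_line(group, "freezer.state", state, sizeof(state)) < 0)
		return -1;
	if (strcmp(state, "FROZEN") != 0)
		return -1; /* e.g. still FREEZING */
	kill_all_tasks(group);
	if (cg_write(group, "freezer.state", "THAWED") < 0)
		return -1;
	return rmdir(group);
}
```

    The hang reported here shows up in the last step: if the killed tasks
    never leave the memcg OOM handler, the group never becomes empty and
    rmdir keeps failing.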
    
    Eric has then pinpointed the issue to be memcg specific.
    
    All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
    Those that have received a fatal signal will bypass the charge and should
    continue on their way out.  The tricky part is that the exit path might
    trigger a page fault (e.g.  exit_robust_list), and thus a memcg charge,
    while the memcg is still under OOM because nobody has released any charges
    yet.
    
    Unlike with the in-kernel OOM handler, the exiting task doesn't get
    TIF_MEMDIE set, so its further charges are not short-circuited.  The task
    falls into the memcg OOM path again, this time with no way out because
    no fatal signal is pending anymore.
    
    This patch fixes the issue by checking PF_EXITING early in
    mem_cgroup_try_charge and bypassing the charge, the same as if the task
    had a fatal signal pending or TIF_MEMDIE set.
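
    The effect of the extra check can be illustrated with a small userspace
    model. This is not the kernel code: struct model_task, the MODEL_* flag
    values and both function names are made up for the illustration, and
    only the shape of the condition follows the description above.

```c
#include <stdbool.h>

/* Illustrative flag values for this model; not taken from kernel headers. */
#define MODEL_PF_EXITING   0x1u /* task has entered the exit path */
#define MODEL_TIF_MEMDIE   0x2u /* task was picked by the in-kernel OOM killer */
#define MODEL_FATAL_SIGNAL 0x4u /* a fatal signal is still pending */

struct model_task {
	unsigned int state; /* bitwise OR of the MODEL_* flags above */
};

/* Condition before the patch: only OOM victims and tasks with a
 * pending fatal signal skip the charge. */
bool bypass_charge_old(const struct model_task *t)
{
	return (t->state & (MODEL_TIF_MEMDIE | MODEL_FATAL_SIGNAL)) != 0;
}

/* Condition after the patch: a task that is already exiting skips
 * the charge as well. */
bool bypass_charge_new(const struct model_task *t)
{
	return bypass_charge_old(t) || (t->state & MODEL_PF_EXITING) != 0;
}
```

    A task killed by the userspace handler that faults during exit has only
    the exiting flag left: the old condition sends it back into the memcg
    OOM wait, the new one lets it through.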
    
    Tasks that exit normally (i.e. not killed) will now bypass the charge as
    well, but this should be OK: the task is leaving and will release its
    memory, and increasing memory pressure just to release it a moment later
    seems like a dubious waste of cycles.  Besides that, charges after
    exit_signals should be rare.
    
    I am bringing this patch up again (rebased on the current mmotm tree). I
    hope we can finally move forward. If there is still opposition, then
    I would really appreciate a competing proposal so that we can discuss
    alternatives.
    
    http://comments.gmane.org/gmane.linux.kernel.stable/77650 is a reference
    to the follow-up discussion from when the patch was last dropped from
    mmotm.
    Reported-by: Eric W. Biederman <ebiederm@xmission.com>
    Signed-off-by: Michal Hocko <mhocko@suse.cz>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>