• Lukasz Odzioba's avatar
    mm/swap.c: flush lru pvecs on compound page arrival · af809c0e
    Lukasz Odzioba authored
    commit 8f182270 upstream.
    
    Currently we can have compound pages held on per cpu pagevecs, which
    leads to a lot of memory unavailable for reclaim when needed.  In the
    systems with hundreads of processors it can be GBs of memory.
    
    On of the way of reproducing the problem is to not call munmap
    explicitly on all mapped regions (i.e.  after receiving SIGTERM).  After
    that some pages (with THP enabled also huge pages) may end up on
    lru_add_pvec, example below.
    
      void main() {
      #pragma omp parallel
      {
    	size_t size = 55 * 1000 * 1000; // smaller than  MEM/CPUS
    	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
    		MAP_PRIVATE | MAP_ANONYMOUS , -1, 0);
    	if (p != MAP_FAILED)
    		memset(p, 0, size);
    	//munmap(p, size); // uncomment to make the problem go away
      }
      }
    
    When we run it with THP enabled it will leave significant amount of
    memory on lru_add_pvec.  This memory will be not reclaimed if we hit
    OOM, so when we run above program in a loop:
    
    	for i in `seq 100`; do ./a.out; done
    
    many processes (95% in my case) will be killed by OOM.
    
    The primary point of the LRU add cache is to save the zone lru_lock
    contention with a hope that more pages will belong to the same zone and
    so their addition can be batched.  The huge page is already a form of
    batched addition (it will add 512 worth of memory in one go) so skipping
    the batching seems like a safer option when compared to a potential
    excess in the caching which can be quite large and much harder to fix
    because lru_add_drain_all is way to expensive and it is not really clear
    what would be a good moment to call it.
    
    Similarly we can reproduce the problem on lru_deactivate_pvec by adding:
    madvise(p, size, MADV_FREE); after memset.
    
    This patch flushes lru pvecs on compound page arrival making the problem
    less severe - after applying it kill rate of above example drops to 0%,
    due to reducing maximum amount of memory held on pvec from 28MB (with
    THP) to 56kB per CPU.
    Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
    Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.comSigned-off-by: default avatarLukasz Odzioba <lukasz.odzioba@intel.com>
    Acked-by: default avatarMichal Hocko <mhocko@suse.com>
    Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Vladimir Davydov <vdavydov@parallels.com>
    Cc: Ming Li <mingli199x@qq.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    af809c0e
swap.c 26.3 KB