    memcg: avoid css_get() · f75ca962
    KAMEZAWA Hiroyuki authored
    Currently, the memory cgroup increments the reference count of the css
    (cgroup subsys state) for each charged page, and the reference is held
    until the page is uncharged.  This has three bad effects:
    
     1. Because css_get()/css_put() call atomic_inc()/atomic_dec(), heavy
        use of them on a large SMP system will not scale well (see the
        sketch after this list).
     2. Because css's refcnt can never reach a "ready-to-release" state
        while pages are charged, cgroup's notify_on_release handler can't
        work with memcg.
     3. css's refcnt is an atomic_t, i.e. only 32 bits wide.  That may be
        too small for a per-page counter.
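
    The scaling problem in item 1 is ordinary cache-line contention on one
    shared counter.  The userspace sketch below is not the kernel code;
    css_refcnt and charge_pages() are illustrative names.  It shows the
    access pattern: every thread's per-page "charge" atomically bumps the
    same word, so the cache line ping-pongs between CPUs.

      /*
       * Userspace sketch only; names are illustrative, not the kernel's.
       * Every "charge" atomically bumps one shared counter, so all CPUs
       * fight over the same cache line -- the pattern item 1 describes.
       */
      #include <stdatomic.h>
      #include <pthread.h>
      #include <stdio.h>

      static atomic_long css_refcnt;      /* one counter shared by all CPUs */

      static void *charge_pages(void *arg)
      {
              for (long i = 0; i < 1000000; i++)
                      atomic_fetch_add(&css_refcnt, 1);  /* css_get() per page */
              return NULL;
      }

      int main(void)
      {
              pthread_t t[8];

              for (int i = 0; i < 8; i++)
                      pthread_create(&t[i], NULL, charge_pages, NULL);
              for (int i = 0; i < 8; i++)
                      pthread_join(t[i], NULL);
              printf("refcnt = %ld\n", (long)atomic_load(&css_refcnt));
              return 0;
      }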
    
    This has been a problem since the first merge of memcg.
    
    This patch is an attempt to remove css's per-page refcnt.  Even without
    that refcnt, pre_destroy() performs enough synchronization, as sketched
    after this list:
      - check res->usage == 0.
      - check no pages are on the LRU.
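
    A hedged sketch of that check, under illustrative assumptions:
    struct memcg_sketch, its fields, and pre_destroy_check() are stand-ins,
    not the kernel's actual structures.  The group is ready to release only
    when both conditions hold.

      /*
       * Illustrative stand-in, not the kernel's pre_destroy(): the css may
       * be released only when nothing is charged and no page remains on
       * the group's LRU lists.
       */
      #include <stdio.h>

      struct memcg_sketch {
              long usage;             /* bytes currently charged (res->usage) */
              int  nr_lru_pages;      /* pages on this group's LRU lists */
      };

      /* Return 0 when it is safe to release the css, -1 otherwise. */
      static int pre_destroy_check(const struct memcg_sketch *memcg)
      {
              if (memcg->usage != 0)
                      return -1;      /* pages still charged */
              if (memcg->nr_lru_pages != 0)
                      return -1;      /* pages still linked on an LRU */
              return 0;               /* no user left: ready to release */
      }

      int main(void)
      {
              struct memcg_sketch busy = { .usage = 4096, .nr_lru_pages = 1 };
              struct memcg_sketch idle = { 0, 0 };

              printf("busy=%d idle=%d\n",
                     pre_destroy_check(&busy), pre_destroy_check(&idle));
              return 0;
      }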
    
    This patch removes css's refcnt per page.  Even after this patch, at
    first glance it seems css_get() is still called in try_charge().
    
    But the logic is:
    
      - If the memcg of mm->owner is the cached one, consume_stock() will
        succeed.  On success, return immediately.
      - If consume_stock() returns false, css_get() is called and we enter
        the slow path, which may block.  At the end of the slow path,
        css_put() is called, and we restart from the beginning if necessary.
    
    So, in the fast path, we don't call css_get() and avoid touching the
    shared counter.  This patch makes the most common case fast; a minimal
    sketch of the split follows.
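
    Below is a userspace sketch of that fast/slow split, under stated
    assumptions: the kernel's consume_stock() drains a per-cpu precharge
    cache, which a per-thread variable imitates here; try_charge(), stock,
    and CHARGE_BATCH are illustrative names, not the real memcontrol.c
    API.  The point is that shared-counter traffic happens once per batch
    of pages instead of once per page.

      /*
       * Userspace sketch of the fast/slow path split; a per-thread cache
       * stands in for the kernel's per-cpu stock.  Names are illustrative.
       */
      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      #define CHARGE_BATCH 32              /* pages precharged per refill */

      static atomic_long css_refcnt;       /* shared; slow path only */
      static atomic_long res_usage;        /* total pages charged to the group */
      static _Thread_local long stock;     /* per-"cpu" cached precharge */

      /* Fast path: consume a cached charge; no shared-counter access. */
      static bool consume_stock(void)
      {
              if (stock > 0) {
                      stock--;
                      return true;
              }
              return false;
      }

      static void try_charge(void)
      {
              if (consume_stock())
                      return;                      /* fast path: no css_get() */

              /* Slow path: pin the css while we may block, refill the stock. */
              atomic_fetch_add(&css_refcnt, 1);    /* css_get() */
              atomic_fetch_add(&res_usage, CHARGE_BATCH);
              stock = CHARGE_BATCH - 1;            /* one page used right away */
              atomic_fetch_sub(&css_refcnt, 1);    /* css_put() */
      }

      int main(void)
      {
              for (int i = 0; i < 100; i++)
                      try_charge();                /* only 4 slow-path trips */
              printf("usage=%ld refcnt=%ld\n",
                     (long)atomic_load(&res_usage),
                     (long)atomic_load(&css_refcnt));
              return 0;
      }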
    
    Here is the result of a multi-threaded page fault benchmark:
    
    [Before]
        25.32%  multi-fault-all  [kernel.kallsyms]      [k] clear_page_c
         9.30%  multi-fault-all  [kernel.kallsyms]      [k] _raw_spin_lock_irqsave
         8.02%  multi-fault-all  [kernel.kallsyms]      [k] try_get_mem_cgroup_from_mm <=====(*)
         7.83%  multi-fault-all  [kernel.kallsyms]      [k] down_read_trylock
         5.38%  multi-fault-all  [kernel.kallsyms]      [k] __css_put
         5.29%  multi-fault-all  [kernel.kallsyms]      [k] __alloc_pages_nodemask
         4.92%  multi-fault-all  [kernel.kallsyms]      [k] _raw_spin_lock_irq
         4.24%  multi-fault-all  [kernel.kallsyms]      [k] up_read
         3.53%  multi-fault-all  [kernel.kallsyms]      [k] css_put
         2.11%  multi-fault-all  [kernel.kallsyms]      [k] handle_mm_fault
         1.76%  multi-fault-all  [kernel.kallsyms]      [k] __rmqueue
         1.64%  multi-fault-all  [kernel.kallsyms]      [k] __mem_cgroup_commit_charge
    
    [After]
        28.41%  multi-fault-all  [kernel.kallsyms]      [k] clear_page_c
        10.08%  multi-fault-all  [kernel.kallsyms]      [k] _raw_spin_lock_irq
         9.58%  multi-fault-all  [kernel.kallsyms]      [k] down_read_trylock
         9.38%  multi-fault-all  [kernel.kallsyms]      [k] _raw_spin_lock_irqsave
         5.86%  multi-fault-all  [kernel.kallsyms]      [k] __alloc_pages_nodemask
         5.65%  multi-fault-all  [kernel.kallsyms]      [k] up_read
         2.82%  multi-fault-all  [kernel.kallsyms]      [k] handle_mm_fault
         2.64%  multi-fault-all  [kernel.kallsyms]      [k] mem_cgroup_add_lru_list
         2.48%  multi-fault-all  [kernel.kallsyms]      [k] __mem_cgroup_commit_charge
    
    The 8.02% spent in try_get_mem_cgroup_from_mm() disappears because this
    patch removes the css_tryget() inside it.  (But yes, this is an extreme
    case.)
    Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
    Cc: Balbir Singh <balbir@in.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>