    mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache · aabfb572
    Michal Hocko authored
    free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
    This is not a big deal for the normal release path, but it completely
    kills memcg uncharge batching, which exists to reduce res_counter
    spin_lock contention.  Dave noticed this with his page fault scalability
    test case on a large machine, where the lock was basically dominating on
    all CPUs:
    
        80.18%    80.18%  [kernel]               [k] _raw_spin_lock
                      |
                      --- _raw_spin_lock
                         |
                         |--66.59%-- res_counter_uncharge_until
                         |          res_counter_uncharge
                         |          uncharge_batch
                         |          uncharge_list
                         |          mem_cgroup_uncharge_list
                         |          release_pages
                         |          free_pages_and_swap_cache
                         |          tlb_flush_mmu_free
                         |          |
                         |          |--90.12%-- unmap_single_vma
                         |          |          unmap_vmas
                         |          |          unmap_region
                         |          |          do_munmap
                         |          |          vm_munmap
                         |          |          sys_munmap
                         |          |          system_call_fastpath
                         |          |          __GI___munmap
                         |          |
                         |           --9.88%-- tlb_flush_mmu
                         |                     tlb_finish_mmu
                         |                     unmap_region
                         |                     do_munmap
                         |                     vm_munmap
                         |                     sys_munmap
                         |                     system_call_fastpath
                         |                     __GI___munmap
    
    In his case the load was running in the root memcg, and that part has
    been handled by reverting commit 05b84301 ("mm: memcontrol: use
    root_mem_cgroup res_counter"), because it was a clear regression, but
    the problem remains inside dedicated memcgs.
    
    There is no reason to limit release_pages to PAGEVEC_SIZE batches other
    than lru_lock hold times.  This logic, however, can be moved inside the
    function.  mem_cgroup_uncharge_list and free_hot_cold_page_list do not
    hold any lock over the whole pages_to_free list, so it is safe to call
    each of them once for the entire run.
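
    As a rough sketch (based on the description above, not a verbatim quote
    of the patch; pages_to_free and the exact signatures follow the
    then-current code and may differ), the tail of release_pages() can hand
    the whole accumulated list over in one go:

        void release_pages(struct page **pages, int nr, bool cold)
        {
                int i;
                LIST_HEAD(pages_to_free);
                struct zone *zone = NULL;
                unsigned long flags;

                for (i = 0; i < nr; i++) {
                        struct page *page = pages[i];

                        /* ... per-page LRU handling; may take zone->lru_lock ... */
                        list_add(&page->lru, &pages_to_free);
                }
                if (zone)
                        spin_unlock_irqrestore(&zone->lru_lock, flags);

                /*
                 * Neither call holds a lock across the whole list, so the
                 * entire batch is uncharged and freed in a single run.
                 */
                mem_cgroup_uncharge_list(&pages_to_free);
                free_hot_cold_page_list(&pages_to_free, cold);
        }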
    
    The release_pages() code previously broke the lru_lock every
    PAGEVEC_SIZE (i.e. 14) pages.  However, this code makes no use of
    pagevecs, so switch to breaking the lock at least every
    SWAP_CLUSTER_MAX (32) pages.  This approximately halves the lock
    acquisition frequency and approximately doubles the maximum hold times.
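
    A minimal sketch of how that periodic lock breaking can look inside the
    per-page loop of release_pages() (the lock_batch counter name is
    illustrative; the zone pointer doubles as the "lock held" flag):

        /*
         * Keep the IRQ-disabled lru_lock hold time bounded when a long
         * run of pages comes from the same zone: drop the lock every
         * SWAP_CLUSTER_MAX pages and let it be retaken lazily.
         */
        if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
                spin_unlock_irqrestore(&zone->lru_lock, flags);
                zone = NULL;
        }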
    
    The now unneeded batching is removed from free_pages_and_swap_cache().
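
    With the batching gone, free_pages_and_swap_cache() collapses to
    roughly the following (a sketch per the description above, not a
    verbatim quote of the patch):

        void free_pages_and_swap_cache(struct page **pages, int nr)
        {
                struct page **pagep = pages;
                int i;

                lru_add_drain();
                for (i = 0; i < nr; i++)
                        free_swap_cache(pagep[i]);
                /* One call for the whole run; release_pages() batches internally. */
                release_pages(pagep, nr, false);
        }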
    
    Also update the grossly out-of-date release_pages documentation.
    Signed-off-by: Michal Hocko <mhocko@suse.cz>
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Reported-by: Dave Hansen <dave@sr71.net>
    Cc: Vladimir Davydov <vdavydov@parallels.com>
    Cc: Greg Thelen <gthelen@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>