    swap: try to scan more free slots even when fragmented · ed43af10
    Huang Ying authored
    Currently, the scalability of the swap code drops considerably when the
    swap device becomes fragmented, because swap slot allocation batching
    stops working.  To solve this problem, this patch tries to scan a few
    more swap slots, with strictly restricted effort, to batch the swap
    slot allocation even when the swap device is fragmented.  Tests show
    that the benchmark score can increase by up to 37.1% with the patch.
    Details are as follows.
    
    The swap code has a per-CPU cache of swap slots.  These caches batch
    swap space allocations to improve swap subsystem scalability.  In the
    following code path,
    
      add_to_swap()
        get_swap_page()
          refill_swap_slots_cache()
            get_swap_pages()
              scan_swap_map_slots()
    
    scan_swap_map_slots() and get_swap_pages() can return multiple swap
    slots per call.  These slots are cached in the per-CPU swap slots
    cache, so that subsequent swap slot requests can be fulfilled there,
    avoiding lock contention in the lower-level swap space
    allocation/freeing code path.
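
    To make the batching concrete, below is a minimal user-space sketch of
    the caching idea, not the kernel's actual swap slots cache code; the
    names slot_cache, refill_cache(), get_slot() and CACHE_NR are
    illustrative stand-ins.  One locked call refills a small cache with a
    batch of slots, and later requests are served from the cache without
    touching the lock:

      /*
       * Minimal user-space sketch of the slot caching idea (not the
       * kernel's swap slots cache code): refill grabs a batch of slots
       * under a single lock acquisition; later requests are served from
       * the cache without touching the lock.  All names are illustrative.
       */
      #include <pthread.h>
      #include <stdio.h>

      #define CACHE_NR 8

      static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;
      static long next_slot;  /* stands in for the real slot allocator */

      struct slot_cache {
              long slots[CACHE_NR];
              int nr;             /* slots left in the cache */
      };

      /* One locked call returns up to CACHE_NR slots (the batching step). */
      static void refill_cache(struct slot_cache *c)
      {
              pthread_mutex_lock(&swap_lock);
              for (c->nr = 0; c->nr < CACHE_NR; c->nr++)
                      c->slots[c->nr] = next_slot++;
              pthread_mutex_unlock(&swap_lock);
      }

      /* Most calls are served from the cache, avoiding swap_lock. */
      static long get_slot(struct slot_cache *c)
      {
              if (c->nr == 0)
                      refill_cache(c);
              return c->slots[--c->nr];
      }

      int main(void)
      {
              struct slot_cache cache = { .nr = 0 };

              for (int i = 0; i < 20; i++)
                      printf("slot %ld\n", get_slot(&cache));
              return 0;
      }

    While free clusters are available, one refill amortizes a lock
    acquisition over CACHE_NR requests; this is exactly the batching that
    fragmentation breaks, as described next.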
    
    But this only works when there are free swap clusters.  If a swap
    device becomes so fragmented that there are no free swap clusters,
    scan_swap_map_slots() and get_swap_pages() will return only one swap
    slot per call in the above code path.  Effectively, this falls back to
    the situation before the swap slots cache was introduced; the heavy
    contention on the swap-related locks kills the scalability.
    
    Why does it work this way?  Because the swap device can be large and
    free swap slot scanning can be quite time consuming, a conservative
    method was used to avoid spending too much time scanning for free swap
    slots.
    
    In fact, this can be improved by scanning a few more free slots with
    strictly restricted effort, which is what this patch implements.  In
    scan_swap_map_slots(), after the first free swap slot is found, we
    keep scanning for a few more slots, but only if we haven't already
    scanned too many slots (< LATENCY_LIMIT).  That is, the added scanning
    latency is strictly bounded.
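
    Below is a minimal user-space sketch of that bounded extra scanning,
    not the actual scan_swap_map_slots() code; swap_map, LATENCY_LIMIT,
    BATCH_SIZE and scan_free_slots() here are illustrative stand-ins.
    After the first free slot is found, the loop keeps looking for more
    slots to batch, but gives up once a fixed latency budget is spent:

      /*
       * Minimal user-space sketch of bounded extra scanning (not the
       * actual scan_swap_map_slots()): after the first free slot is
       * found, keep scanning to batch more slots, but stop once a fixed
       * latency budget is spent.  All names are illustrative stand-ins.
       */
      #include <stddef.h>
      #include <stdio.h>

      #define SWAP_MAP_SIZE 1024
      #define LATENCY_LIMIT 256   /* max slots examined after first hit */
      #define BATCH_SIZE    8     /* slots we would like per call */

      static unsigned char swap_map[SWAP_MAP_SIZE]; /* 0 == free slot */

      /* Scan for up to 'want' free slots; return how many were found. */
      static size_t scan_free_slots(size_t *out, size_t want)
      {
              long budget = LATENCY_LIMIT;
              size_t found = 0;

              for (size_t off = 0; off < SWAP_MAP_SIZE; off++) {
                      if (swap_map[off] == 0) {
                              out[found++] = off;
                              if (found == want)
                                      break;
                      }
                      /*
                       * Once we have at least one slot, any further
                       * scanning is strictly bounded: give up as soon as
                       * the latency budget runs out.
                       */
                      if (found && --budget <= 0)
                              break;
              }
              return found;
      }

      int main(void)
      {
              size_t slots[BATCH_SIZE];

              /* Fragment the map: mark most slots as in use. */
              for (size_t i = 0; i < SWAP_MAP_SIZE; i++)
                      swap_map[i] = (i % 97 != 0);

              size_t n = scan_free_slots(slots, BATCH_SIZE);
              printf("batched %zu slot(s)\n", n);
              return 0;
      }

    Without the budget check, a nearly-full device could force a scan of
    the whole swap map for every extra slot; the cap keeps the worst-case
    added latency fixed regardless of how fragmented the device is.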
    
    To test the patch, we ran a 16-process pmbench memory benchmark on a
    2-socket server machine with 48 cores.  Multiple RAM disks were
    configured as the swap devices.  The pmbench working-set size is much
    larger than the available memory, so that swapping is triggered.  The
    memory read/write ratio is 80/20 and the access pattern is random, so
    the swap space becomes highly fragmented during the test.  In the
    original implementation, the contention on the swap-related locks is
    very heavy.  The perf profiling data for the lock contention code
    paths is as follows,
    
     _raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap:             21.03
     _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:    1.92
     _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:      1.72
     _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:       0.69
    
    After applying this patch, it becomes,
    
     _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:    4.89
     _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:      3.85
     _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:       1.1
     _raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.do_swap_page: 0.88
    
    That is, the lock contention on the swap locks is eliminated.
    
    The pmbench score increases by 37.1%.  Swap-in throughput increases by
    45.7%, from 2.02 GB/s to 2.94 GB/s, and swap-out throughput increases
    by 45.3%, from 2.04 GB/s to 2.97 GB/s.

    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Hugh Dickins <hughd@google.com>
    Link: http://lkml.kernel.org/r/20200427030023.264780-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>