    xfs: make inode reclaim almost non-blocking · 993f951f
    Dave Chinner authored
    Now that dirty inode writeback doesn't cause read-modify-write
    cycles on the inode cluster buffer under memory pressure, the need
    to throttle memory reclaim to the rate at which we can clean dirty
    inodes goes away, because we no longer thrash inode cluster buffers
    under memory pressure to clean dirty inodes.
    
    This means inode writeback no longer stalls on memory allocation
    or read IO, and hence can be done asynchronously without generating
    memory pressure. As a result, blocking inode writeback in reclaim is
    no longer necessary to prevent reclaim priority windup as cleaning
    dirty inodes is no longer dependent on having memory reserves
    available for the filesystem to make progress reclaiming inodes.
    
    Hence we can convert inode reclaim to be non-blocking for shrinker
    callouts, both for direct reclaim and kswapd.
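
    To illustrate the shape of that conversion, here is a simplified
    sketch of the shrinker path built around the xfs_icache.c helpers of
    this era (xfs_reclaim_inodes_nr(), xfs_reclaim_inodes_ag(),
    SYNC_TRYLOCK and SYNC_WAIT). It shows the idea, not the literal diff:
    kick background reclaim and the AIL so IO gets going, then do a
    trylock-only scan that never waits.

        /*
         * Sketch only, not the exact code: the shrinker callout stops
         * passing SYNC_WAIT, so the per-AG reclaim walk skips any inode
         * it cannot lock and clean immediately instead of blocking on
         * buffer locks or IO completion.
         */
        int
        xfs_reclaim_inodes_nr(
                struct xfs_mount        *mp,
                int                     nr_to_scan)
        {
                /* kick background reclaim and push the AIL to start inode IO */
                xfs_reclaim_work_queue(mp);
                xfs_ail_push_all(&mp->m_ail);

                /* trylock-only scan: the shrinker itself never blocks */
                xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
                return 0;
        }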
    
    On a vanilla kernel, running a 16-way fsmark create workload on a
    4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
    userspace mlock(). The OOM killer gets invoked at 15GB of
    pinned RAM.
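
    For reference, the userspace pinning used here is nothing more
    elaborate than mmap() plus mlock(). A minimal stand-in for such a
    pinning tool (my illustration, not the actual test harness behind
    the numbers above; it needs CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK)
    looks like this:

        /*
         * pin_ram.c: pin a given amount of anonymous memory with mlock()
         * and hold it until killed. Build and run with, for example:
         *
         *      gcc -O2 -o pin_ram pin_ram.c
         *      ./pin_ram 15104         # argument is MiB, ~14.75GB
         */
        #define _GNU_SOURCE
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/mman.h>

        int main(int argc, char **argv)
        {
                size_t mib = argc > 1 ? strtoull(argv[1], NULL, 0) : 1024;
                size_t len = mib << 20;

                /* MAP_POPULATE pre-faults the range; mlock() then pins it */
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                               -1, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                if (mlock(p, len)) {
                        perror("mlock");
                        return 1;
                }
                memset(p, 0xa5, len);   /* dirty every page */
                printf("pinned %zu MiB, sleeping until killed\n", mib);
                pause();
                return 0;
        }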
    
    Without the inode cluster buffer pinning, this non-blocking reclaim
    patch triggers premature OOM killer invocation under the same mlock()
    memory pinning load, sometimes with as much as 45% of RAM still free.
    It's trivially easy to trigger the OOM killer when reclaim does not
    block.
    
    With the inode cluster buffers pinned in RAM and this patch added, I
    can reliably pin 14.5GB of RAM and still have the fsmark workload run
    to completion. The OOM killer gets invoked at 14.75GB of pinned RAM,
    which is only slightly less than the vanilla kernel, and the result
    is much more reliable than async reclaim alone.
    
    simoops shows that allocation stalls go away when async reclaim is
    used. Vanilla kernel:
    
    Run time: 1924 seconds
    Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
    Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
    Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
    work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
    alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
    
    With inode cluster pinning and async reclaim:
    
    Run time: 1924 seconds
    Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
    Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
    Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
    work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
    alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
    
    Latencies don't really change much, nor does the work rate. However,
    allocation almost never stalls with these changes, whilst the
    vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
    period. This difference is due to inode reclaim being largely
    non-blocking now.
    
    IOWs, once we have pinned inode cluster buffers, we can make inode
    reclaim non-blocking without a major risk of premature and/or
    spurious OOM killer invocation, and without any changes to memory
    reclaim infrastructure.
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Amir Goldstein <amir73il@gmail.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>