    mm: ext4 livelock during OOM
    I can easily reproduce a livelock on a 2GB guest by running the
    following trinity loads in parallel:
    
    	./trinity -X -c remap_anon_pages -q
    	./trinity -X -c userfaultfd -q
    
    The last OOM killer invocation selects this task:
    
    Out of memory: Kill process 6537 (trinity-c6) score 106 or sacrifice child
    Killed process 6537 (trinity-c6) total-vm:414772kB, anon-rss:186744kB, file-rss:560kB
    
    Shortly afterwards the victim task is detected by the hung task
    detector as being in uninterruptible state for too long:
    
    INFO: task trinity-c6:6537 blocked for more than 120 seconds.
          Not tainted 3.16.0-rc1+ #4
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    trinity-c6      D ffff88004ec37b50 11080  6537   5530 0x00100004
     ffff88004ec37aa8 0000000000000082 0000000000000000 ffff880039174910
     ffff88004ec37fd8 0000000000004000 ffff88007c8de3d0 ffff880039174910
     0000000000000000 0000000000000000 0000000000000000 0000000000000000
    Call Trace:
     [<ffffffff8167c759>] ? schedule+0x29/0x70
     [<ffffffff8112ae42>] ? __delayacct_blkio_start+0x22/0x30
     [<ffffffff8116e290>] ? __lock_page+0x70/0x70
     [<ffffffff8167c759>] schedule+0x29/0x70
     [<ffffffff8167c82f>] io_schedule+0x8f/0xd0
     [<ffffffff8116e29e>] sleep_on_page+0xe/0x20
     [<ffffffff8167cd33>] __wait_on_bit_lock+0x73/0xb0
     [<ffffffff8116e287>] __lock_page+0x67/0x70
     [<ffffffff810c34d0>] ? wake_atomic_t_function+0x40/0x40
     [<ffffffff8116f125>] pagecache_get_page+0x165/0x1f0
     [<ffffffff8116f3d4>] grab_cache_page_write_begin+0x34/0x50
     [<ffffffff81268f82>] ext4_da_write_begin+0x92/0x380
     [<ffffffff8116d717>] generic_perform_write+0xc7/0x1d0
     [<ffffffff8116fee3>] __generic_file_write_iter+0x173/0x350
     [<ffffffff8125e6ad>] ext4_file_write_iter+0x10d/0x3c0
     [<ffffffff811db72b>] ? vfs_write+0x1bb/0x1f0
     [<ffffffff811da581>] new_sync_write+0x81/0xb0
     [<ffffffff811db62f>] vfs_write+0xbf/0x1f0
     [<ffffffff811dbb72>] SyS_write+0x52/0xc0
     [<ffffffff816816d2>] system_call_fastpath+0x16/0x1b
    3 locks held by trinity-c6/6537:
     #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811fc0ee>] __fdget_pos+0x3e/0x50
     #1:  (sb_writers#3){.+.+.+}, at: [<ffffffff811db72b>] vfs_write+0x1bb/0x1f0
     #2:  (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8125e62d>] ext4_file_write_iter+0x8d/0x3c0
    
    The task holding the page lock is likely the one below, which never
    returns from __alloc_pages_slowpath.
    
    ck_union>, high_zoneidx=ZONE_NORMAL, zonelist=0xffff88007fffc100, order=0x0, gfp_mask=0x50) at mm/page_alloc.c:2661
    ion>) at mm/page_alloc.c:2821
    0, radix_gfp_mask=0x50) at mm/filemap.c:1096
    emap.h:336
     at fs/ext4/mballoc.c:4442
    50, flags=0x25) at fs/ext4/extents.c:4453
    flags=0x25) at fs/ext4/inode.c:648
    64
    
    (full stack trace of the !__GFP_FS allocation in the kernel thread
    holding the page lock below)
    
    gfp_mask in __alloc_pages_slowpath is 0x50, i.e.
    ___GFP_IO|___GFP_WAIT. ext4_writepages, running from a kworker kernel
    thread, is holding the page lock that the OOM victim task is waiting
    on.
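
    For reference, these are the ___GFP_* bit values in this kernel's
    include/linux/gfp.h (3.16-era values) that decompose the masks seen
    above and in the gdb session further below:

    	/* include/linux/gfp.h, 3.16-era values */
    	#define ___GFP_WAIT	0x10u
    	#define ___GFP_IO	0x40u
    	#define ___GFP_FS	0x80u
    	#define ___GFP_NORETRY	0x1000u

    	/* 0x50   == ___GFP_WAIT | ___GFP_IO, with ___GFP_FS clear      */
    	/* 0x1050 == 0x50 | ___GFP_NORETRY, the mask set later from gdb */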
    
    If alloc_pages returned NULL the whole livelock would resolve itself:
    -ENOMEM would be returned all the way up (ext4 thinks it can handle
    it; in reality it cannot, but that's for a later patch in this
    series), ext4_writepages would return and the kworker would try again
    later to flush the dirty pages in the dirty inodes.
    
    To verify this, I set a breakpoint in should_alloc_retry and added
    __GFP_NORETRY to the gfp_mask just before the __GFP_NORETRY check.
    
    gdb> b mm/page_alloc.c:2185
    Breakpoint 8 at 0xffffffff81179122: file mm/page_alloc.c, line 2185.
    gdb> c
    Continuing.
    [Switching to Thread 1]
    _______________________________________________________________________________
         eax:00000000 ebx:00000050  ecx:00000001  edx:00000001     eflags:00000206
         esi:0000196A edi:2963AA50  esp:4EDEF448  ebp:4EDEF568     eip:Error while running hook_stop:
    Value can't be converted to integer.
    
    Breakpoint 8, __alloc_pages_slowpath (migratetype=0x0, classzone_idx=0x1, preferred_zone=0xffff88007fffa840, nodemask=0x0 <irq_stack_union>, high_zoneidx=ZONE_NORMAL, zonelist=0xffff88007fffc100, order=0x0, gfp_mask=0x50) at mm/page_alloc.c:2713
    
    I set the breakpoint at line 2185 and it stopped at line 2713, but
    line 2713 is in the middle of a comment; I assume that's an addr2line
    imperfection and it's not relevant.
    
    Then I simply added __GFP_NORETRY to the gfp_mask in the stack:
    
    gdb> print gfp_mask
    $1 = 0x50
    gdb> set gfp_mask = 0x1050
    gdb> p gfp_mask
    $2 = 0x1050
    gdb> c
    Continuing.
    
    After that the livelock resolved itself immediately, the OOM victim
    quit and the workload continued without errors.
    
    The problem was probably introduced in commit 11e33f6a:
    
    	/*
    	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
    	 * means __GFP_NOFAIL, but that may not be true in other
    	 * implementations.
    	 */
    	if (order <= PAGE_ALLOC_COSTLY_ORDER)
    		return 1;
    
    Retrying forever and depending on the OOM killer to send a SIGKILL is
    only OK if the victim task isn't sleeping in uninterruptible state
    waiting, as in this case, for a kernel thread to release a page lock.
    
    In this case the kernel thread holding the lock would never be picked
    by the OOM killer in the first place, so this is an error path in
    ext4 that has probably never been exercised.
    
    I assume the objective of this implicit __GFP_NOFAIL behavior (which,
    unlike a real __GFP_NOFAIL that never fails, does fail the allocation
    if TIF_MEMDIE is set on the task) is to avoid spurious VM_FAULT_OOM
    and so make the OOM killer more reliable, but I don't think an
    almost-implicit __GFP_NOFAIL is safe in general. __GFP_NOFAIL is
    unsafe too, in fact, but at least it is very rarely used.
    
    For now we can start by letting the allocations that hold low-level
    filesystem locks (__GFP_FS clear) fail so they can release those
    locks. Those locks tend to be uninterruptible too.
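
    To make this concrete, here is a minimal sketch of how such a
    !__GFP_FS bail-out could look in should_alloc_retry; this only
    illustrates the idea (the surrounding checks are abbreviated from the
    3.16-era mm/page_alloc.c) and is not necessarily the exact patch:

    	static inline int
    	should_alloc_retry(gfp_t gfp_mask, unsigned int order,
    			   unsigned long did_some_progress,
    			   unsigned long pages_reclaimed)
    	{
    		/* The caller explicitly asked not to retry. */
    		if (gfp_mask & __GFP_NORETRY)
    			return 0;

    		/* The caller explicitly asked to retry forever. */
    		if (gfp_mask & __GFP_NOFAIL)
    			return 1;

    		/*
    		 * Idea of this patch: a !__GFP_FS caller may hold low-level
    		 * filesystem locks (like the page lock ext4_writepages holds
    		 * here), so let it fail and release them instead of retrying
    		 * forever against the OOM victim waiting on those locks.
    		 */
    		if (!(gfp_mask & __GFP_FS))
    			return 0;

    		/*
    		 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
    		 * means __GFP_NOFAIL, but that may not be true in other
    		 * implementations.
    		 */
    		if (order <= PAGE_ALLOC_COSTLY_ORDER)
    			return 1;

    		return 0;
    	}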
    
    This will reduce the scope of the problem, but I would rather drop
    the entire check quoted above from should_alloc_retry. If a kernel
    thread were to use_mm(), take the mmap_sem for writing, and then run
    a GFP_KERNEL allocation that invokes the OOM killer to kill a process
    waiting in down_read in __do_page_fault, the same problem would
    emerge.
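
    As a purely hypothetical illustration (this kernel thread does not
    exist in the tree; the name and sequence are made up), that scenario
    would look roughly like this:

    	/* Hypothetical kernel-thread body, for illustration only. */
    	static int hypothetical_worker(void *data)
    	{
    		struct mm_struct *mm = data;	/* some user task's mm */
    		struct page *page;

    		use_mm(mm);
    		down_write(&mm->mmap_sem);

    		/*
    		 * If this allocation invokes the OOM killer and the victim
    		 * is blocked in down_read(&mm->mmap_sem) in __do_page_fault,
    		 * the victim cannot exit and free its memory, and because an
    		 * order-0 GFP_KERNEL allocation retries forever instead of
    		 * returning NULL, mmap_sem is never released either: the
    		 * same livelock as with the ext4 page lock.
    		 */
    		page = alloc_pages(GFP_KERNEL, 0);
    		if (page)
    			__free_page(page);

    		up_write(&mm->mmap_sem);
    		unuse_mm(mm);
    		return 0;
    	}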
    
    In short it's a tradeoff between the accuracy of the OOM killer (not
    causing spurious allocation failures in addition to killing the task)
    and the risk of livelock.
    
    Furthermore, the longer we hold off on this change, the more likely
    it is that the ext4 error paths for failed allocations in kernel
    threads (which are never picked by the OOM killer, so never get
    TIF_MEMDIE set and never see alloc_pages fail once in a while) will
    never be exercised.
    
    After the fix, the identical trinity load that reliably reproduced
    the problem completes, and in addition to the OOM killer info I get
    the following in the kernel log, as expected (before the fix I would
    hit the livelock instead of the allocation failure below).
    
    kworker/u16:0: page allocation failure: order:0, mode:0x50
    CPU: 2 PID: 6006 Comm: kworker/u16:0 Tainted: G        W     3.16.0-rc1+ #6
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Workqueue: writeback bdi_writeback_workfn (flush-254:0)
     0000000000000000 ffff880006d3f408 ffffffff81679b51 ffff88007fc8f068
     0000000000000050 ffff880006d3f498 ffffffff811748c2 0000000000000010
     ffffffffffffffff ffff88007fffc128 ffffffff810d0bed ffff880006d3f468
    Call Trace:
     [<ffffffff81679b51>] dump_stack+0x4e/0x68
     [<ffffffff811748c2>] warn_alloc_failed+0xe2/0x130
     [<ffffffff810d0bed>] ? trace_hardirqs_on+0xd/0x10
     [<ffffffff811791d0>] __alloc_pages_nodemask+0x910/0xb10
     [<ffffffff811becab>] alloc_pages_current+0x8b/0x120
     [<ffffffff8116e450>] __page_cache_alloc+0x10/0x20
     [<ffffffff8116f03c>] pagecache_get_page+0x7c/0x1f0
     [<ffffffff81299004>] ext4_mb_load_buddy+0x274/0x3b0
     [<ffffffff8129a3f2>] ext4_mb_regular_allocator+0x1e2/0x480
     [<ffffffff81296931>] ? ext4_mb_use_preallocated+0x31/0x600
     [<ffffffff8129de28>] ext4_mb_new_blocks+0x568/0x7f0
     [<ffffffff81290fd3>] ext4_ext_map_blocks+0x683/0x1970
     [<ffffffff81265688>] ext4_map_blocks+0x168/0x4d0
     [<ffffffff8126af77>] ext4_writepages+0x6e7/0x1030
     [<ffffffff8117ce81>] do_writepages+0x21/0x50
     [<ffffffff812096c0>] __writeback_single_inode+0x40/0x550
     [<ffffffff8120b311>] writeback_sb_inodes+0x281/0x560
     [<ffffffff8120b68f>] __writeback_inodes_wb+0x9f/0xd0
     [<ffffffff8120b94b>] wb_writeback+0x28b/0x510
     [<ffffffff8120befc>] bdi_writeback_workfn+0x11c/0x6a0
     [<ffffffff81092c8b>] ? process_one_work+0x15b/0x620
     [<ffffffff81092cf5>] process_one_work+0x1c5/0x620
     [<ffffffff81092c8b>] ? process_one_work+0x15b/0x620
     [<ffffffff8109376b>] worker_thread+0x11b/0x4f0
     [<ffffffff81093650>] ? init_pwq+0x190/0x190
     [<ffffffff8109ba54>] kthread+0xe4/0x100
     [<ffffffff8109b970>] ? __init_kthread_worker+0x70/0x70
     [<ffffffff8168162c>] ret_from_fork+0x7c/0xb0
    Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>