mm: ext4 livelock during OOM
I can easily reproduce a livelock with some trinity load on a 2GB guest,
running in parallel:

	./trinity -X -c remap_anon_pages -q
	./trinity -X -c userfaultfd -q

The last OOM killer invocation selects this task:

Out of memory: Kill process 6537 (trinity-c6) score 106 or sacrifice child
Killed process 6537 (trinity-c6) total-vm:414772kB, anon-rss:186744kB, file-rss:560kB

Shortly afterwards the victim task is detected by the hangcheck timer as
having been stuck in uninterruptible state for too long:

INFO: task trinity-c6:6537 blocked for more than 120 seconds.
      Not tainted 3.16.0-rc1+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
trinity-c6      D ffff88004ec37b50 11080  6537   5530 0x00100004
 ffff88004ec37aa8 0000000000000082 0000000000000000 ffff880039174910
 ffff88004ec37fd8 0000000000004000 ffff88007c8de3d0 ffff880039174910
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8167c759>] ? schedule+0x29/0x70
 [<ffffffff8112ae42>] ? __delayacct_blkio_start+0x22/0x30
 [<ffffffff8116e290>] ? __lock_page+0x70/0x70
 [<ffffffff8167c759>] schedule+0x29/0x70
 [<ffffffff8167c82f>] io_schedule+0x8f/0xd0
 [<ffffffff8116e29e>] sleep_on_page+0xe/0x20
 [<ffffffff8167cd33>] __wait_on_bit_lock+0x73/0xb0
 [<ffffffff8116e287>] __lock_page+0x67/0x70
 [<ffffffff810c34d0>] ? wake_atomic_t_function+0x40/0x40
 [<ffffffff8116f125>] pagecache_get_page+0x165/0x1f0
 [<ffffffff8116f3d4>] grab_cache_page_write_begin+0x34/0x50
 [<ffffffff81268f82>] ext4_da_write_begin+0x92/0x380
 [<ffffffff8116d717>] generic_perform_write+0xc7/0x1d0
 [<ffffffff8116fee3>] __generic_file_write_iter+0x173/0x350
 [<ffffffff8125e6ad>] ext4_file_write_iter+0x10d/0x3c0
 [<ffffffff811db72b>] ? vfs_write+0x1bb/0x1f0
 [<ffffffff811da581>] new_sync_write+0x81/0xb0
 [<ffffffff811db62f>] vfs_write+0xbf/0x1f0
 [<ffffffff811dbb72>] SyS_write+0x52/0xc0
 [<ffffffff816816d2>] system_call_fastpath+0x16/0x1b
3 locks held by trinity-c6/6537:
 #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811fc0ee>] __fdget_pos+0x3e/0x50
 #1:  (sb_writers#3){.+.+.+}, at: [<ffffffff811db72b>] vfs_write+0x1bb/0x1f0
 #2:  (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8125e62d>] ext4_file_write_iter+0x8d/0x3c0

The task that holds the page lock is likely the one below, which never
returns from __alloc_pages_slowpath (the gdb backtrace is truncated on
the left):

 ck_union>, high_zoneidx=ZONE_NORMAL, zonelist=0xffff88007fffc100, order=0x0, gfp_mask=0x50) at mm/page_alloc.c:2661
 ion>) at mm/page_alloc.c:2821
 0, radix_gfp_mask=0x50) at mm/filemap.c:1096
 emap.h:336
 at fs/ext4/mballoc.c:4442
 50, flags=0x25) at fs/ext4/extents.c:4453
 flags=0x25) at fs/ext4/inode.c:648
 64

(the full stack trace of the !__GFP_FS allocation in the kernel thread
holding the page lock is below)

gfp_mask in __alloc_pages_slowpath is 0x50, i.e. ___GFP_IO|___GFP_WAIT
with __GFP_FS clear. ext4_writepages, run from a kworker kernel thread,
is holding the page lock that the OOM victim task is waiting on.

If alloc_pages returned NULL the whole livelock would resolve itself:
-ENOMEM would be returned all the way up (ext4 thinks it can handle it,
in reality it cannot, but that's for a later patch in this series),
ext4_writepages would return, and the kworker would try again later to
flush the dirty pages of the dirty inodes.
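As a reference for the 0x50 mask above and for the gdb session below,
these are what I believe are the 3.16-era gfp flag definitions from
include/linux/gfp.h (double check them against the exact tree being
tested):

	#define ___GFP_WAIT	0x10u
	#define ___GFP_IO	0x40u
	#define ___GFP_FS	0x80u
	#define ___GFP_NORETRY	0x1000u

	#define GFP_NOFS	(__GFP_WAIT | __GFP_IO)

	/*
	 * 0x50   == ___GFP_WAIT|___GFP_IO == GFP_NOFS, __GFP_FS clear
	 * 0x1050 == GFP_NOFS|___GFP_NORETRY, the value set from gdb below
	 */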
To verify, I set a breakpoint in should_alloc_retry and added
__GFP_NORETRY to the gfp_mask just before the __GFP_NORETRY check.

gdb> b mm/page_alloc.c:2185
Breakpoint 8 at 0xffffffff81179122: file mm/page_alloc.c, line 2185.
gdb> c
Continuing.
[Switching to Thread 1]
_______________________________________________________________________________
 eax:00000000 ebx:00000050 ecx:00000001 edx:00000001 eflags:00000206
 esi:0000196A edi:2963AA50 esp:4EDEF448 ebp:4EDEF568
 eip:Error while running hook_stop: Value can't be converted to integer.
Breakpoint 8, __alloc_pages_slowpath (migratetype=0x0, classzone_idx=0x1, preferred_zone=0xffff88007fffa840, nodemask=0x0 <irq_stack_union>, high_zoneidx=ZONE_NORMAL, zonelist=0xffff88007fffc100, order=0x0, gfp_mask=0x50) at mm/page_alloc.c:2713

I set the breakpoint at line 2185 and it stopped at 2713, but 2713 is in
the middle of a comment; I assume that's an addr2line imperfection and
it's not relevant. Then I simply added __GFP_NORETRY to the gfp_mask on
the stack:

gdb> print gfp_mask
$1 = 0x50
gdb> set gfp_mask = 0x1050
gdb> p gfp_mask
$2 = 0x1050
gdb> c
Continuing.

After that the livelock resolved itself immediately: the OOM victim quit
and the workload continued without errors.

The problem was probably introduced in commit 11e33f6a:

	/*
	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
	 * means __GFP_NOFAIL, but that may not be true in other
	 * implementations.
	 */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;

Retrying forever and depending on the OOM killer to send a SIGKILL is
only ok if the victim task isn't sleeping in uninterruptible state
waiting, as in this case, for a kernel thread to release a page lock.
The kernel thread holding the lock would never be picked by the OOM
killer in the first place, so this is an error path in ext4 that has
probably never been exercised.

The objective of the implicit __GFP_NOFAIL behavior (which, unlike a
real __GFP_NOFAIL that never fails, still lets the allocation fail if
TIF_MEMDIE is set on the task) I assume is to avoid spurious
VM_FAULT_OOM and to make the OOM killer more reliable, but I don't think
an almost-implicit __GFP_NOFAIL is safe in general. __GFP_NOFAIL is in
fact unsafe too, but at least it's very rarely used.

For now we can start by letting the allocations that hold lowlevel
filesystem locks (__GFP_FS clear) fail, so they can release those locks
(a sketch of this variant is below). Those locks tend to be waited on in
uninterruptible state too. This will reduce the scope of the problem,
but I'd prefer to drop the entire check quoted above from
should_alloc_retry.

If a kernel thread were to use_mm(), then take the mmap_sem for writing,
and then run a GFP_KERNEL allocation that invokes the OOM killer to kill
a process waiting in down_read in __do_page_fault, the same problem
would emerge.

In short it's a tradeoff between the accuracy of the OOM killer (not
causing spurious allocation failures in addition to killing the task)
and the risk of livelock. Furthermore, the longer we hold off on this
change, the longer the error paths for the ext4 allocations done by
kernel threads (which are never picked by the OOM killer, and so never
get TIF_MEMDIE set to let alloc_pages fail once in a while) will remain
unexercised.
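As a sketch only, assuming the series gates the check quoted above on
__GFP_FS (the actual patch may implement it differently), the change in
should_alloc_retry in mm/page_alloc.c would look roughly like:

	/*
	 * Keep the implicit __GFP_NOFAIL behavior for order <=
	 * PAGE_ALLOC_COSTLY_ORDER only for allocations that can call
	 * into the filesystem. !__GFP_FS allocations may hold lowlevel
	 * filesystem locks (e.g. a page lock from ext4_writepages), so
	 * let them fail and release those locks instead of looping
	 * forever.
	 */
	if (order <= PAGE_ALLOC_COSTLY_ORDER && (gfp_mask & __GFP_FS))
		return 1;

With that, the GFP_NOFS (0x50) allocation in the writeback kworker fails
instead of retrying forever, which is the same behavior forced by hand
with gdb above.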
After the fix, the identical trinity load that reliably reproduces the
problem completes, and in addition to the OOM killer info I get, as
expected, the allocation failure below in the kernel log (without the
fix I'd get the livelock instead of this allocation failure):

kworker/u16:0: page allocation failure: order:0, mode:0x50
CPU: 2 PID: 6006 Comm: kworker/u16:0 Tainted: G        W     3.16.0-rc1+ #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: writeback bdi_writeback_workfn (flush-254:0)
 0000000000000000 ffff880006d3f408 ffffffff81679b51 ffff88007fc8f068
 0000000000000050 ffff880006d3f498 ffffffff811748c2 0000000000000010
 ffffffffffffffff ffff88007fffc128 ffffffff810d0bed ffff880006d3f468
Call Trace:
 [<ffffffff81679b51>] dump_stack+0x4e/0x68
 [<ffffffff811748c2>] warn_alloc_failed+0xe2/0x130
 [<ffffffff810d0bed>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff811791d0>] __alloc_pages_nodemask+0x910/0xb10
 [<ffffffff811becab>] alloc_pages_current+0x8b/0x120
 [<ffffffff8116e450>] __page_cache_alloc+0x10/0x20
 [<ffffffff8116f03c>] pagecache_get_page+0x7c/0x1f0
 [<ffffffff81299004>] ext4_mb_load_buddy+0x274/0x3b0
 [<ffffffff8129a3f2>] ext4_mb_regular_allocator+0x1e2/0x480
 [<ffffffff81296931>] ? ext4_mb_use_preallocated+0x31/0x600
 [<ffffffff8129de28>] ext4_mb_new_blocks+0x568/0x7f0
 [<ffffffff81290fd3>] ext4_ext_map_blocks+0x683/0x1970
 [<ffffffff81265688>] ext4_map_blocks+0x168/0x4d0
 [<ffffffff8126af77>] ext4_writepages+0x6e7/0x1030
 [<ffffffff8117ce81>] do_writepages+0x21/0x50
 [<ffffffff812096c0>] __writeback_single_inode+0x40/0x550
 [<ffffffff8120b311>] writeback_sb_inodes+0x281/0x560
 [<ffffffff8120b68f>] __writeback_inodes_wb+0x9f/0xd0
 [<ffffffff8120b94b>] wb_writeback+0x28b/0x510
 [<ffffffff8120befc>] bdi_writeback_workfn+0x11c/0x6a0
 [<ffffffff81092c8b>] ? process_one_work+0x15b/0x620
 [<ffffffff81092cf5>] process_one_work+0x1c5/0x620
 [<ffffffff81092c8b>] ? process_one_work+0x15b/0x620
 [<ffffffff8109376b>] worker_thread+0x11b/0x4f0
 [<ffffffff81093650>] ? init_pwq+0x190/0x190
 [<ffffffff8109ba54>] kthread+0xe4/0x100
 [<ffffffff8109b970>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff8168162c>] ret_from_fork+0x7c/0xb0

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>