Commit 7636db0a authored by Andrea Arcangeli

mm: ext4 livelock during OOM

I can easily reproduce a livelock with some trinity load on a 2GB
guest running in parallel:

	./trinity -X -c remap_anon_pages -q
	./trinity -X -c userfaultfd -q

The last OOM killer invocation selects this task:

Out of memory: Kill process 6537 (trinity-c6) score 106 or sacrifice child
Killed process 6537 (trinity-c6) total-vm:414772kB, anon-rss:186744kB, file-rss:560kB

Shortly afterwards the victim task is detected stuck in uninterruptible
state for too long by the hung task detector:

INFO: task trinity-c6:6537 blocked for more than 120 seconds.
      Not tainted 3.16.0-rc1+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
trinity-c6      D ffff88004ec37b50 11080  6537   5530 0x00100004
 ffff88004ec37aa8 0000000000000082 0000000000000000 ffff880039174910
 ffff88004ec37fd8 0000000000004000 ffff88007c8de3d0 ffff880039174910
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8167c759>] ? schedule+0x29/0x70
 [<ffffffff8112ae42>] ? __delayacct_blkio_start+0x22/0x30
 [<ffffffff8116e290>] ? __lock_page+0x70/0x70
 [<ffffffff8167c759>] schedule+0x29/0x70
 [<ffffffff8167c82f>] io_schedule+0x8f/0xd0
 [<ffffffff8116e29e>] sleep_on_page+0xe/0x20
 [<ffffffff8167cd33>] __wait_on_bit_lock+0x73/0xb0
 [<ffffffff8116e287>] __lock_page+0x67/0x70
 [<ffffffff810c34d0>] ? wake_atomic_t_function+0x40/0x40
 [<ffffffff8116f125>] pagecache_get_page+0x165/0x1f0
 [<ffffffff8116f3d4>] grab_cache_page_write_begin+0x34/0x50
 [<ffffffff81268f82>] ext4_da_write_begin+0x92/0x380
 [<ffffffff8116d717>] generic_perform_write+0xc7/0x1d0
 [<ffffffff8116fee3>] __generic_file_write_iter+0x173/0x350
 [<ffffffff8125e6ad>] ext4_file_write_iter+0x10d/0x3c0
 [<ffffffff811db72b>] ? vfs_write+0x1bb/0x1f0
 [<ffffffff811da581>] new_sync_write+0x81/0xb0
 [<ffffffff811db62f>] vfs_write+0xbf/0x1f0
 [<ffffffff811dbb72>] SyS_write+0x52/0xc0
 [<ffffffff816816d2>] system_call_fastpath+0x16/0x1b
3 locks held by trinity-c6/6537:
 #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811fc0ee>] __fdget_pos+0x3e/0x50
 #1:  (sb_writers#3){.+.+.+}, at: [<ffffffff811db72b>] vfs_write+0x1bb/0x1f0
 #2:  (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8125e62d>] ext4_file_write_iter+0x8d/0x3c0

The task that holds the page lock is likely the below one that never
returns from __alloc_pages_slowpath.

(gdb backtrace, left-truncated in this copy: __alloc_pages_slowpath with
order=0x0, gfp_mask=0x50 at mm/page_alloc.c:2661, called via
mm/page_alloc.c:2821, mm/filemap.c:1096, pagemap.h:336,
fs/ext4/mballoc.c:4442, fs/ext4/extents.c:4453 and fs/ext4/inode.c:648)

(full stack trace of the !__GFP_FS allocation in the kernel thread
holding the page lock below)

gfp_mask in __alloc_pages_slowpath is 0x50, i.e.
___GFP_IO|___GFP_WAIT, with __GFP_FS clear. ext4_writepages, run from
a kworker kernel thread, is holding the page lock that the OOM victim
task is waiting on.

If alloc_pages returned NULL the whole livelock would resolve itself
(-ENOMEM would be returned all the way up, ext4 thinks it can handle
it, in reality it cannot but that's for a later patch in this series).
ext4_writepages would return and the kworker would try again later to
flush the dirty pages in the dirty inodes.

To verify this, I set a breakpoint in should_alloc_retry and added
__GFP_NORETRY to the gfp_mask just before the __GFP_NORETRY check.

gdb> b mm/page_alloc.c:2185
Breakpoint 8 at 0xffffffff81179122: file mm/page_alloc.c, line 2185.
gdb> c
Continuing.
[Switching to Thread 1]
_______________________________________________________________________________
     eax:00000000 ebx:00000050  ecx:00000001  edx:00000001     eflags:00000206
     esi:0000196A edi:2963AA50  esp:4EDEF448  ebp:4EDEF568     eip:Error while running hook_stop:
Value can't be converted to integer.

Breakpoint 8, __alloc_pages_slowpath (migratetype=0x0, classzone_idx=0x1, preferred_zone=0xffff88007fffa840, nodemask=0x0 <irq_stack_union>, high_zoneidx=ZONE_NORMAL, zonelist=0xffff88007fffc100, order=0x0, gfp_mask=0x50) at mm/page_alloc.c:2713

I set the breakpoint at line 2185 and it stopped at 2713, but 2713 is
in the middle of a comment; I assume that's an addr2line imperfection
and it's not relevant.

Then I simply added __GFP_NORETRY to the gfp_mask in the stack:

gdb> print gfp_mask
$1 = 0x50
gdb> set gfp_mask = 0x1050
gdb> p gfp_mask
$2 = 0x1050
gdb> c
Continuing.

After that the livelock resolved itself immediately, the OOM victim
quit and the workload continued without errors.

The problem was probably introduced in commit 11e33f6a.

	/*
	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
	 * means __GFP_NOFAIL, but that may not be true in other
	 * implementations.
	 */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;

Retrying forever and depending on the OOM killer to send a SIGKILL is
only ok if the victim task isn't sleeping in uninterruptible state,
waiting (as in this case) for a kernel thread to release a page lock.

In this case the kernel thread holding the lock would never be picked
by the OOM killer in the first place, so this is an error path in ext4
that has probably never been exercised.

I assume the objective of the implicit __GFP_NOFAIL behavior (which,
unlike a real __GFP_NOFAIL, fails the allocation if TIF_MEMDIE is set
on the task) is to avoid spurious VM_FAULT_OOM and so make the OOM
killer more reliable, but I don't think an almost-implicit
__GFP_NOFAIL is safe in general. __GFP_NOFAIL is unsafe too, in fact,
but at least it's very rarely used.

For now we can start by letting the allocations that hold lowlevel
filesystem locks (__GFP_FS clear) fail so they can release those
locks. Those locks tend to be uninterruptible too.

This will reduce the scope of the problem, but I'd prefer to drop the
entire check quoted above in should_alloc_retry. If a kernel thread
were to use_mm(), take the mmap_sem for writing, and then run a
GFP_KERNEL allocation that invokes the OOM killer to kill a process
waiting in down_read in __do_page_fault, the same problem would
emerge.

In short it's a tradeoff between the accuracy of the OOM killer (not
causing spurious allocation failures in addition to killing the task)
and the risk of livelock.

Furthermore, the longer we hold off on this change, the more likely it
is that the failure paths of those ext4 allocations done by kernel
threads (which are never picked by the OOM killer, which would set
TIF_MEMDIE and let alloc_pages fail once in a while) remain forever
unexercised.

After the fix, the identical trinity load that reliably reproduced the
problem completes, and in addition to the OOM killer info I get this
in the kernel log as expected (without the fix, instead of the
allocation failure below I'd get the livelock earlier):

kworker/u16:0: page allocation failure: order:0, mode:0x50
CPU: 2 PID: 6006 Comm: kworker/u16:0 Tainted: G        W     3.16.0-rc1+ #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: writeback bdi_writeback_workfn (flush-254:0)
 0000000000000000 ffff880006d3f408 ffffffff81679b51 ffff88007fc8f068
 0000000000000050 ffff880006d3f498 ffffffff811748c2 0000000000000010
 ffffffffffffffff ffff88007fffc128 ffffffff810d0bed ffff880006d3f468
Call Trace:
 [<ffffffff81679b51>] dump_stack+0x4e/0x68
 [<ffffffff811748c2>] warn_alloc_failed+0xe2/0x130
 [<ffffffff810d0bed>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff811791d0>] __alloc_pages_nodemask+0x910/0xb10
 [<ffffffff811becab>] alloc_pages_current+0x8b/0x120
 [<ffffffff8116e450>] __page_cache_alloc+0x10/0x20
 [<ffffffff8116f03c>] pagecache_get_page+0x7c/0x1f0
 [<ffffffff81299004>] ext4_mb_load_buddy+0x274/0x3b0
 [<ffffffff8129a3f2>] ext4_mb_regular_allocator+0x1e2/0x480
 [<ffffffff81296931>] ? ext4_mb_use_preallocated+0x31/0x600
 [<ffffffff8129de28>] ext4_mb_new_blocks+0x568/0x7f0
 [<ffffffff81290fd3>] ext4_ext_map_blocks+0x683/0x1970
 [<ffffffff81265688>] ext4_map_blocks+0x168/0x4d0
 [<ffffffff8126af77>] ext4_writepages+0x6e7/0x1030
 [<ffffffff8117ce81>] do_writepages+0x21/0x50
 [<ffffffff812096c0>] __writeback_single_inode+0x40/0x550
 [<ffffffff8120b311>] writeback_sb_inodes+0x281/0x560
 [<ffffffff8120b68f>] __writeback_inodes_wb+0x9f/0xd0
 [<ffffffff8120b94b>] wb_writeback+0x28b/0x510
 [<ffffffff8120befc>] bdi_writeback_workfn+0x11c/0x6a0
 [<ffffffff81092c8b>] ? process_one_work+0x15b/0x620
 [<ffffffff81092cf5>] process_one_work+0x1c5/0x620
 [<ffffffff81092c8b>] ? process_one_work+0x15b/0x620
 [<ffffffff8109376b>] worker_thread+0x11b/0x4f0
 [<ffffffff81093650>] ? init_pwq+0x190/0x190
 [<ffffffff8109ba54>] kthread+0xe4/0x100
 [<ffffffff8109b970>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff8168162c>] ret_from_fork+0x7c/0xb0
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
parent fb684033
@@ -2344,11 +2344,22 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 		return 0;
 
 	/*
-	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
-	 * means __GFP_NOFAIL, but that may not be true in other
-	 * implementations.
+	 * In this implementation order <= PAGE_ALLOC_COSTLY_ORDER
+	 * &&__GFP_FS means __GFP_NOFAIL (until the OOM killer sets
+	 * TIF_MEMDIE, then it will fail unlike a real __GFP_NOFAIL),
+	 * but that may not be true in other implementations. This
+	 * behavior increases the accuracy of the OOM killer so that
+	 * there will not be spurious VM_FAULT_OOM errors or other
+	 * allocation failures until the OOM killed task releases its
+	 * memory. Not failing allocations that can enter the
+	 * filesystems (which means the holder is very likely holding
+	 * uninterruptible locks) poses a great risk of livelock (like
+	 * if ext4_writepages invoked by a kernel thread doing
+	 * writeback_sb_inodes is not allowed to fail while holding
+	 * the PG_lock and in turn preventing the OOM killer victim
+	 * task to exit).
 	 */
-	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+	if (order <= PAGE_ALLOC_COSTLY_ORDER && (gfp_mask & __GFP_FS))
 		return 1;
 	/*