    prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim · 8d19f1c8
    Mike Christie authored
    There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
    and nbd that have userspace components that can run in the IO path. For
    example, iscsi and nbd's userspace daemons may need to recreate a socket
    and/or send IO on it, and dm-multipath's daemon multipathd may need to
    send SG IO or read/write IO to figure out the state of paths and re-set
    them up.
    
    In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
    memalloc_*_save/restore functions to control the allocation behavior,
    but userspace has no equivalent, so a daemon can hit an allocation that
    ends up writing data back to the same device it is trying to allocate
    for. The device is then in a state of deadlock, because to execute IO
    the device needs to allocate memory, but to allocate memory the memory
    layers want to execute IO to the device.
    
    Here is an example with nbd using a local userspace daemon that performs
    network IO to a remote server. We are using XFS on top of the nbd device,
    but it can happen with any FS or other modules layered on top of the nbd
    device that can write out data to free memory. Here an nbd daemon helper
    thread, msgr-worker-1, is performing a write/sendmsg on a socket to execute
    a request. This kicks off a reclaim operation which results in a WRITE to
    the nbd device and the nbd thread calling back into the mm layer.
    
    [ 1626.609191] msgr-worker-1   D    0  1026      1 0x00004000
    [ 1626.609193] Call Trace:
    [ 1626.609195]  ? __schedule+0x29b/0x630
    [ 1626.609197]  ? wait_for_completion+0xe0/0x170
    [ 1626.609198]  schedule+0x30/0xb0
    [ 1626.609200]  schedule_timeout+0x1f6/0x2f0
    [ 1626.609202]  ? blk_finish_plug+0x21/0x2e
    [ 1626.609204]  ? _xfs_buf_ioapply+0x2e6/0x410
    [ 1626.609206]  ? wait_for_completion+0xe0/0x170
    [ 1626.609208]  wait_for_completion+0x108/0x170
    [ 1626.609210]  ? wake_up_q+0x70/0x70
    [ 1626.609212]  ? __xfs_buf_submit+0x12e/0x250
    [ 1626.609214]  ? xfs_bwrite+0x25/0x60
    [ 1626.609215]  xfs_buf_iowait+0x22/0xf0
    [ 1626.609218]  __xfs_buf_submit+0x12e/0x250
    [ 1626.609220]  xfs_bwrite+0x25/0x60
    [ 1626.609222]  xfs_reclaim_inode+0x2e8/0x310
    [ 1626.609224]  xfs_reclaim_inodes_ag+0x1b6/0x300
    [ 1626.609227]  xfs_reclaim_inodes_nr+0x31/0x40
    [ 1626.609228]  super_cache_scan+0x152/0x1a0
    [ 1626.609231]  do_shrink_slab+0x12c/0x2d0
    [ 1626.609233]  shrink_slab+0x9c/0x2a0
    [ 1626.609235]  shrink_node+0xd7/0x470
    [ 1626.609237]  do_try_to_free_pages+0xbf/0x380
    [ 1626.609240]  try_to_free_pages+0xd9/0x1f0
    [ 1626.609245]  __alloc_pages_slowpath+0x3a4/0xd30
    [ 1626.609251]  ? ___slab_alloc+0x238/0x560
    [ 1626.609254]  __alloc_pages_nodemask+0x30c/0x350
    [ 1626.609259]  skb_page_frag_refill+0x97/0xd0
    [ 1626.609274]  sk_page_frag_refill+0x1d/0x80
    [ 1626.609279]  tcp_sendmsg_locked+0x2bb/0xdd0
    [ 1626.609304]  tcp_sendmsg+0x27/0x40
    [ 1626.609307]  sock_sendmsg+0x54/0x60
    [ 1626.609308]  ___sys_sendmsg+0x29f/0x320
    [ 1626.609313]  ? sock_poll+0x66/0xb0
    [ 1626.609318]  ? ep_item_poll.isra.15+0x40/0xc0
    [ 1626.609320]  ? ep_send_events_proc+0xe6/0x230
    [ 1626.609322]  ? hrtimer_try_to_cancel+0x54/0xf0
    [ 1626.609324]  ? ep_read_events_proc+0xc0/0xc0
    [ 1626.609326]  ? _raw_write_unlock_irq+0xa/0x20
    [ 1626.609327]  ? ep_scan_ready_list.constprop.19+0x218/0x230
    [ 1626.609329]  ? __hrtimer_init+0xb0/0xb0
    [ 1626.609331]  ? _raw_spin_unlock_irq+0xa/0x20
    [ 1626.609334]  ? ep_poll+0x26c/0x4a0
    [ 1626.609337]  ? tcp_tsq_write.part.54+0xa0/0xa0
    [ 1626.609339]  ? release_sock+0x43/0x90
    [ 1626.609341]  ? _raw_spin_unlock_bh+0xa/0x20
    [ 1626.609342]  __sys_sendmsg+0x47/0x80
    [ 1626.609347]  do_syscall_64+0x5f/0x1c0
    [ 1626.609349]  ? prepare_exit_to_usermode+0x75/0xa0
    [ 1626.609351]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    This patch adds a new prctl command that daemons can use after they have
    done their initial setup, and before they start to do allocations that
    are in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE
    flags so both userspace block and FS threads can use it to avoid the
    allocation recursion and try to avoid being throttled while
    writing out data to free up memory.
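
    As a rough illustration of how a daemon might use the new command, here
    is a minimal userspace sketch. The helper names io_flusher_set() and
    io_flusher_get() are hypothetical and not part of this patch. The
    fallback values 57 and 58 match linux/prctl.h and are only needed when
    building against older headers. The sketch assumes the caller holds
    CAP_SYS_RESOURCE; without it the set call is expected to fail with
    EPERM, and kernels without this patch are expected to return EINVAL.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values as in linux/prctl.h. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Hypothetical helper: mark (enable != 0) or unmark (enable == 0) the
 * calling thread as an IO flusher. The unused prctl arguments must be 0.
 * Returns 0 on success, -1 on failure with errno preserved.
 */
static int io_flusher_set(int enable)
{
    unsigned long val = enable ? 1UL : 0UL;

    if (prctl(PR_SET_IO_FLUSHER, val, 0UL, 0UL, 0UL) == -1) {
        int saved = errno;

        fprintf(stderr, "PR_SET_IO_FLUSHER failed: %s\n", strerror(saved));
        errno = saved;
        return -1;
    }
    return 0;
}

/*
 * Hypothetical helper: query the IO flusher state of the calling thread.
 * Returns 1 if set, 0 if not set, -1 on failure with errno set.
 */
static int io_flusher_get(void)
{
    return prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL);
}
```

    A daemon would call io_flusher_set(1) once its initial setup is done
    and before it enters the loop that services IO, as described above.
    Treating a failure as non-fatal (log it and continue) is one possible
    design choice so the same binary still runs on older kernels.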
    
    Signed-off-by: Mike Christie <mchristi@redhat.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Tested-by: Masato Suzuki <masato.suzuki@wdc.com>
    Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Link: https://lore.kernel.org/r/20191112001900.9206-1-mchristi@redhat.com
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>