• Tejun Heo's avatar
    net: fix sk_page_frag() recursion from memory reclaim · 20eb4f29
    Tejun Heo authored
    sk_page_frag() optimizes skb_frag allocations by using per-task
    skb_frag cache when it knows it's the only user.  The condition is
    determined by seeing whether the socket allocation mask allows
    blocking - if the allocation may block, it obviously owns the task's
    context and ergo exclusively owns current->task_frag.
    
    Unfortunately, this misses recursion through memory reclaim path.
    Please take a look at the following backtrace.
    
     [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
         ...
         tcp_sendmsg+0x27/0x40
         sock_sendmsg+0x30/0x40
         sock_xmit.isra.24+0xa1/0x170 [nbd]
         nbd_send_cmd+0x1d2/0x690 [nbd]
         nbd_queue_rq+0x1b5/0x3b0 [nbd]
         __blk_mq_try_issue_directly+0x108/0x1b0
         blk_mq_request_issue_directly+0xbd/0xe0
         blk_mq_try_issue_list_directly+0x41/0xb0
         blk_mq_sched_insert_requests+0xa2/0xe0
         blk_mq_flush_plug_list+0x205/0x2a0
         blk_flush_plug_list+0xc3/0xf0
     [1] blk_finish_plug+0x21/0x2e
         _xfs_buf_ioapply+0x313/0x460
         __xfs_buf_submit+0x67/0x220
         xfs_buf_read_map+0x113/0x1a0
         xfs_trans_read_buf_map+0xbf/0x330
         xfs_btree_read_buf_block.constprop.42+0x95/0xd0
         xfs_btree_lookup_get_block+0x95/0x170
         xfs_btree_lookup+0xcc/0x470
         xfs_bmap_del_extent_real+0x254/0x9a0
         __xfs_bunmapi+0x45c/0xab0
         xfs_bunmapi+0x15/0x30
         xfs_itruncate_extents_flags+0xca/0x250
         xfs_free_eofblocks+0x181/0x1e0
         xfs_fs_destroy_inode+0xa8/0x1b0
         destroy_inode+0x38/0x70
         dispose_list+0x35/0x50
         prune_icache_sb+0x52/0x70
         super_cache_scan+0x120/0x1a0
         do_shrink_slab+0x120/0x290
         shrink_slab+0x216/0x2b0
         shrink_node+0x1b6/0x4a0
         do_try_to_free_pages+0xc6/0x370
         try_to_free_mem_cgroup_pages+0xe3/0x1e0
         try_charge+0x29e/0x790
         mem_cgroup_charge_skmem+0x6a/0x100
         __sk_mem_raise_allocated+0x18e/0x390
         __sk_mem_schedule+0x2a/0x40
     [0] tcp_sendmsg_locked+0x8eb/0xe10
         tcp_sendmsg+0x27/0x40
         sock_sendmsg+0x30/0x40
         ___sys_sendmsg+0x26d/0x2b0
         __sys_sendmsg+0x57/0xa0
         do_syscall_64+0x42/0x100
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    In [0], tcp_send_msg_locked() was using current->page_frag when it
    called sk_wmem_schedule().  It already calculated how many bytes can
    be fit into current->page_frag.  Due to memory pressure,
    sk_wmem_schedule() called into memory reclaim path which called into
    xfs and then IO issue path.  Because the filesystem in question is
    backed by nbd, the control goes back into the tcp layer - back into
    tcp_sendmsg_locked().
    
    nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes
    sense - it's in the process of freeing memory and wants to be able to,
    e.g., drop clean pages to make forward progress.  However, this
    confused sk_page_frag() called from [2].  Because it only tests
    whether the allocation allows blocking which it does, it now thinks
    current->page_frag can be used again although it already was being
    used in [0].
    
    After [2] used current->page_frag, the offset would be increased by
    the used amount.  When the control returns to [0],
    current->page_frag's offset is increased and the previously calculated
    number of bytes now may overrun the end of allocated memory leading to
    silent memory corruptions.
    
    Fix it by adding gfpflags_normal_context() which tests sleepable &&
    !reclaim and use it to determine whether to use current->task_frag.
    
    v2: Eric didn't like gfp flags being tested twice.  Introduce a new
        helper gfpflags_normal_context() and combine the two tests.
    Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Eric Dumazet <eric.dumazet@gmail.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    20eb4f29
sock.h 72.7 KB