    md: use memalloc scope APIs in mddev_suspend()/mddev_resume() · 78f57ef9
    Coly Li authored
    In raid5.c:resize_chunk(), scribble_alloc() is called with the
    GFP_NOIO flag, which it then passes on to kvmalloc_array().
    
    The problem is that kvmalloc_array() eventually calls kvmalloc_node(),
    which does not support flags that are not GFP_KERNEL compatible, such
    as GFP_NOIO. In that case kvmalloc_node() only calls kmalloc_node(),
    which allocates physically contiguous pages, and never falls back to
    vmalloc. When system memory is under heavy pressure and the requested
    size is large, allocating contiguous pages is likely to fail.
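
The flag check described above can be modelled in userspace as follows. This is only a sketch: the bit values are illustrative rather than the kernel's, and may_fall_back_to_vmalloc() is a hypothetical helper name standing in for the test that kvmalloc_node() performs on its flags argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the real ones are defined by the kernel. */
typedef unsigned int gfp_t;
#define __GFP_IO             0x1u
#define __GFP_FS             0x2u
#define __GFP_DIRECT_RECLAIM 0x4u
#define __GFP_KSWAPD_RECLAIM 0x8u
#define __GFP_RECLAIM (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
#define GFP_NOIO   (__GFP_RECLAIM)
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)

/*
 * Hypothetical helper: the vmalloc fallback is only considered when
 * every GFP_KERNEL bit is present in the requested flags.
 */
static bool may_fall_back_to_vmalloc(gfp_t flags)
{
	return (flags & GFP_KERNEL) == GFP_KERNEL;
}

/* Small usage example for the two flag sets named in the message. */
static void fallback_demo(void)
{
	/* GFP_NOIO lacks __GFP_IO and __GFP_FS: kmalloc path only. */
	assert(!may_fall_back_to_vmalloc(GFP_NOIO));
	/* GFP_KERNEL has all required bits: vmalloc fallback possible. */
	assert(may_fall_back_to_vmalloc(GFP_KERNEL));
}
```

Because GFP_NOIO is a strict subset of GFP_KERNEL in this model, a large request made with it can only succeed if physically contiguous pages are available.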
    
    But simply calling kvmalloc_array() with the GFP_KERNEL flag is also
    problematic. In the code path where scribble_alloc() is called, the
    raid array is suspended. If kvmalloc_node() triggers memory reclaim
    I/Os and those I/Os go back to the suspended raid array, a deadlock
    will happen.
    
    What is desired here is to allocate virtually contiguous (not
    necessarily physically contiguous) pages while avoiding memory reclaim
    I/Os. Michal Hocko suggests using the memalloc scope APIs to restrict
    memory reclaim I/O in the allocating context, specifically calling
    memalloc_noio_save() when suspending the raid array and
    memalloc_noio_restore() when resuming it.
    
    This patch adds the memalloc scope APIs to mddev_suspend() and
    mddev_resume() to restrict memory reclaim I/Os while the raid array
    is suspended. The benefit of adding the memalloc scope APIs at the
    unified entry points mddev_suspend()/mddev_resume() is that, no matter
    which md raid array type (personality) is in use, the deadlock caused
    by recursive memory reclaim I/O cannot happen in the suspending
    context.
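
A minimal userspace sketch of the scope pattern follows. It is not the patch itself: task_flags stands in for the per-task flags word, the bit values are illustrative, and struct mddev_model, its noio_flag field and the *_model functions are hypothetical names chosen for this example.

```c
#include <assert.h>

typedef unsigned int gfp_t;
#define __GFP_IO      0x1u	/* illustrative values */
#define __GFP_FS      0x2u
#define __GFP_RECLAIM 0x4u
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define PF_MEMALLOC_NOIO 0x1u

/* Stand-in for the flags word of the currently running task. */
static unsigned int task_flags;

/* Enter the NOIO scope and return the previous state of the bit. */
static unsigned int memalloc_noio_save(void)
{
	unsigned int old = task_flags & PF_MEMALLOC_NOIO;

	task_flags |= PF_MEMALLOC_NOIO;
	return old;
}

/* Leave the scope by putting the saved bit back. */
static void memalloc_noio_restore(unsigned int old)
{
	assert((old & ~PF_MEMALLOC_NOIO) == 0);
	task_flags = (task_flags & ~PF_MEMALLOC_NOIO) | old;
}

/* Flags the allocator effectively honours for the current task. */
static gfp_t current_gfp_context(gfp_t flags)
{
	if (task_flags & PF_MEMALLOC_NOIO)
		flags &= ~(__GFP_IO | __GFP_FS);
	return flags;
}

/* Hypothetical model of the array state touched by suspend/resume. */
struct mddev_model {
	int suspended;
	unsigned int noio_flag;
};

static void mddev_suspend_model(struct mddev_model *m)
{
	m->suspended = 1;
	m->noio_flag = memalloc_noio_save();	/* scope opens here */
}

static void mddev_resume_model(struct mddev_model *m)
{
	memalloc_noio_restore(m->noio_flag);	/* scope closes here */
	m->suspended = 0;
}
```

In this model a caller can keep passing GFP_KERNEL, so the vmalloc fallback stays available, while reclaim triggered between suspend and resume is prevented from issuing I/O.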
    
    Please note that the memalloc scope APIs only take effect in the
    context that suspends the raid array. If memory is allocated from a
    kthread newly created after the raid array is suspended, the
    recursive memory reclaim I/Os won't be restricted. The
    mddev_suspend()/mddev_resume() entry points guard the critical
    section in which the raid metadata is being modified; creating a
    kthread to allocate memory inside that critical section is unusual
    and very probably buggy.
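
The per-context nature of the scope can be illustrated with a small model. This is a sketch under stated assumptions: struct task_model, enter_noio_scope() and spawn_thread_model() are hypothetical names, the flag value is illustrative, and the new thread is assumed to start with a clean flags word, as the paragraph above describes.

```c
#include <assert.h>
#include <stdbool.h>

#define PF_MEMALLOC_NOIO 0x1u	/* illustrative value */

/* Hypothetical stand-in for a task and its private flags word. */
struct task_model {
	unsigned int flags;
};

/* Reclaim may issue I/O unless this task is inside a NOIO scope. */
static bool reclaim_io_allowed(const struct task_model *t)
{
	return !(t->flags & PF_MEMALLOC_NOIO);
}

/* Mark one specific task as being inside the NOIO scope. */
static void enter_noio_scope(struct task_model *t)
{
	t->flags |= PF_MEMALLOC_NOIO;
}

/* Assumption: a newly created thread does not carry the scope bit. */
static struct task_model spawn_thread_model(void)
{
	struct task_model t = { .flags = 0 };

	return t;
}

/* The suspending task is protected; a thread it starts later is not. */
static void scope_demo(void)
{
	struct task_model suspender = { .flags = 0 };
	struct task_model worker;

	enter_noio_scope(&suspender);
	worker = spawn_thread_model();

	assert(!reclaim_io_allowed(&suspender));
	assert(reclaim_io_allowed(&worker));
}
```

The flag lives in the task that called the save function, so an allocation made by any other task is outside the scope regardless of the array's suspended state.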
    
    Fixes: b330e6a4 ("md: convert to kvmalloc")
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Song Liu <songliubraving@fb.com>