• Linus Torvalds's avatar
    mm: avoid 'might_sleep()' in get_mmap_lock_carefully() · 4542057e
    Linus Torvalds authored
    This might_sleep() goes back a long time: it was originally introduced
    way back when by commit 01006074 ("x86: add might_sleep() to
    do_page_fault()"), and made it into the generic VM code when the x86
    fault path got re-organized and generalized in commit c2508ec5 ("mm:
    introduce new 'lock_mm_and_find_vma()' page fault helper").
    
    However, it turns out that the placement of that might_sleep() has
    always been rather questionable simply because it's not only a debug
    statement to warn about sleeping in contexts that shouldn't sleep (which
    was the original reason for adding it), but it also implies a voluntary
    scheduling point.
    
    That, in turn, is less than desirable for two reasons:
    
     (a) it ends up being done after we successfully got the mmap_lock, so
         just as we got the lock we will now eagerly schedule away and
         increase lock contention
    
    and
    
     (b) this is all very possibly part of the "oops, things went horribly
         wrong" path and we just haven't figured that out yet
    
    After all, the whole _reason_ for having that get_mmap_lock_carefully()
    rather than just doing the obvious mmap_read_lock() is because this code
    wants to deal somewhat gracefully with potential kernel wild pointer
    bugs.
    
    So then a voluntary scheduling point here is simply not a good idea.
    
    We could certainly turn the 'might_sleep()' into a '__might_sleep()' and
    make it be just the debug check that it was originally intended to be.
    
    But even that seems questionable in the wild kernel pointer case - which
    again is part of the whole point of this code.  The problem wouldn't be
    about the _sleeping_ part of the page fault, but about a bad kernel
    access.  The fact that that bad kernel access might happen in a section
    that you shouldn't sleep in is secondary.
    
    So it really ends up being the case that this is simply entirely the
    wrong place to do this debug check and related scheduling point at all.
    
    So let's just remove the check entirely.  It's been around for over a
    decade, it has served its purpose.
    
    The re-schedule will happen at return to user space anyway for the
    normal case, and the warning - if we even need it - might be better off
    done as a special case for "page fault from kernel mode" once we've
    dealt with any potential kernel oopses where the oops is the relevant
    thing, not some artificial "scheduling while atomic" test.
    Reported-by: default avatarMateusz Guzik <mjguzik@gmail.com>
    Link: https://lore.kernel.org/lkml/20230820104303.2083444-1-mjguzik@gmail.com/
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    4542057e
memory.c 165 KB