    net: Introduce sk_use_task_frag in struct sock.
    Sockets that can be used while recursing into memory reclaim, like
    those used by network block devices and file systems, mustn't use
    current->task_frag: if the current process is already using it, then
    the inner memory reclaim call would corrupt the task_frag structure.
    
    To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
    that mustn't use current->task_frag, assuming that those used during
    memory reclaim had their allocation constraints reflected in
    ->sk_allocation.
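    
    For reference, the current heuristic in sk_page_frag() looks roughly
    like this (sketch of the include/net/sock.h helper; the actual gfp
    check is done by gfpflags_normal_context()):
    
        /* Sketch of the pre-patch logic: the choice of page_frag depends
         * only on ->sk_allocation.
         */
        static inline struct page_frag *sk_page_frag(struct sock *sk)
        {
                if (gfpflags_normal_context(sk->sk_allocation))
                        /* Looks like a normal sleepable context:
                         * safe to share current->task_frag.
                         */
                        return &current->task_frag;
    
                /* Socket may be used from memory reclaim:
                 * keep using its private fragment.
                 */
                return &sk->sk_frag;
        }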
    
    This unfortunately doesn't cover all cases: in an attempt to remove all
    usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
    ->sk_allocation, and used memalloc_nofs critical sections instead.
    This breaks the sk_page_frag() heuristic since the allocation
    constraints are now stored in current->flags, which sk_page_frag()
    can't read without risking a cache miss that would slow down TCP's
    fast path.
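    
    To illustrate the mismatch, a memalloc_nofs critical section only sets
    a flag in current->flags (hypothetical caller sketch;
    memalloc_nofs_save() and memalloc_nofs_restore() are the real helpers):
    
        unsigned int nofs_flags;
    
        /* Sets PF_MEMALLOC_NOFS in current->flags, not in ->sk_allocation. */
        nofs_flags = memalloc_nofs_save();
    
        /* ... socket I/O: sk->sk_allocation can stay GFP_KERNEL, so
         * sk_page_frag() has no way to see the NOFS constraint ...
         */
    
        memalloc_nofs_restore(nofs_flags);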
    
    This patch creates a new field in struct sock, named sk_use_task_frag,
    which sockets with memory reclaim constraints can set to false if they
    can't safely use current->task_frag. In such cases, sk_page_frag() now
    always returns the socket's page_frag (->sk_frag). The first user is
    sunrpc, which needs to avoid using current->task_frag but can keep
    ->sk_allocation set to GFP_KERNEL otherwise.
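    
    With the new field, the helper becomes (sketch matching the
    description above):
    
        /* Post-patch sketch: ->sk_use_task_frag gates the use of
         * current->task_frag.
         */
        static inline struct page_frag *sk_page_frag(struct sock *sk)
        {
                if (sk->sk_use_task_frag &&
                    gfpflags_normal_context(sk->sk_allocation))
                        return &current->task_frag;
    
                return &sk->sk_frag;
        }
    
    A socket with reclaim constraints then only needs to clear the bit
    once at setup time (sk->sk_use_task_frag = false) and can leave
    ->sk_allocation untouched.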
    
    Eventually, it might be possible to simplify sk_page_frag() by only
    testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
    heuristic entirely (assuming other sockets will set ->sk_use_task_frag
    according to their constraints in the future).
    
    The new ->sk_use_task_frag field is placed in a hole in struct sock and
    belongs to a cache line shared with ->sk_shutdown. Therefore it should
    be hot and shouldn't have negative performance impacts on TCP's fast
    path (sk_shutdown is tested just before the while() loop in
    tcp_sendmsg_locked()).
    
    Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
    Signed-off-by: Guillaume Nault <gnault@redhat.com>
    Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>