    epoll: use refcount to reduce ep_mutex contention · 58c9b016
    Paolo Abeni authored
    We are observing huge contention on the epmutex during an http
    connection/rate test:
    
     83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
    [...]
               |--66.96%--__fput
                          |--60.04%--eventpoll_release_file
                                     |--58.41%--__mutex_lock.isra.6
                                               |--56.56%--osq_lock
    
    The application is multi-threaded, creates a new epoll entry for
    each incoming connection, and does not delete it before the
    connection shutdown - that is, before the connection's fd close().
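
    The usage pattern described above can be sketched from userspace as
    follows. This is only an illustration, not the nginx code:
    register_and_close() is a hypothetical helper, and a pipe stands in
    for the accepted socket.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper mimicking one connection's lifetime: register an fd
 * with the epoll instance, then close it without EPOLL_CTL_DEL.
 * A pipe is used here in place of an accepted socket.
 * Returns 0 on success, -1 on failure.
 */
int register_and_close(int epfd)
{
	int fds[2];
	struct epoll_event ev = { .events = EPOLLIN };

	assert(epfd >= 0);

	if (pipe(fds) < 0)
		return -1;

	ev.data.fd = fds[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	/*
	 * No explicit EPOLL_CTL_DEL: the final close() of the monitored file
	 * leaves the kernel to drop the registration on its own, which is the
	 * __fput() -> eventpoll_release_file() path shown in the trace.
	 */
	close(fds[0]);
	close(fds[1]);
	return 0;
}
```

    When many threads run this sequence concurrently, each close() goes
    through the same release path.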
    
    Many different threads compete frequently for the epmutex lock,
    affecting the overall performance.
    
    To reduce the contention this patch introduces explicit reference counting
    for the eventpoll struct. Each registered event acquires a reference,
    and references are released at ep_remove() time.
    
    The eventpoll struct is released by whoever - among EP file close() and
    the monitored file close() - drops its last reference.
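
    A minimal userspace model of this ownership rule is sketched below. It
    is not the kernel code: struct ep_model and its helpers are
    hypothetical names, C11 atomics stand in for the kernel's refcounting
    primitives, and the initial reference held on behalf of the EP file is
    an assumption of the model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the eventpoll struct. */
struct ep_model {
	atomic_uint refcount;	/* assumed: 1 for the EP file + 1 per item */
	bool *freed;		/* lets the caller observe the release */
};

/* Allocate the model with a single reference, owned by the EP file. */
struct ep_model *ep_model_alloc(bool *freed)
{
	struct ep_model *ep = malloc(sizeof(*ep));

	if (!ep)
		return NULL;
	atomic_init(&ep->refcount, 1);
	ep->freed = freed;
	*freed = false;
	return ep;
}

/* Called when an event is registered: the new item pins the struct. */
void ep_model_get(struct ep_model *ep)
{
	unsigned int old = atomic_fetch_add_explicit(&ep->refcount, 1,
						     memory_order_relaxed);

	/* Taking a reference on an already released object is a bug. */
	assert(old > 0);
	(void)old;
}

/*
 * Drop one reference. Whoever drops the last one frees the struct,
 * regardless of which close() path it is running in.
 * Returns true if this call performed the release.
 */
bool ep_model_put(struct ep_model *ep)
{
	if (atomic_fetch_sub_explicit(&ep->refcount, 1,
				      memory_order_acq_rel) != 1)
		return false;

	*ep->freed = true;
	free(ep);
	return true;
}
```

    In this model, closing the EP file while one item is still registered
    leaves the struct alive, and the later removal of that item releases
    it.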
    
    Additionally, this introduces a new 'dying' flag to prevent races between
    the EP file close() and the monitored file close().
    eventpoll_release_file() marks, under f_lock spinlock, each epitem as dying
    before removing it, while EP file close() does not touch dying epitems.
    
    The above is needed as both close operations could run concurrently and
    drop the EP reference acquired via the epitem entry. Without the above
    flag, the monitored file close() could reach the EP struct via the epitem
    list while the epitem is still listed and then try to put it after its
    disposal.
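
    The interaction between the two close paths can be modelled with the
    simplified userspace sketch below. All names are hypothetical, an
    atomic_flag spinlock stands in for f_lock, and the model ignores the
    other locks the kernel holds; it only shows that an item marked as
    dying is left to the monitored file close() path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an epitem. */
struct item_model {
	atomic_flag lock;	/* stands in for the f_lock spinlock */
	bool dying;		/* set by the monitored file close() path */
	bool linked;		/* item is still registered with the EP */
};

void item_init(struct item_model *it)
{
	assert(it != NULL);
	atomic_flag_clear(&it->lock);
	it->dying = false;
	it->linked = true;
}

static void item_lock(struct item_model *it)
{
	while (atomic_flag_test_and_set_explicit(&it->lock,
						 memory_order_acquire))
		;	/* spin */
}

static void item_unlock(struct item_model *it)
{
	atomic_flag_clear_explicit(&it->lock, memory_order_release);
}

/* First step of the monitored file close(): claim the item. */
void item_mark_dying(struct item_model *it)
{
	item_lock(it);
	it->dying = true;
	item_unlock(it);
}

/*
 * Unlink the item unless it is dying and the caller is not the path
 * that marked it (force == false). Returns true if this call unlinked it.
 */
bool item_try_unlink(struct item_model *it, bool force)
{
	bool removed = false;

	item_lock(it);
	if (it->linked && (force || !it->dying)) {
		it->linked = false;
		removed = true;
	}
	item_unlock(it);
	return removed;
}

/* Monitored file close(): mark as dying, then remove unconditionally. */
bool file_close_path(struct item_model *it)
{
	item_mark_dying(it);
	return item_try_unlink(it, true);
}

/* EP file close(): never touches dying items. */
bool ep_close_path(struct item_model *it)
{
	return item_try_unlink(it, false);
}
```

    Because exactly one path unlinks a given item, exactly one path drops
    the EP reference that the item holds.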
    
    An alternative could be avoiding touching the references acquired via
    the epitems at EP file close() time, but that could leave the EP struct
    alive for potentially unlimited time after EP file close(), with nasty
    side effects.
    
    With all the above in place, we can drop the epmutex usage at disposal time.
    
    Overall this produces a significant performance improvement in the
    mentioned connection/rate scenario: the mutex operations disappear from
    the topmost offenders in the perf report, and the measured connections/rate
    grows by ~60%.
    
    To make the change more readable, this additionally renames ep_free() to
    ep_clear_and_put(), and moves the actual memory cleanup into a separate
    ep_free() helper.
    
    Link: https://lkml.kernel.org/r/4a57788dcaf28f5eb4f8dfddcc3a8b172a7357bb.1679504153.git.pabeni@redhat.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Co-developed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Tested-by: Xiumei Mu <xmu@redhiat.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Carlos Maiolino <cmaiolino@redhat.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Eric Biggers <ebiggers@kernel.org>
    Cc: Jacob Keller <jacob.e.keller@intel.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>