Commit 2f242854 (parent 5a930dd9), authored by Andrew Morton, committed by Linus Torvalds

[PATCH] filtered wakeups

From: William Lee Irwin III <wli@holomorphy.com>

This patch series solves the "thundering herd" problem that occurs in the
mainline implementation of hashed waitqueues.  There are two sources of
spurious wakeups in such arrangements:

(a) Hash collisions that place waiters on different objects on the same
    waitqueue.  When any of the objects hashed to the queue receives a
    wakeup, threads waiting on the other objects are woken falsely, i.e.
    information about which object a wakeup event is related to is lost.

(b) Loss of information about which object a given waiter is waiting on.
    This precludes wake-one semantics for mutual exclusion scenarios.  For
    instance, a lock bit may be slept on.  If there are any waiters on the
    object, a lock bit release event must wake at least one of them so as to
    prevent deadlock.  But without information as to which waiter is waiting
    on which object, we must resort to waking all waiters who could possibly
    be waiting on it.  Now, as the lock bit provides mutual exclusion, only
    one of the waiters woken can proceed, and the remainder will go back to
    sleep and wait for another event, creating unnecessary system load.  Once
    wake-one semantics are established, only one of the waiters waiting to
    acquire a lock bit needs to be woken, which measurably reduces system load
    and improves efficiency (this is the subject of the benchmarking I've
    been sending to you).  A sketch of such a filtering wake function follows
    this list.
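
To make (b) concrete, here is a minimal sketch of a filtering wake function,
written against the new callback signature this patch introduces.  It is
illustrative only: the keyed_wait structure and both function names are
invented for the example and are not part of the patch.

#include <linux/kernel.h>	/* container_of() */
#include <linux/wait.h>

/* Hypothetical per-waiter record: remembers which object the waiter
 * actually sleeps on, even though hash collisions may put waiters for
 * many objects on the same waitqueue. */
struct keyed_wait {
	wait_queue_t wait;
	void *object;
};

static int keyed_wake_function(wait_queue_t *wait, unsigned mode,
			       int sync, void *key)
{
	struct keyed_wait *kw = container_of(wait, struct keyed_wait, wait);

	/* The waker passes the object being signalled as "key".  A waiter
	 * for a different object reports "not woken" by returning 0, so
	 * __wake_up_common does not count it against nr_exclusive, and
	 * wake-one semantics hold among the waiters that do match. */
	if (key && kw->object != key)
		return 0;
	return autoremove_wake_function(wait, mode, sync, key);
}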

Even beyond the measurable efficiency gains, there are reasons of robustness
and responsiveness to motivate addressing the issue of thundering herds.  In a
real-life scenario I've been personally involved in resolving, the thundering
herd issue caused powerful modern SMP machines with fast IO systems to be
unresponsive to user input for a minute at a time or more.  Analogues of these
patches for the distro kernels involved fully resolved the issue to the
customer's satisfaction and obviated workarounds to limit the pagecache's
size.

The latest spin of these patches basically shoves more pieces of the logic
into the wakeup functions, with some efficiency gains from sharing the hot
codepath with the rest of the kernel, and a slightly larger diff than the
patches with the newly-introduced entrypoint.  Writing these was motivated by
the push to insulate sched.c from more of the details of wakeup semantics by
putting more of the logic into the wakeup functions.  In order to accomplish
this while still solving (b), the wakeup functions grew a new argument
through which the waker can communicate to the wakeup function which object
a wakeup event is related to.
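
On the waiter side this means supplying a custom callback instead of the
default.  A hedged sketch, reusing the hypothetical keyed_wait from the
example above; prepare_to_wait_exclusive() and finish_wait() are the existing
helpers visible in the hunks below:

#include <linux/sched.h>
#include <linux/wait.h>

static void wait_on_object(wait_queue_head_t *hashed_queue, void *obj)
{
	struct keyed_wait kw = {
		.wait = {
			.task		= current,
			.func		= keyed_wake_function,
			.task_list	= LIST_HEAD_INIT(kw.wait.task_list),
		},
		.object = obj,
	};

	/* Queue as an exclusive waiter so one matching wakeup suffices. */
	prepare_to_wait_exclusive(hashed_queue, &kw.wait, TASK_UNINTERRUPTIBLE);
	/* ... re-test the wait condition here to avoid missed wakeups ... */
	schedule();
	finish_wait(hashed_queue, &kw.wait);
}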

=========

This patch provides an additional argument to wakeup functions so that
information may be passed from the waker to the waiter.  This is provided as a
separate patch so that the overhead of the additional argument can be measured
in isolation.  No change in performance was observable here.
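
For completeness, the waker side.  In this preparatory patch the key is
always NULL (__wake_up_common passes NULL in the final hunk below), so
filtering only becomes active with the follow-up patches; the sketch assumes
those later patches let callers hand a key through __wake_up():

/* Hypothetical waker: wake at most one exclusive waiter whose object
 * matches obj.  Assumes the rest of the series threads a caller-supplied
 * key through __wake_up(); with this patch alone the key is always NULL
 * and every callback behaves exactly as before. */
static void release_object(wait_queue_head_t *hashed_queue, void *obj)
{
	__wake_up(hashed_queue, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, obj);
}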
@@ -309,7 +309,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi);
 static int ep_unlink(struct eventpoll *ep, struct epitem *epi);
 static int ep_remove(struct eventpoll *ep, struct epitem *epi);
-static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync);
+static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key);
 static int ep_eventpoll_close(struct inode *inode, struct file *file);
 static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait);
 static int ep_collect_ready_items(struct eventpoll *ep,
@@ -1296,7 +1296,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
  * machanism. It is called by the stored file descriptors when they
  * have events to report.
  */
-static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync)
+static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
 {
 	int pwake = 0;
 	unsigned long flags;
@@ -17,8 +17,8 @@
 #include <asm/system.h>
 
 typedef struct __wait_queue wait_queue_t;
-typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode, int sync);
-extern int default_wake_function(wait_queue_t *wait, unsigned mode, int sync);
+typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode, int sync, void *key);
+int default_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 
 struct __wait_queue {
 	unsigned int flags;
@@ -240,7 +240,7 @@ void FASTCALL(prepare_to_wait(wait_queue_head_t *q,
 void FASTCALL(prepare_to_wait_exclusive(wait_queue_head_t *q,
 						wait_queue_t *wait, int state));
 void FASTCALL(finish_wait(wait_queue_head_t *q, wait_queue_t *wait));
-int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync);
+int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 
 #define DEFINE_WAIT(name)						\
 	wait_queue_t name = {						\
@@ -197,9 +197,9 @@ void fastcall finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
 
 EXPORT_SYMBOL(finish_wait);
 
-int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync)
+int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
 {
-	int ret = default_wake_function(wait, mode, sync);
+	int ret = default_wake_function(wait, mode, sync, key);
 
 	if (ret)
 		list_del_init(&wait->task_list);
@@ -2302,7 +2302,7 @@ asmlinkage void __sched preempt_schedule(void)
 EXPORT_SYMBOL(preempt_schedule);
 #endif /* CONFIG_PREEMPT */
 
-int default_wake_function(wait_queue_t *curr, unsigned mode, int sync)
+int default_wake_function(wait_queue_t *curr, unsigned mode, int sync, void *key)
 {
 	task_t *p = curr->task;
 	return try_to_wake_up(p, mode, sync);
@@ -2329,7 +2329,7 @@ static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
 		unsigned flags;
 		curr = list_entry(tmp, wait_queue_t, task_list);
 		flags = curr->flags;
-		if (curr->func(curr, mode, sync) &&
+		if (curr->func(curr, mode, sync, NULL) &&
 		    (flags & WQ_FLAG_EXCLUSIVE) &&
 		    !--nr_exclusive)
 			break;