[PATCH] Hotplug CPU sched_balance_exec Fix

From: Rusty Russell <rusty@rustcorp.com.au> From: Srivatsa Vaddagiri <vatsa@in.ibm.com> From: Andrew Morton <akpm@osdl.org> From: Rusty Russell <rusty@rustcorp.com.au> We want to get rid of lock_cpu_hotplug() in sched_migrate_task. Found that lockless migration of execing task is _extremely_ racy. The races I hit are described below, alongwith probable solutions. Task migration done elsewhere should be safe (?) since they either hold the lock (sys_sched_setaffinity) or are done entirely with preemption disabled (load_balance). sched_balance_exec does: a. disables preemption b. finds new_cpu for current c. enables preemption d. calls sched_migrate_task to migrate current to new_cpu and sched_migrate_task does: e. task_rq_lock(p) f. migrate_task(p, dest_cpu ..) (if we have to wait for migration thread) g. task_rq_unlock() h. wake_up_process(rq->migration_thread) i. wait_for_completion() Several things can happen here: 1. new_cpu can go down after h and before migration thread has got around to handle the request ==> we need to add a cpu_is_offline check in __migrate_task 2. new_cpu can go down between c and d or before f. ===> Even though this case is automatically handled by the above change (migrate_task being called on a running task, current, will delegate migration to migration thread), would it be good practice to avoid calling migrate_task in the first place itself when dest_cpu is offline. This means adding another cpu_is_offline check after e in sched_migrate_task 3. The 'current' task can get preempted _immediately_ after g and when it comes back, task_cpu(p) can be dead. In which case, it is invalid to do wake_up on a non-existent migration thread. (rq->migration_thread can be NULL). ===> We should disable preemption thr' g and h 4. Before migration thread gets around to handle the request, its cpu goes dead. This will leave unhandled migration requests in the dead cpu. ===> We need to wakeup sleeping requestors (if any) in CPU_DEAD notification. I really wonder if we can get rid of these issues by avoiding balancing at exec time and instead have it balanced during load_balance ..Alternately if this is valuable and we want to retain it, I think we still need to consider a read/write sem, with sched_migrate_task doing down_read_trylock. This may eliminate the deadlock I hit between cpu_up and CPU_UP_PREPARE notification, which had forced me away from r/w sem. Anyway patch below addresses the above races. Its against 2.6.6-rc2-mm1 and has been tested on a 4way Intel Pentium SMP m/c. Rusty sez: Two other changes: 1) I grabbed a reference to the thread, rather than using preempt_disable(). It's the more obvious way I think. 2) Why the wait_to_die code? It might be needed if we move tasks after stop_machine, but for nowI don't see the problem with the migration thread running on the wrong CPU for a bit: nothing is on this runqueue so active_load_balance is safe, and __migrate task will be a noop (due to cpu_is_offline() check). If there is a problem, your fix is racy, because we could be preempted immediately afterwards. So I just stop the kthread then wakeup any remaining...

[PATCH] Hotplug CPU sched_balance_exec Fix
From: Rusty Russell <rusty@rustcorp.com.au> From: Srivatsa Vaddagiri <vatsa@in.ibm.com> From: Andrew Morton <akpm@osdl.org> From: Rusty Russell <rusty@rustcorp.com.au> We want to get rid of lock_cpu_hotplug() in sched_migrate_task. Found that lockless migration of execing task is _extremely_ racy. The races I hit are described below, alongwith probable solutions. Task migration done elsewhere should be safe (?) since they either hold the lock (sys_sched_setaffinity) or are done entirely with preemption disabled (load_balance). sched_balance_exec does: a. disables preemption b. finds new_cpu for current c. enables preemption d. calls sched_migrate_task to migrate current to new_cpu and sched_migrate_task does: e. task_rq_lock(p) f. migrate_task(p, dest_cpu ..) (if we have to wait for migration thread) g. task_rq_unlock() h. wake_up_process(rq->migration_thread) i. wait_for_completion() Several things can happen here: 1. new_cpu can go down after h and before migration thread has got around to handle the request ==> we need to add a cpu_is_offline check in __migrate_task 2. new_cpu can go down between c and d or before f. ===> Even though this case is automatically handled by the above change (migrate_task being called on a running task, current, will delegate migration to migration thread), would it be good practice to avoid calling migrate_task in the first place itself when dest_cpu is offline. This means adding another cpu_is_offline check after e in sched_migrate_task 3. The 'current' task can get preempted _immediately_ after g and when it comes back, task_cpu(p) can be dead. In which case, it is invalid to do wake_up on a non-existent migration thread. (rq->migration_thread can be NULL). ===> We should disable preemption thr' g and h 4. Before migration thread gets around to handle the request, its cpu goes dead. This will leave unhandled migration requests in the dead cpu. ===> We need to wakeup sleeping requestors (if any) in CPU_DEAD notification. I really wonder if we can get rid of these issues by avoiding balancing at exec time and instead have it balanced during load_balance ..Alternately if this is valuable and we want to retain it, I think we still need to consider a read/write sem, with sched_migrate_task doing down_read_trylock. This may eliminate the deadlock I hit between cpu_up and CPU_UP_PREPARE notification, which had forced me away from r/w sem. Anyway patch below addresses the above races. Its against 2.6.6-rc2-mm1 and has been tested on a 4way Intel Pentium SMP m/c. Rusty sez: Two other changes: 1) I grabbed a reference to the thread, rather than using preempt_disable(). It's the more obvious way I think. 2) Why the wait_to_die code? It might be needed if we move tasks after stop_machine, but for nowI don't see the problem with the migration thread running on the wrong CPU for a bit: nothing is on this runqueue so active_load_balance is safe, and __migrate task will be a noop (due to cpu_is_offline() check). If there is a problem, your fix is racy, because we could be preempted immediately afterwards. So I just stop the kthread then wakeup any remaining...
91bc0bf7 · Andrew Morton · Linus Torvalds · 850f7d78 · 91bc0bf7
Commit 91bc0bf7 authored May 09, 2004 by Andrew Morton Committed by Linus Torvalds May 09, 2004
Show whitespace changes
Inline Side-by-side

Showing with 28 additions and 9 deletions

kernel/sched.c kernel/sched.c +28 -9

No files found.
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1124,16 +1124,19 @@ static void sched_migrate_task(task_t *p, int dest_cpu)
 	runqueue_t *rq;
 	unsigned long flags;

-	lock_cpu_hotplug();
 	rq = task_rq_lock(p, &flags);
-	if (!cpu_isset(dest_cpu, p->cpus_allowed))
+	if (!cpu_isset(dest_cpu, p->cpus_allowed)
+	    || unlikely(cpu_is_offline(dest_cpu)))
 		goto out;

 	/* force the process onto the specified CPU */
 	if (migrate_task(p, dest_cpu, &req)) {
-		/* Need to wait for migration thread. */
+		/* Need to wait for migration thread (might exit: take ref). */
+		struct task_struct *mt = rq->migration_thread;
+		get_task_struct(mt);
 		task_rq_unlock(rq, &flags);
-		wake_up_process(rq->migration_thread);
+		wake_up_process(mt);
+		put_task_struct(mt);
 		wait_for_completion(&req.done);

 		/*
@@ -1142,13 +1145,11 @@ static void sched_migrate_task(task_t *p, int dest_cpu)
 		 * the migration.
 		 */
 		tlb_migrate_prepare(current->mm);
-		unlock_cpu_hotplug();

 		return;
 	}
 out:
 	task_rq_unlock(rq, &flags);
-	unlock_cpu_hotplug();
 }

 /*
@@ -3191,6 +3192,9 @@ static void __migrate_task(struct task_struct *p, int dest_cpu)
 {
 	runqueue_t *rq_dest;

+	if (unlikely(cpu_is_offline(dest_cpu)))
+		return;
+
 	rq_dest = cpu_rq(dest_cpu);

 	double_rq_lock(this_rq(), rq_dest);
@@ -3349,9 +3353,24 @@ static int migration_call(struct notifier_block *nfb, unsigned long action,
 		/* Unbind it from offline cpu so it can run.  Fall thru. */
 		kthread_bind(cpu_rq(cpu)->migration_thread,smp_processor_id());
 	case CPU_DEAD:
-		kthread_stop(cpu_rq(cpu)->migration_thread);
-		cpu_rq(cpu)->migration_thread = NULL;
- 		BUG_ON(cpu_rq(cpu)->nr_running != 0);
+		rq = cpu_rq(cpu);
+		kthread_stop(rq->migration_thread);
+		rq->migration_thread = NULL;
+		BUG_ON(rq->nr_running != 0);
+
+		/* No need to migrate the tasks: it was best-effort if
+		 * they didn't do lock_cpu_hotplug().  Just wake up
+		 * the requestors. */
+		spin_lock_irq(&rq->lock);
+		while (!list_empty(&rq->migration_queue)) {
+			migration_req_t *req;
+			req = list_entry(rq->migration_queue.next,
+					 migration_req_t, list);
+			BUG_ON(req->type != REQ_MOVE_TASK);
+			list_del_init(&req->list);
+			complete(&req->done);
+		}
+		spin_unlock_irq(&rq->lock);
 		break;
 #endif
 	}