    [PATCH] Hotplug CPU sched_balance_exec Fix · 91bc0bf7
    Andrew Morton authored
    From: Rusty Russell <rusty@rustcorp.com.au>
    
    From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
    From: Andrew Morton <akpm@osdl.org>
    
    We want to get rid of lock_cpu_hotplug() in sched_migrate_task.  I found
    that lockless migration of an execing task is _extremely_ racy.  The
    races I hit are described below, along with probable solutions.
    
    Task migration done elsewhere should be safe (?) since those paths either
    hold the lock (sys_sched_setaffinity) or run entirely with preemption
    disabled (load_balance).
    
       sched_balance_exec does:
    
    	a. disables preemption
    	b. finds new_cpu for current
    	c. enables preemption
    	d. calls sched_migrate_task to migrate current to new_cpu
    
       and sched_migrate_task does:
    
    	e. task_rq_lock(p)
    	f. migrate_task(p, dest_cpu ..)
    		(if we have to wait for migration thread)
    		g. task_rq_unlock()
    		h. wake_up_process(rq->migration_thread)
    		i. wait_for_completion()
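
    The flow above can be sketched in userspace C with stubbed helpers
    (hypothetical names; the real kernel code is more involved).  The comment
    at step c marks the first race window:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Stubs standing in for kernel primitives (illustrative only). */
    static int preempt_count;
    static void preempt_disable(void) { preempt_count++; }
    static void preempt_enable(void)  { preempt_count--; }

    static int sched_best_cpu(void) { return 1; }   /* step b: pick new_cpu */

    static int migrations;
    static void sched_migrate_task(int dest_cpu)    /* steps e..i, simplified */
    {
        /* e: task_rq_lock(p); f: migrate_task(p, dest_cpu, &req);
           g: task_rq_unlock(); h: wake_up_process(...);
           i: wait_for_completion() */
        migrations++;
        printf("migrating current to CPU %d\n", dest_cpu);
    }

    static void sched_balance_exec(void)
    {
        preempt_disable();                  /* a */
        int new_cpu = sched_best_cpu();     /* b */
        preempt_enable();                   /* c: new_cpu may go offline here */
        sched_migrate_task(new_cpu);        /* d */
    }

    int main(void)
    {
        sched_balance_exec();
        assert(migrations == 1 && preempt_count == 0);
        return 0;
    }
    ```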
    
       Several things can happen here:
    
	1. new_cpu can go down after h and before the migration thread has
	   gotten around to handling the request
    
    	   ==> we need to add a cpu_is_offline check in __migrate_task
    
    	2. new_cpu can go down between c and d or before f.
    
	   ===> Even though this case is automatically handled by the above
	        change (migrate_task, called on a running task, current,
	        will delegate migration to the migration thread), it would be
	        good practice to avoid calling migrate_task in the first place
	        when dest_cpu is offline. This means adding another
	        cpu_is_offline check after e in sched_migrate_task
    
	3. The 'current' task can get preempted _immediately_ after
	   g, and when it comes back, task_cpu(p) can be dead, in
	   which case it is invalid to do a wake_up on a non-existent
	   migration thread (rq->migration_thread can be NULL).

	   ===> We should disable preemption through g and h
    
	4. Before the migration thread gets around to handling the request,
	   its cpu goes dead. This will leave unhandled migration requests
	   on the dead cpu.

	   ===> We need to wake up sleeping requestors (if any) in the
	        CPU_DEAD notification.
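
    The fix for races 1 and 2 boils down to an offline check on the
    destination before doing any work.  A minimal userspace sketch, with
    cpu_is_offline stubbed over a toy online map (not the kernel code):

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NR_CPUS 4
    static bool cpu_online_map[NR_CPUS];            /* stub for the online map */
    static bool cpu_is_offline(int cpu) { return !cpu_online_map[cpu]; }

    static int moved;                               /* migrations performed */

    /* Races 1 and 2: bail out if the destination went away. */
    static void __migrate_task(int dest_cpu)
    {
        if (cpu_is_offline(dest_cpu))
            return;                                 /* request becomes a no-op */
        moved++;
    }

    int main(void)
    {
        cpu_online_map[0] = true;                   /* CPU 1 stays offline */
        __migrate_task(1);                          /* dest dead: skipped */
        __migrate_task(0);                          /* dest alive: performed */
        assert(moved == 1);
        printf("moved=%d\n", moved);
        return 0;
    }
    ```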
    
    I really wonder if we can get rid of these issues by avoiding balancing at
    exec time and instead having it done during load_balance.  Alternately,
    if this is valuable and we want to retain it, I think we still need to
    consider a read/write sem, with sched_migrate_task doing down_read_trylock.
    This may eliminate the deadlock I hit between cpu_up and the CPU_UP_PREPARE
    notification, which had forced me away from the r/w sem.
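
    The try-read idea can be illustrated with a POSIX rwlock standing in for
    the kernel's r/w semaphore (names hypothetical; down_read_trylock is the
    kernel analogue of the trylock below):

    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>

    /* Writers = cpu_up/cpu_down; readers = exec-time migrators. */
    static pthread_rwlock_t cpu_hotplug_sem = PTHREAD_RWLOCK_INITIALIZER;

    /* sched_migrate_task would try-read; on failure it simply skips the
       exec-time balance instead of sleeping, sidestepping the deadlock. */
    static int try_migrate(void)
    {
        if (pthread_rwlock_tryrdlock(&cpu_hotplug_sem) != 0)
            return 0;                   /* hotplug in progress: give up */
        /* ... do the migration under the read lock ... */
        pthread_rwlock_unlock(&cpu_hotplug_sem);
        return 1;
    }

    int main(void)
    {
        assert(try_migrate() == 1);     /* no hotplug: migration proceeds */

        pthread_rwlock_wrlock(&cpu_hotplug_sem);    /* simulate cpu_down() */
        assert(try_migrate() == 0);     /* trylock fails: balance skipped */
        pthread_rwlock_unlock(&cpu_hotplug_sem);

        puts("ok");
        return 0;
    }
    ```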
    
    Anyway, the patch below addresses the above races.  It is against
    2.6.6-rc2-mm1 and has been tested on a 4-way Intel Pentium SMP machine.
    
    
    Rusty sez:
    
    Two other changes:
    1) I grabbed a reference to the thread, rather than using
    preempt_disable().  It's the more obvious way I think.
    
    2) Why the wait_to_die code?  It might be needed if we move tasks after
    stop_machine, but for now I don't see the problem with the migration
    thread running on the wrong CPU for a bit: nothing is on this runqueue
    so active_load_balance is safe, and __migrate_task will be a no-op (due
    to the cpu_is_offline() check).  If there is a problem, your fix is racy,
    because we could be preempted immediately afterwards.
    
    So I just stop the kthread, then wake up any remaining...
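
    Grabbing a reference instead of disabling preemption works because the
    task struct stays valid for as long as the requestor pins it, even if
    the CPU (and the runqueue's own reference) goes away.  A toy refcount
    sketch modeled on get_task_struct/put_task_struct (simplified, not the
    kernel implementation):

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Toy task_struct with a reference count. */
    struct task {
        int usage;
        int cpu;
    };

    static struct task *get_task_struct(struct task *t) { t->usage++; return t; }

    static void put_task_struct(struct task *t)
    {
        if (--t->usage == 0)
            free(t);            /* last reference: struct really goes away */
    }

    int main(void)
    {
        struct task *mt = malloc(sizeof(*mt));
        mt->usage = 1;          /* the runqueue holds one reference */
        mt->cpu = 2;

        /* Requestor pins the migration thread before dropping the rq lock. */
        struct task *pinned = get_task_struct(mt);

        /* CPU goes dead and the runqueue drops its reference ... */
        put_task_struct(mt);

        /* ... but the pinned pointer is still valid to wake or inspect. */
        assert(pinned->cpu == 2);
        put_task_struct(pinned);

        puts("ok");
        return 0;
    }
    ```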