    [PATCH] signal handling race condition causing reboot hangs · 3adaf93e
    Andrew Morton authored
    From: Ernie Petrides <petrides@redhat.com>
    
    (I can't get anyone to review this, but I'm sure there's a bug in there, and
    Ernie's patch has been in -mm for some time).
    
    
    There is a long-standing locking hole in the kernel's handling of the
    signals related to stopping and resuming processes.  When a process
    handles SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU, the "sighand" lock is
    held while the signal is dequeued and appropriate masks are updated.
    But the "sighand" lock is dropped in several cases before the task's
    state is changed to TASK_STOPPED (or before a group-stop is initiated).
    
    If a process running on another cpu posts a SIGCONT or SIGKILL just after
    the "victim" process releases the lock but before its state is set to
    TASK_STOPPED, the corresponding wakeup will be lost and the victim will
    remain stopped despite the subsequent SIGCONT or SIGKILL.  In this case,
    a repeated posting of SIGCONT or SIGKILL will have no effect, since the
    original one is already pending (and so causes a repeated posting to be
    discarded).  The occurrence of a SIGSTOP/SIGKILL race where the victim
    has blocked all other signals will result in an unkillable process.
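
    The window described above can be probed from user space.  The sketch
    below is not the test program mentioned in this message; it is a
    minimal illustration under stated assumptions.  The function name
    stop_cont_stress(), the shared heartbeat counter, and the timeout are
    all hypothetical.  A child spins incrementing a counter in shared
    memory while the parent posts SIGSTOP and SIGCONT back to back; if the
    counter stops advancing after a SIGCONT, the wakeup was presumably lost.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical stress helper (not from the patch).
 * Returns 0 if the child kept running after every SIGSTOP/SIGCONT pair,
 * -1 if it appeared stuck for roughly timeout_ms, -2 on setup failure.
 * Assumes atomic_ulong is lock-free so it can be shared across processes.
 */
int stop_cont_stress(int rounds, int timeout_ms)
{
    atomic_ulong *beat = mmap(NULL, sizeof(*beat), PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (beat == MAP_FAILED)
        return -2;
    atomic_init(beat, 0);

    pid_t parent = getpid();
    pid_t child = fork();
    if (child < 0) {
        munmap(beat, sizeof(*beat));
        return -2;
    }
    if (child == 0) {
        /* Victim: advance the heartbeat forever; exit if orphaned. */
        for (;;) {
            unsigned long n = atomic_fetch_add(beat, 1);
            if ((n & 0xfffff) == 0 && getppid() != parent)
                _exit(0);
        }
    }

    int result = 0;
    for (int i = 0; i < rounds && result == 0; i++) {
        /* Post the stop and the wakeup as close together as possible. */
        kill(child, SIGSTOP);
        kill(child, SIGCONT);

        /* After SIGCONT the heartbeat must advance again. */
        unsigned long seen = atomic_load(beat);
        int waited_ms = 0;
        while (atomic_load(beat) == seen) {
            if (waited_ms >= timeout_ms) {
                result = -1;   /* child looks stuck in a stopped state */
                break;
            }
            usleep(1000);
            waited_ms++;
        }
    }

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    munmap(beat, sizeof(*beat));
    return result;
}
```

    A caller might run stop_cont_stress(100000, 2000) on a multi-cpu
    machine and treat a return value of -1 as a suspected lost wakeup.  On
    a kernel without the race, the call should return 0.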
    
    Although a fabricated test program can reproduce a SIGSTOP/SIGCONT race
    hang in less than a minute (on a 2-cpu Dell Precision 450), the scenario
    that has been most frequently encountered is a hang during reboot or
    shutdown.  This occurs because /sbin/killall5 brackets its scan of
    /proc/* (and the associated signal posting to most of the processes
    still running) with kill(-1, SIGSTOP) and kill(-1, SIGCONT) calls, which
    temporarily freeze every process except "init".  Occasionally, its parent (running
    the /etc/rc6.d/S01reboot shell script) gets stuck in TASK_STOPPED state
    with pending SIGCONT and SIGCLD signals, but with no other process left
    to wake it up.
    
    To close the hole, the locking in do_signal_stop() and
    get_signal_to_deliver() needed reworking.  Due
    to lock ordering issues between the "sighand" lock and tasklist_lock,
    there are two cases where the former lock needs to be released and
    then reacquired, thus allowing a tiny hole for a SIGCONT/SIGKILL to
    be posted.  These two cases are resolved by rechecking for a pending
    SIGCONT/SIGKILL after the locks are (re)acquired in the proper order.
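
    The recheck described above follows a general pattern, sketched here
    as a user-space analogy with pthread mutexes rather than as the kernel
    code itself.  Every name below (struct task_model, post_wakeup(),
    try_stop(), the in_window hook) is hypothetical; the two mutexes only
    stand in for tasklist_lock and the "sighand" lock.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a task; not a kernel structure. */
struct task_model {
    pthread_mutex_t list_lock;     /* outer lock, stands in for tasklist_lock */
    pthread_mutex_t sighand_lock;  /* inner lock, stands in for "sighand" */
    bool wake_pending;             /* a SIGCONT/SIGKILL has been posted */
    bool stopped;                  /* models TASK_STOPPED */
};

/* Hook type used only to simulate activity on another cpu. */
typedef void (*window_hook)(struct task_model *);

/* Poster side: record the wakeup and wake the task, under the inner lock. */
void post_wakeup(struct task_model *t)
{
    pthread_mutex_lock(&t->sighand_lock);
    t->wake_pending = true;
    t->stopped = false;
    pthread_mutex_unlock(&t->sighand_lock);
}

/*
 * Victim side.  Called with sighand_lock held and returns with it held.
 * The assumed lock order is list_lock first, then sighand_lock, so the
 * inner lock has to be dropped before the outer one can be taken.
 * Returns true if the task entered the stopped state.
 */
bool try_stop(struct task_model *t, window_hook in_window)
{
    pthread_mutex_unlock(&t->sighand_lock);

    /* The tiny hole: a wakeup may be posted here by another cpu. */
    if (in_window)
        in_window(t);

    pthread_mutex_lock(&t->list_lock);
    pthread_mutex_lock(&t->sighand_lock);

    /* Recheck after reacquiring both locks in the proper order. */
    bool stop = !t->wake_pending;
    if (stop)
        t->stopped = true;

    pthread_mutex_unlock(&t->list_lock);
    return stop;
}
```

    Without the recheck, try_stop() would set stopped unconditionally and
    a wakeup posted inside the hole would be lost.  With it, the stop is
    abandoned whenever a wakeup slipped in while the inner lock was
    released.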
    
    Anyone wanting a copy of the test program may e-mail me off-list.