kernel/signal.c · 3adaf93e4df47a4fa51deffdae1b25cf5590ed04 · Kirill Smelkov / linux

[PATCH] signal handling race condition causing reboot hangs · 3adaf93e
Andrew Morton authored Aug 18, 2003
From: Ernie Petrides <petrides@redhat.com>

(I can't get anyone to review this, but I'm sure there's a bug in there, and
Ernie's patch has been in -mm for some time).


There is a long-standing locking hole in the kernel's handling of the
signals related to stopping and resuming processes.  When a process
handles SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU, the "sighand" lock is
held while the signal is dequeued and appropriate masks are updated.
But the "sighand" lock is dropped in several cases before the task's
state is changed to TASK_STOPPED (or before a group-stop is initiated).

If a process running on another CPU posts a SIGCONT or SIGKILL just after
the "victim" process releases the lock but before its state is set to
TASK_STOPPED, the corresponding wakeup is lost and the victim remains
stopped despite the subsequent SIGCONT or SIGKILL.  In this case, posting
SIGCONT or SIGKILL again has no effect, since the original one is still
pending (and so causes the repeated posting to be discarded).  A
SIGSTOP/SIGKILL race in which the victim has blocked all other signals
results in an unkillable process.

Although a fabricated test program can reproduce a SIGSTOP/SIGCONT race
hang in less than a minute (on a 2-cpu Dell Precision 450), the scenario
that has been most frequently encountered is a hang during reboot or
shutdown.  This occurs because /sbin/killall5 brackets its scan of
/proc/* and the associated signal posting to most of the processes still
running with kill(-1, SIGSTOP) and kill(-1, SIGCONT) calls to temporarily
freeze every process except for "init".  Occasionally, killall5's parent
(running the /etc/rc6.d/S01reboot shell script) gets stuck in TASK_STOPPED
state with pending SIGCONT and SIGCLD signals, but with no other process
left to wake it up.

In order to fix the race condition, the locking in do_signal_stop()
and get_signal_to_deliver() needed reworking to close the hole.  Due
to lock ordering issues between the "sighand" lock and tasklist_lock,
there are two cases where the former lock needs to be released and
then reacquired, thus allowing a tiny hole for a SIGCONT/SIGKILL to
be posted.  These two cases are resolved by rechecking for a pending
SIGCONT/SIGKILL after the locks are (re)acquired in the proper order.

Anyone wanting a copy of the test program may e-mail me off-list.