    [PATCH] ppc64: fix hang on legacy iSeries · 5bc58b21
    Paul Mackerras authored
    Recently we have uncovered a bug in the kernel exception exit path
    which can cause iSeries machines to hang with interrupts disabled,
    typically when unloading a module.  This patch fixes the bug and
    should go in 2.6.10.  Here is the detailed explanation:
    
    There are a couple of places in the exception exit path in entry.S
    where we disable interrupts and then later re-enable them.  We
    hard-disable interrupts there, even on legacy iSeries, rather than
    soft-disabling them, because the final part of the exception exit
    path needs interrupts hard-disabled: otherwise an incoming interrupt
    could trash SRR0 and SRR1 and cause us to lose state.
    
    The intention was that each path that hard-disabled interrupts would
    hard-enable them again, either explicitly or by executing an rfid
    instruction (return from interrupt, doubleword).  However, there was
    one path where we didn't correctly hard-enable interrupts.  This meant
    we could end up calling schedule() with interrupts hard-disabled and
    then switch to the stopmachine thread (used in removing a module),
    which spins polling a variable until another cpu changes it.  Since
    local_irq_enable() etc. on legacy iSeries only soft-enable interrupts,
    we got into the stopmachine thread with interrupts hard-disabled, and
    the machine hung at that point.
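
    The failure mode described above can be sketched as a small C model.
    This is only an illustration, not kernel code: struct cpu_model and
    the model_* helpers are hypothetical names invented here, and the
    only detail taken from the hardware is that MSR.EE is the 0x8000 bit
    of the MSR.  It shows that a soft-enable leaves a hard-disabled CPU
    hard-disabled.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_EE 0x8000ULL  /* external interrupt enable bit in the MSR */

/* Hypothetical model of one CPU's interrupt state (not a kernel type). */
struct cpu_model {
    uint64_t msr;       /* hardware state: MSR_EE set means hard-enabled */
    bool soft_enabled;  /* software flag: all that a soft-enable changes */
};

/* Models local_irq_enable() on legacy iSeries: soft-enable only. */
static void model_soft_irq_enable(struct cpu_model *cpu)
{
    cpu->soft_enabled = true;  /* MSR_EE is deliberately left untouched */
}

/*
 * Returns true if, after a soft-enable, the CPU is still hard-disabled.
 * msr_at_schedule is the MSR value the exit path left behind before
 * schedule() was called.
 */
static bool model_stuck_after_soft_enable(uint64_t msr_at_schedule)
{
    struct cpu_model cpu = { .msr = msr_at_schedule, .soft_enabled = false };

    model_soft_irq_enable(&cpu);
    return cpu.soft_enabled && (cpu.msr & MSR_EE) == 0;
}
```

    In this model, an exit path that leaves MSR.EE clear makes
    model_stuck_after_soft_enable() return true, which corresponds to
    entering the stopmachine thread with interrupts hard-disabled.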
    
    This patch fixes it by making sure that when we go to re-enable
    interrupts, the MSR value we are loading up actually does have the
    MSR.EE (external interrupt enable) bit set.  Stephen Rothwell has
    verified that this actually does fix the bug on iSeries.  The bug
    also potentially exists on pSeries (and this patch fixes it), but
    there it doesn't really matter, because schedule() will enable
    interrupts (and on pSeries that means hard-enabling them), and because
    the hypervisor doesn't mind you having interrupts hard-disabled for
    extended periods on pSeries.  Note that all these comments about
    pSeries also apply to POWER5 iSeries (i5) machines.
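
    The idea behind the fix can be sketched in C as follows.  This is a
    hedged illustration and not the assembly that the patch changes in
    entry.S: model_msr_for_reenable() is a hypothetical helper, and the
    only hardware fact assumed is that MSR.EE is the 0x8000 bit of the
    MSR.

```c
#include <stdint.h>

#define MSR_EE 0x8000ULL  /* external interrupt enable bit in the MSR */

/*
 * Hypothetical helper: given the MSR value the exit path was about to
 * load when re-enabling interrupts, return a value that is guaranteed
 * to have MSR_EE set.  All other bits are passed through unchanged.
 */
static uint64_t model_msr_for_reenable(uint64_t msr_to_load)
{
    return msr_to_load | MSR_EE;
}
```

    Forcing the bit on, rather than trusting the saved value, means the
    re-enable step hard-enables interrupts regardless of how the path
    was entered.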
    
    While I was there I noticed that we were jumping to ret_from_except
    after calling do_IRQ on iSeries, rather than ret_from_except_lite,
    meaning that we will restore registers 14-31 twice, unnecessarily.  I
    changed it to jump to ret_from_except_lite instead, and Stephen
    checked that this change doesn't cause any breakage.
    
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Paul Mackerras <paulus@samba.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
entry.S 18.6 KB