    x86: Add workaround to NMI iret woes · 3f3c8b8c
    Steven Rostedt authored
    In x86, when an NMI goes off, the CPU goes into an NMI context that
    prevents other NMIs from triggering on that CPU. If an NMI is supposed
    to trigger, it has to wait until the previous NMI leaves NMI context.
    At that time, the next NMI can trigger (note, only one more NMI will
    trigger, as only one can be latched at a time).
    
    The way x86 gets out of NMI context is by executing iret. This causes
    problems if the NMI handler triggers either an exception or a
    breakpoint. Both the exception and the breakpoint handlers finish
    with an iret. If this happens while in NMI context, the CPU will leave
    NMI context and a new NMI may come in. As NMI handlers are not made to
    be re-entrant, this can cause havoc with the system. Worse, the nested
    NMI will write all over the previous NMI's stack.
    
    Linus Torvalds proposed the following workaround to this problem:
    
    https://lkml.org/lkml/2010/7/14/264
    
    "In fact, I wonder if we couldn't just do a software NMI disable
    instead? Have a per-cpu variable (in the _core_ percpu areas that get
    allocated statically) that points to the NMI stack frame, and just
    make the NMI code itself do something like
    
     NMI entry:
     - load percpu NMI stack frame pointer
     - if non-zero we know we're nested, and should ignore this NMI:
        - we're returning to kernel mode, so return immediately by using
    "popf/ret", which also keeps NMI's disabled in the hardware until the
    "real" NMI iret happens.
        - before the popf/iret, use the NMI stack pointer to make the NMI
    return stack be invalid and cause a fault
      - set the NMI stack pointer to the current stack pointer
    
     NMI exit (not the above "immediate exit because we nested"):
       clear the percpu NMI stack pointer
       Just do the iret.
    
    Now, the thing is, now the "iret" is atomic. If we had a nested NMI,
    we'll take a fault, and that re-does our "delayed" NMI - and NMI's
    will stay masked.
    
    And if we didn't have a nested NMI, that iret will now unmask NMI's,
    and everything is happy."
    
    I first tried to follow this advice but as I started implementing this
    code, a few gotchas showed up.
    
    One, is accessing per-cpu variables in the NMI handler.
    
    The problem is that per-cpu variables use the %gs register to get the
    variable for the given CPU. But as the NMI may happen in userspace,
    we must first perform a SWAPGS to get to it. The NMI handler already
    does this later in the code, but by then it's too late: we have saved
    off all the registers, and we don't want to do that for a disabled NMI.
    
    Peter Zijlstra suggested keeping all variables on the stack. This
    simplifies things greatly and has the added benefit of cache locality.
    
    Two, faulting on the iret.
    
    I really wanted to make this work, but it was becoming very hacky, and
    I never got it to be stable. The iret already had a fault handler for
    userspace faulting with bad segment registers, and getting NMI to trigger
    a fault and detect it was very tricky. But for strange reasons, the system
    would usually take a double fault and crash. I never figured out why
    and decided to go with a simple "jmp" approach. The new approach I took
    also simplified things.
    
    Finally, the last problem with Linus's approach was to have the nested
    NMI handler do a ret instead of an iret to give the first NMI NMI-context
    again.
    
    The problem is that ret is much more limited than an iret. I couldn't figure
    out how to get the stack back where it belonged. I could have copied the
    current stack, pushed the return onto it, but my fear here is that there
    may be some place that writes data below the stack pointer. I know that
    is not something code should depend on, but I don't want to chance it.
    I may add this feature later, but for now, an NMI handler that loses NMI
    context will not get it back.
    
    Here's what is done:
    
    When an NMI comes in, the HW pushes the interrupt stack frame onto the
    per cpu NMI stack that is selected by the IST.
    
    A special location on the NMI stack holds a variable that is set when
    the first NMI handler runs. If this variable is set then we know that
    this is a nested NMI and we process the nested NMI code.
    
    There is still a race when this variable is cleared and an NMI comes
    in just before the first NMI does the return. For this case, if the
    variable is cleared, we also check if the interrupted stack is the
    NMI stack. If it is, then we process the nested NMI code.
    
    Why the two tests and not just test the interrupted stack?
    
    Suppose the first NMI hits a breakpoint and loses NMI context, then
    hits another breakpoint, and a nested NMI arrives while that second
    breakpoint is being processed. When processing a breakpoint, the stack
    changes to the breakpoint stack, so if another NMI comes in here we
    can't rely on the interrupted stack being the NMI stack.
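
    The two tests can be sketched in C as follows. This is only an
    illustration of the decision, not the assembly in this patch: the
    function name, the NMI_STACK_SIZE value and the way the stack bounds
    are passed in are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical size of the per-CPU NMI stack, chosen for the example. */
#define NMI_STACK_SIZE 4096u

/*
 * Decide whether an incoming NMI interrupted another NMI.
 *   nmi_executing   - the "NMI executing" variable kept on the NMI stack
 *   interrupted_rsp - stack pointer recorded in the hardware frame
 *   nmi_stack_top   - highest address of this CPU's NMI stack
 */
static bool nmi_is_nested(uint64_t nmi_executing,
                          uint64_t interrupted_rsp,
                          uint64_t nmi_stack_top)
{
    assert(nmi_stack_top >= NMI_STACK_SIZE);

    /* Test 1: the first NMI set the variable and has not cleared it yet. */
    if (nmi_executing != 0)
        return true;

    /*
     * Test 2: the variable is clear, but the interrupted code was still
     * running on the NMI stack. This covers the window between clearing
     * the variable and the final return of the first NMI.
     */
    uint64_t nmi_stack_bottom = nmi_stack_top - NMI_STACK_SIZE;
    return interrupted_rsp > nmi_stack_bottom &&
           interrupted_rsp <= nmi_stack_top;
}
```

    Test 1 alone catches the breakpoint case (the interrupted stack is
    the breakpoint stack, but the variable is still set); test 2 alone
    catches the race after the variable has been cleared.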
    
    If the variable is not set and the interrupted task's stack is not the
    NMI stack, then we know this is the first NMI and we can process things
    normally. But in order to do so, we need to do a few things first.
    
    1) Set the stack variable that tells us that we are in an NMI handler
    
    2) Make two copies of the interrupt stack frame.
       One copy is used to return on iret
       The other is used to restore the first one if we have a nested NMI.
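
    A minimal C sketch of these two steps, assuming a simplified model
    in which the frames live in an ordinary struct. The type and function
    names are hypothetical; the real work is done in assembly directly
    on the NMI stack.

```c
#include <assert.h>
#include <stdint.h>

/* The five words the hardware pushes on interrupt entry in 64-bit mode. */
struct iret_frame {
    uint64_t rip, cs, rflags, rsp, ss;
};

/* Hypothetical container for the state the first NMI sets up. */
struct nmi_state {
    struct iret_frame copied;  /* consumed by iret; nested NMIs may rewrite it */
    struct iret_frame saved;   /* pristine backup used to restore `copied` */
    uint64_t nmi_executing;    /* non-zero while the first NMI handler runs */
};

static void first_nmi_setup(struct nmi_state *st, const struct iret_frame *hw)
{
    assert(st != 0 && hw != 0);

    /* 1) Mark that an NMI handler is now running. */
    st->nmi_executing = 1;

    /* 2) Make two copies of the hardware frame. */
    st->saved = *hw;
    st->copied = *hw;
}
```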
    
    This is what the stack will look like:
    
    	  +-------------------------+
    	  | original SS             |
    	  | original Return RSP     |
    	  | original RFLAGS         |
    	  | original CS             |
    	  | original RIP            |
    	  +-------------------------+
    	  | temp storage for rdx    |
    	  +-------------------------+
    	  | NMI executing variable  |
    	  +-------------------------+
    	  | Saved SS                |
    	  | Saved Return RSP        |
    	  | Saved RFLAGS            |
    	  | Saved CS                |
    	  | Saved RIP               |
    	  +-------------------------+
    	  | copied SS               |
    	  | copied Return RSP       |
    	  | copied RFLAGS           |
    	  | copied CS               |
    	  | copied RIP              |
    	  +-------------------------+
    	  | pt_regs                 |
    	  +-------------------------+
    
    The original stack frame contains what the HW put in when we entered
    the NMI.
    
    We store %rdx as a temp variable to use. Both the original HW stack
    frame and this %rdx storage will be clobbered by nested NMIs, so we
    cannot rely on them later in the first NMI handler.
    
    The next item is the special stack variable that is set when we execute
    the rest of the NMI handler.
    
    Then we have two copies of the interrupt stack. The second copy is
    modified by any nested NMIs to let the first NMI know that we triggered
    a second NMI (latched) and that we should repeat the NMI handler.
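
    For readers who prefer offsets to pictures, the layout can be
    written as a C struct. This is a sketch only: the struct and helper
    names are hypothetical, pt_regs is left out, and fields are listed
    lowest address first, so the diagram reads bottom-to-top here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hardware interrupt frame, lowest address first (RIP is pushed last). */
struct iret_frame {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

/* Top of the NMI stack; pt_regs would sit just below `copied`. */
struct nmi_stack_top {
    struct iret_frame copied;    /* returned through; nested NMIs modify it */
    struct iret_frame saved;     /* backup used to restore `copied` */
    uint64_t nmi_executing;      /* the special stack variable */
    uint64_t rdx_temp;           /* scratch slot for %rdx */
    struct iret_frame original;  /* written by the hardware on every NMI */
};

/* Distance in bytes from the top of the stack down to a field's start. */
static size_t bytes_below_top(size_t field_offset)
{
    assert(field_offset <= sizeof(struct nmi_stack_top));
    return sizeof(struct nmi_stack_top) - field_offset;
}
```

    Because a nested NMI's hardware frame always lands in `original`,
    only the slots below it survive a nested NMI.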
    
    If the first NMI hits an exception or breakpoint that takes it out of
    NMI context, and a second NMI comes in before the first one finishes,
    the second NMI will update the copied interrupt stack to point to a
    fix up location that triggers another run of the NMI handler.
    
    When the first NMI calls iret, it will instead jump to the fix up
    location. This fix up location will copy the saved interrupt stack back
    to the copy and execute the nmi handler again.
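
    The repeat mechanism can be modelled in plain C as a small
    simulation. Everything here is an assumption for illustration: the
    FIXUP_RIP marker value, the names, and the simplification that a
    nested NMI rewrites only the RIP of the copied frame.

```c
#include <assert.h>
#include <stdint.h>

struct iret_frame {
    uint64_t rip, cs, rflags, rsp, ss;
};

/* Hypothetical marker standing in for the address of the fix-up code. */
#define FIXUP_RIP UINT64_C(0xffffffff00001000)

struct nmi_state {
    struct iret_frame copied;  /* frame the first NMI returns through */
    struct iret_frame saved;   /* pristine backup of the hardware frame */
    int handler_runs;          /* how many times the handler body ran */
};

/* What a nested NMI does: redirect the first NMI's return to the fix-up. */
static void nested_nmi_latch(struct nmi_state *st)
{
    st->copied.rip = FIXUP_RIP;
}

/*
 * Run the first NMI. nested_during_run[i] says whether a nested NMI
 * arrives during pass i of the handler. Returns the RIP finally
 * returned to.
 */
static uint64_t first_nmi(struct nmi_state *st,
                          const int *nested_during_run, int n)
{
    assert(st != 0 && nested_during_run != 0 && n > 0);

    for (;;) {
        int pass = st->handler_runs++;          /* handler body runs */

        if (pass < n && nested_during_run[pass])
            nested_nmi_latch(st);               /* nested NMI came in */

        if (st->copied.rip != FIXUP_RIP)
            return st->copied.rip;              /* normal return */

        /* Fix-up: restore the copy from the saved frame and repeat. */
        st->copied = st->saved;
    }
}
```

    With no nested NMI the handler runs once; each pass that sees a
    nested NMI causes exactly one more pass before the real return.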
    
    Note, the nested NMI knows enough to check if it preempted a previous
    NMI handler while it is in the fixup location. If it has, it will not
    modify the copied interrupt stack and will just leave as if nothing
    happened. As the NMI handler is about to execute again, there's no reason
    to latch now.
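
    One way to picture this check is as a range test on the interrupted
    RIP. The sketch below assumes the fix-up code occupies a half-open
    address range; the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a nested NMI should redirect the first NMI's return
 * to the fix-up location. It should not do so when the interrupted RIP
 * already lies in [fixup_start, fixup_end): the handler is about to
 * run again anyway, and the copied frame is being rewritten there.
 */
static bool nested_nmi_should_latch(uint64_t interrupted_rip,
                                    uint64_t fixup_start,
                                    uint64_t fixup_end)
{
    assert(fixup_start < fixup_end);

    bool in_fixup = interrupted_rip >= fixup_start &&
                    interrupted_rip < fixup_end;
    return !in_fixup;
}
```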
    
    To test all this, I forced the NMI handler to call iret and take itself
    out of NMI context. I also added assembly code to write to the serial
    port to make sure that it hits the nested path as well as the fix up
    path. Everything seems to be working fine.
    
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: H. Peter Anvin <hpa@linux.intel.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Paul Turner <pjt@google.com>
    Cc: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
entry_64.S 41.9 KB