    x86/apic: Make apic_pending_intr_clear() more robust · d29c7b8b
    Thomas Gleixner authored
    [ Upstream commit cc8bf191 ]
    
    In the course of developing shorthand-based IPI support, issues were
    observed with the function which tries to clear possibly pending ISR bits
    in the local APIC.
    
      1) 0-day testing triggered the WARN_ON() in apic_pending_intr_clear().
    
         This warning is emitted when the function fails to clear pending ISR
         bits or observes pending IRR bits which are not delivered to the CPU
         after the stale ISR bit(s) are ACK'ed.
    
         Unfortunately the function only emits a WARN_ON() and fails to dump
         the IRR/ISR content. That's useless for debugging.
    
         Feng added spot on debug printk's which revealed that the stale IRR
         bit belonged to the APIC timer interrupt vector, but adding ad hoc
         debug code does not help with sporadic failures in the field.
    
         Rework the loop so the full IRR/ISR contents are saved and dumped
         on failure.
    
      2) The loop termination logic is interesting at best.
    
         If the machine has no TSC or cpu_khz is not known yet it tries 1
         million times to ack stale IRR/ISR bits. What?
    
         With TSC it uses the TSC to calculate the loop termination. It takes a
         timestamp at entry and terminates the loop when:
    
         	  (rdtsc() - start_timestamp) >= (cpu_khz << 10)

         That's roughly one second, as cpu_khz << 10 is cpu_khz * 1024 TSC
         cycles.
    
         Both methods are problematic. The APIC has 256 vectors, which means
         that in theory max. 256 IRR/ISR bits can be set. In practice this is
         impossible and the chance that more than a few bits are set is close
         to zero.
    
         With the pure loop based approach the 1 million retries are complete
         overkill.
    
         With TSC this can terminate too early in a guest which is running on a
         heavily loaded host even with only a couple of IRR/ISR bits set. The
         reason is that after acknowledging the highest priority ISR bit,
         pending IRRs must get serviced first before the next round of
         acknowledge can take place as the APIC (real and virtualized) does not
         honour EOI without a preceding interrupt on the CPU. And every APIC
         read/write takes a VMEXIT if the APIC is virtualized. While trying to
         reproduce the issue 0-day reported it was observed that the guest was
         scheduled out long enough under heavy load that it terminated after 8
         iterations.
    
         Make the loop terminate after 512 iterations. That's plenty enough
         in any case and does not take endless time to complete.
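
    The reworked loop can be illustrated with a small user space sketch. It
    is not the code in apic.c: struct sim_apic, sim_deliver(), sim_eoi() and
    the "stuck" flag are hypothetical stand-ins for the real hardware, and
    the delivery model is deliberately simplified. It only shows the two
    ideas from above: snapshot the full IRR/ISR contents on every round so
    they can be dumped on failure, and bound the loop by an iteration count
    instead of a timestamp.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define IR_REGS   8      /* 8 x 32 bit registers = 256 vectors */
#define MAX_LOOPS 512    /* iteration bound taken from the text above */

/* Hypothetical stand-in for the local APIC, not the kernel's data model */
struct sim_apic {
	uint32_t irr[IR_REGS];
	uint32_t isr[IR_REGS];
	bool stuck;          /* when set, pending IRR bits are never delivered */
};

static bool any_set(const uint32_t *regs)
{
	for (int i = 0; i < IR_REGS; i++)
		if (regs[i])
			return true;
	return false;
}

/* Return the highest set vector, or -1 if no bit is set */
static int highest(const uint32_t *regs)
{
	for (int v = IR_REGS * 32 - 1; v >= 0; v--)
		if (regs[v / 32] & (1u << (v % 32)))
			return v;
	return -1;
}

/* Simplified delivery: once the ISR is empty, move the highest IRR bit to it */
static void sim_deliver(struct sim_apic *a)
{
	int v = highest(a->irr);

	if (a->stuck || v < 0 || any_set(a->isr))
		return;
	a->irr[v / 32] &= ~(1u << (v % 32));
	a->isr[v / 32] |= 1u << (v % 32);
}

/* Simplified EOI: clear the highest in-service bit */
static void sim_eoi(struct sim_apic *a)
{
	int v = highest(a->isr);

	if (v >= 0)
		a->isr[v / 32] &= ~(1u << (v % 32));
}

/* Snapshot IRR/ISR and ack once per ISR bit; true means "run another round" */
static bool check_and_ack(struct sim_apic *a, uint32_t *irr, uint32_t *isr)
{
	sim_deliver(a);
	memcpy(irr, a->irr, sizeof(a->irr));
	memcpy(isr, a->isr, sizeof(a->isr));

	if (any_set(isr)) {
		for (int i = 0; i < IR_REGS; i++)
			for (uint32_t m = isr[i]; m; m &= m - 1)
				sim_eoi(a);
		return true;
	}
	return any_set(irr);
}

/* Returns the number of rounds used, or -1 after dumping the stale state */
static int pending_intr_clear(struct sim_apic *a)
{
	uint32_t irr[IR_REGS], isr[IR_REGS];

	for (int i = 0; i < MAX_LOOPS; i++)
		if (!check_and_ack(a, irr, isr))
			return i + 1;

	/* Dump the last snapshot, highest register first */
	fprintf(stderr, "stale IRR:");
	for (int i = IR_REGS - 1; i >= 0; i--)
		fprintf(stderr, " %08x", (unsigned int)irr[i]);
	fprintf(stderr, " ISR:");
	for (int i = IR_REGS - 1; i >= 0; i--)
		fprintf(stderr, " %08x", (unsigned int)isr[i]);
	fprintf(stderr, "\n");
	return -1;
}
```

    In this model, two stale ISR bits plus one pending IRR bit are cleaned
    up in three rounds: one to ack both ISR bits, one to ack the IRR bit
    after it has been delivered, and one to observe that everything is
    clear. With "stuck" set the loop gives up after 512 rounds and prints
    the saved IRR/ISR snapshot instead of a bare warning.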

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20190722105219.158847694@linutronix.de
    Signed-off-by: Sasha Levin <sashal@kernel.org>