1. 02 Jul, 2019 4 commits
    • Wanpeng Li's avatar
      KVM: LAPIC: Fix pending interrupt in IRR blocked by software disable LAPIC · bb34e690
      Wanpeng Li authored
      Thomas reported that:
      
       | Background:
       |
       |    In preparation of supporting IPI shorthands I changed the CPU offline
       |    code to software disable the local APIC instead of just masking it.
       |    That's done by clearing the APIC_SPIV_APIC_ENABLED bit in the APIC_SPIV
       |    register.
       |
       | Failure:
       |
       |    When the CPU comes back online the startup code triggers occasionally
       |    the warning in apic_pending_intr_clear(). That complains that the IRRs
       |    are not empty.
       |
       |    The offending vector is the local APIC timer vector who's IRR bit is set
       |    and stays set.
       |
       | It took me quite some time to reproduce the issue locally, but now I can
       | see what happens.
       |
       | It requires apicv_enabled=0, i.e. full apic emulation. With apicv_enabled=1
       | (and hardware support) it behaves correctly.
       |
       | Here is the series of events:
       |
       |     Guest CPU
       |
       |     goes down
       |
       |       native_cpu_disable()
       |
       | 			apic_soft_disable();
       |
       |     play_dead()
       |
       |     ....
       |
       |     startup()
       |
       |       if (apic_enabled())
       |         apic_pending_intr_clear()	<- Not taken
       |
       |      enable APIC
       |
       |         apic_pending_intr_clear()	<- Triggers warning because IRR is stale
       |
       | When this happens then the deadline timer or the regular APIC timer -
       | happens with both, has fired shortly before the APIC is disabled, but the
       | interrupt was not serviced because the guest CPU was in an interrupt
       | disabled region at that point.
       |
       | The state of the timer vector ISR/IRR bits:
       |
       |     	     	       	        ISR     IRR
       | before apic_soft_disable()    0	      1
       | after apic_soft_disable()     0	      1
       |
       | On startup		      		 0	      1
       |
       | Now one would assume that the IRR is cleared after the INIT reset, but this
       | happens only on CPU0.
       |
       | Why?
       |
       | Because our CPU0 hotplug is just for testing to make sure nothing breaks
       | and goes through an NMI wakeup vehicle because INIT would send it through
       | the boots-trap code which is not really working if that CPU was not
       | physically unplugged.
       |
       | Now looking at a real world APIC the situation in that case is:
       |
       |     	     	       	      	ISR     IRR
       | before apic_soft_disable()    0	      1
       | after apic_soft_disable()     0	      1
       |
       | On startup		      		 0	      0
       |
       | Why?
       |
       | Once the dying CPU reenables interrupts the pending interrupt gets
       | delivered as a spurious interupt and then the state is clear.
       |
       | While that CPU0 hotplug test case is surely an esoteric issue, the APIC
       | emulation is still wrong, Even if the play_dead() code would not enable
       | interrupts then the pending IRR bit would turn into an ISR .. interrupt
       | when the APIC is reenabled on startup.
      
      From SDM 10.4.7.2 Local APIC State After It Has Been Software Disabled
      * Pending interrupts in the IRR and ISR registers are held and require
        masking or handling by the CPU.
      
      In Thomas's testing, hardware cpu will not respect soft disable LAPIC
      when IRR has already been set or APICv posted-interrupt is in flight,
      so we can skip soft disable APIC checking when clearing IRR and set ISR,
      continue to respect soft disable APIC when attempting to set IRR.
      Reported-by: default avatarRong Chen <rong.a.chen@intel.com>
      Reported-by: default avatarFeng Tang <feng.tang@intel.com>
      Reported-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rong Chen <rong.a.chen@intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bb34e690
    • Liran Alon's avatar
      KVM: nVMX: Change KVM_STATE_NESTED_EVMCS to signal vmcs12 is copied from eVMCS · 323d73a8
      Liran Alon authored
      Currently KVM_STATE_NESTED_EVMCS is used to signal that eVMCS
      capability is enabled on vCPU.
      As indicated by vmx->nested.enlightened_vmcs_enabled.
      
      This is quite bizarre as userspace VMM should make sure to expose
      same vCPU with same CPUID values in both source and destination.
      In case vCPU is exposed with eVMCS support on CPUID, it is also
      expected to enable KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability.
      Therefore, KVM_STATE_NESTED_EVMCS is redundant.
      
      KVM_STATE_NESTED_EVMCS is currently used on restore path
      (vmx_set_nested_state()) only to enable eVMCS capability in KVM
      and to signal need_vmcs12_sync such that on next VMEntry to guest
      nested_sync_from_vmcs12() will be called to sync vmcs12 content
      into eVMCS in guest memory.
      However, because restore nested-state is rare enough, we could
      have just modified vmx_set_nested_state() to always signal
      need_vmcs12_sync.
      
      From all the above, it seems that we could have just removed
      the usage of KVM_STATE_NESTED_EVMCS. However, in order to preserve
      backwards migration compatibility, we cannot do that.
      (vmx_get_nested_state() needs to signal flag when migrating from
      new kernel to old kernel).
      
      Returning KVM_STATE_NESTED_EVMCS when just vCPU have eVMCS enabled
      have a bad side-effect of userspace VMM having to send nested-state
      from source to destination as part of migration stream. Even if
      guest have never used eVMCS as it doesn't even run a nested
      hypervisor workload. This requires destination userspace VMM and
      KVM to support setting nested-state. Which make it more difficult
      to migrate from new host to older host.
      To avoid this, change KVM_STATE_NESTED_EVMCS to signal eVMCS is
      not only enabled but also active. i.e. Guest have made some
      eVMCS active via an enlightened VMEntry. i.e. vmcs12 is copied
      from eVMCS and therefore should be restored into eVMCS resident
      in memory (by copy_vmcs12_to_enlightened()).
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: default avatarMaran Wilson <maran.wilson@oracle.com>
      Reviewed-by: default avatarKrish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      323d73a8
    • Liran Alon's avatar
      KVM: nVMX: Allow restore nested-state to enable eVMCS when vCPU in SMM · 65b712f1
      Liran Alon authored
      As comment in code specifies, SMM temporarily disables VMX so we cannot
      be in guest mode, nor can VMLAUNCH/VMRESUME be pending.
      
      However, code currently assumes that these are the only flags that can be
      set on kvm_state->flags. This is not true as KVM_STATE_NESTED_EVMCS
      can also be set on this field to signal that eVMCS should be enabled.
      
      Therefore, fix code to check for guest-mode and pending VMLAUNCH/VMRESUME
      explicitly.
      Reviewed-by: default avatarJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      65b712f1
    • Paolo Bonzini's avatar
      KVM: x86: degrade WARN to pr_warn_ratelimited · 3f16a5c3
      Paolo Bonzini authored
      This warning can be triggered easily by userspace, so it should certainly not
      cause a panic if panic_on_warn is set.
      
      Reported-by: syzbot+c03f30b4f4c46bdf8575@syzkaller.appspotmail.com
      Suggested-by: default avatarAlexander Potapenko <glider@google.com>
      Acked-by: default avatarAlexander Potapenko <glider@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3f16a5c3
  2. 30 Jun, 2019 3 commits
  3. 29 Jun, 2019 29 commits
  4. 28 Jun, 2019 4 commits