12 Jul, 2011 (40 commits)
    • KVM: nVMX: Additional TSC-offset handling · 7991825b
      Nadav Har'El authored
      In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to
      emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to
      set vmcs12.tsc_offset, for this change to survive the next nested entry (see
      prepare_vmcs02()).
      Additionally, we also need to modify vmx_adjust_tsc_offset: the semantics
      of this function are that the TSC of all guests on this vcpu, L1 and possibly
      several L2s, needs to be adjusted. To do this, we need to adjust vmcs01's
      tsc_offset (this offset will also apply to each L2 we enter). We can't set
      vmcs01 now, so we have to remember this adjustment and apply it when we
      later exit to L1.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Further fixes for lazy FPU loading · 36cf24e0
      Nadav Har'El authored
      KVM's "Lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even
      if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
      NM exceptions, even if we have a guest hypervisor (L1) who didn't want these
      traps. And of course, conversely: If L1 wanted to trap these events, we
      must let it, even if L0 is not interested in them.
      
      This patch fixes some existing KVM code (in update_exception_bitmap(),
      vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's
      and L1's needs. Note that handle_cr() was already fixed in the above patch,
      and that new code introduced in previous patches already handles CR0
      correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()).
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Handling of CR0 and CR4 modifying instructions · eeadf9e7
      Nadav Har'El authored
      When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit
      which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right
      thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a
      previous patch).
      When L2 modifies bits that L1 doesn't care about, we let it think (via
      CR[04]_READ_SHADOW) that it did these modifications, while only changing
      (in GUEST_CR[04]) the bits that L0 doesn't shadow.
      
      This is needed for correct handling of CR0.TS for lazy FPU loading: L0 may
      want to leave TS on, while pretending to allow the guest to change it.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Correct handling of idt vectoring info · 66c78ae4
      Nadav Har'El authored
      This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
      case.
      
      When a guest exits while delivering an interrupt or exception, we get this
      information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
      there's nothing we need to do, because L1 will see this field in vmcs12, and
      handle it itself. However, when L2 exits and L0 handles the exit itself and
      plans to return to L2, L0 must inject this event to L2.
      
      In the normal non-nested case, the idt_vectoring_info is discovered after
      the exit, and the decision to inject (though not the injection itself) is made
      at that point. However, in the nested case a decision of whether to return
      to L2 or L1 also happens during the injection phase (see the previous
      patches), so in the nested case we can only decide what to do about the
      idt_vectoring_info right after the injection, i.e., in the beginning of
      vmx_vcpu_run, which is the first time we know for sure if we're staying in
      L2.
      
      Therefore, when we exit L2 (is_guest_mode(vcpu)), we disable the regular
      vmx_complete_interrupts() code which queues the idt_vectoring_info for
      injection on next entry - because such injection would not be appropriate
      if we will decide to exit to L1. Rather, we just save the idt_vectoring_info
      and related fields in vmcs12 (which is a convenient place to save these
      fields). On the next entry in vmx_vcpu_run (*after* the injection phase,
      potentially exiting to L1 to inject an event requested by user space), if
      we find ourselves in L1 we don't need to do anything with those values
      we saved (as explained above). But if we find that we're in L2, or rather
      *still* at L2 (it's not nested_run_pending, meaning that this is the first
      round of L2 running after L1 having just launched it), we need to inject
      the event saved in those fields - by writing the appropriate VMCS fields.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Correct handling of exception injection · 0b6ac343
      Nadav Har'El authored
      Similar to the previous patch, but concerning injection of exceptions rather
      than external interrupts.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Correct handling of interrupt injection · b6f1250e
      Nadav Har'El authored
      The code in this patch correctly emulates external-interrupt injection
      while a nested guest L2 is running.
      
      Because of this code's relative un-obviousness, I include here a longer-than-
      usual justification for what it does - much longer than the code itself ;-)
      
      To understand how to correctly emulate interrupt injection while L2 is
      running, let's first look at what we need to emulate: how would things look
      if the extra L0 hypervisor layer were removed and, instead of L0 injecting
      an interrupt, we had hardware delivering one?
      
      Now we have L1 running on bare metal with a guest L2, and the hardware
      generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to 1, and
      VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below), what
      happens now is this: The processor exits from L2 to L1, with an external-
      interrupt exit reason but without an interrupt vector. L1 runs, with
      interrupts disabled, and it doesn't yet know what the interrupt was. Soon
      after, it enables interrupts and only at that moment, it gets the interrupt
      from the processor. When L1 is KVM, Linux handles this interrupt.
      
      Now we need exactly the same thing to happen when that L1->L2 system runs
      on top of L0, instead of real hardware. This is how we do this:
      
      When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
      external-interrupt exit reason (with an invalid interrupt vector), and run L1.
      Just like in the bare metal case, it likely can't deliver the interrupt to
      L1 now because L1 is running with interrupts disabled, in which case it turns
      on the interrupt window when running L1 after the exit. L1 will soon enable
      interrupts, and at that point L0 will gain control again and inject the
      interrupt to L1.
      
      Finally, there is an extra complication in the code: when nested_run_pending,
      we cannot return to L1 now, and must launch L2. We need to remember the
      interrupt we wanted to inject (and not clear it now), and do it on the
      next exit.
      
      The above explanation shows that the relative strangeness of the nested
      interrupt injection code in this patch, and the extra interrupt-window
      exit incurred, are in fact necessary for accurate emulation, and are not
      just an unoptimized implementation.
      
      Let's revisit now the two assumptions made above:
      
      If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
      does, by the way), things are simple: L0 may inject the interrupt directly
      to the L2 guest - using the normal code path that injects to any guest.
      We support this case in the code below.
      
      If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT, things look very different from the
      description above: L1 expects to see an exit from L2 with the interrupt vector
      already filled in the exit information, and does not expect to be interrupted
      again with this interrupt. The current code does not (yet) support this case,
      so we do not allow the VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on
      by L1.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit · 644d711a
      Nadav Har'El authored
      This patch contains the logic of whether an L2 exit should be handled by L0
      and then L2 should be resumed, or whether L1 should be run to handle this
      exit (using the nested_vmx_vmexit() function of the previous patch).
      
      The basic idea is to let L1 handle the exit only if it actually asked to
      trap this sort of event. For example, when L2 exits on a change to CR0,
      we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
      bit which changed; if it did, we exit to L1. But if it didn't, it means that
      it is we (L0) that wished to trap this event, so we handle it ourselves.
      
      The next two patches add additional logic of what to do when an interrupt or
      exception is injected: Does L0 need to do it, should we exit to L1 to do it,
      or should we resume L2 and keep the exception to be injected later.
      
      We keep a new flag, "nested_run_pending", which can override the decision of
      which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
      L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
      and therefore expects L2 to be run (and perhaps be injected with an event it
      specified, etc.). The nested_run_pending flag is especially intended to avoid switching
      to L1 in the injection decision-point described above.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: vmcs12 checks on nested entry · 7c177938
      Nadav Har'El authored
      This patch adds a bunch of tests of the validity of the vmcs12 fields,
      according to what the VMX spec and our implementation allows. If fields
      we cannot (or don't want to) honor are discovered, an entry failure is
      emulated.
      
      According to the spec, there are two types of entry failures: If the problem
      was in vmcs12's host state or control fields, the VMLAUNCH instruction simply
      fails. But if the problem is found in the guest state, the behavior is more
      similar to that of an exit.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Exiting from L2 to L1 · 4704d0be
      Nadav Har'El authored
      This patch implements nested_vmx_vmexit(), called when the nested L2 guest
      exits and we want to run its L1 parent and let it handle this exit.
      
      Note that this will not necessarily be called on every L2 exit. L0 may decide
      to handle a particular exit on its own, without L1's involvement; In that
      case, L0 will handle the exit, and resume running L2, without running L1 and
      without calling nested_vmx_vmexit(). The logic for deciding whether to handle
      a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
      will appear in a separate patch below.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: No need for handle_vmx_insn function any more · 99e65e80
      Nadav Har'El authored
      Before nested VMX support, the exit handler for a guest executing a VMX
      instruction (vmclear, vmlaunch, vmptrld, vmptrst, vmread, vmresume,
      vmwrite, vmon, vmoff) was handle_vmx_insn(). This handler simply threw a #UD
      exception. Now that all these exit reasons are properly handled (and emulate
      the respective VMX instruction), nothing calls this dummy handler and it can
      be removed.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Implement VMLAUNCH and VMRESUME · cd232ad0
      Nadav Har'El authored
      Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
      hypervisor to run its own guests.
      
      This patch does not include some of the necessary validity checks on
      vmcs12 fields before the entry. These will appear in a separate patch
      below.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12 · fe3ef05c
      Nadav Har'El authored
      This patch contains code to prepare the VMCS which can be used to actually
      run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
      in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our
      own guests).
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Move control field setup to functions · bf8179a0
      Nadav Har'El authored
      Move some of the control field setup to common functions. These functions will
      also be needed for running L2 guests - L0's desires (expressed in these
      functions) will be appropriately merged with L1's desires.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Move host-state field setup to a function · a3a8ff8e
      Nadav Har'El authored
      Move the setting of constant host-state fields (fields that do not change
      throughout the life of the guest) from vmx_vcpu_setup to a new common function
      vmx_set_constant_host_state(). This function will also be used to set the
      host state when running L2 guests.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Implement VMREAD and VMWRITE · 49f705c5
      Nadav Har'El authored
      Implement the VMREAD and VMWRITE instructions. With these instructions, L1
      can read and write to the VMCS it is holding. The values are read or written
      to the fields of the vmcs12 structure introduced in a previous patch.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Implement VMPTRST · 6a4d7550
      Nadav Har'El authored
      This patch implements the VMPTRST instruction.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Implement VMPTRLD · 63846663
      Nadav Har'El authored
      This patch implements the VMPTRLD instruction.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Implement VMCLEAR · 27d6c865
      Nadav Har'El authored
      This patch implements the VMCLEAR instruction.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Success/failure of VMX instructions. · 0140caea
      Nadav Har'El authored
      VMX instructions specify success or failure by setting certain RFLAGS bits.
      This patch contains common functions to do this, and they will be used in
      the following patches which emulate the various VMX instructions.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Add VMCS fields to the vmcs12 · 22bd0358
      Nadav Har'El authored
      In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
      standard VMCS fields.
      
      Later patches will enable L1 to read and write these fields using VMREAD/
      VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02,
      a hardware VMCS for running L2.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Introduce vmcs02: VMCS used to run L2 · ff2f6fe9
      Nadav Har'El authored
      We saw in a previous patch that L1 controls its L2 guest with a vmcs12.
      L0 needs to create a real VMCS for running L2. We call that "vmcs02".
      A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
      fields. This patch only contains code for allocating vmcs02.
      
      In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
      enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can
      be reused even when L1 runs multiple L2 guests. However, in future versions
      we'll probably want to add an optimization where vmcs02 fields that rarely
      change will not be set each time. For that, we may want to keep around several
      vmcs02s of L2 guests that have recently run, so that potentially we could run
      these L2s again more quickly because fewer vmwrites to vmcs02 will be needed.
      
      This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
      which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
      As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
      i.e., one vmcs02 is allocated (and loaded onto the processor), and it is
      reused to enter any L2 guest. In the future, when prepare_vmcs02() is
      optimized not to set all fields every time, VMCS02_POOL_SIZE should be
      increased.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Decoding memory operands of VMX instructions · 064aea77
      Nadav Har'El authored
      This patch includes a utility function for decoding pointer operands of VMX
      instructions issued by L1 (a guest hypervisor).
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Implement reading and writing of VMX MSRs · b87a51ae
      Nadav Har'El authored
      When the guest can use VMX instructions (when the "nested" module option is
      on), it should also be able to read and write VMX MSRs, e.g., to query about
      VMX capabilities. This patch adds this support.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Introduce vmcs12: a VMCS structure for L1 · a9d30f33
      Nadav Har'El authored
      An implementation of VMX needs to define a VMCS structure. This structure
      is kept in guest memory, but is opaque to the guest (who can only read or
      write it with VMX instructions).
      
      This patch starts to define the VMCS structure which our nested VMX
      implementation will present to L1. We call it "vmcs12", as it is the VMCS
      that L1 keeps for its L2 guest. We will add more content to this structure
      in later patches.
      
      This patch also adds the notion (as required by the VMX spec) of L1's "current
      VMCS", and finally includes utility functions for mapping the guest-allocated
      VMCSs in host memory.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Allow setting the VMXE bit in CR4 · 5e1746d6
      Nadav Har'El authored
      This patch allows the guest to enable the VMXE bit in CR4, which is a
      prerequisite to running VMXON.
      
      Whether to allow setting the VMXE bit now depends on the architecture (svm
      or vmx), so the check has moved into kvm_x86_ops->set_cr4(). This function
      now returns an int: If kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
      will also return 1, and this will cause kvm_set_cr4() to throw a #GP.
      
      Turning on the VMXE bit is allowed only when the nested VMX feature is
      enabled, and turning it off is forbidden after a vmxon.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Implement VMXON and VMXOFF · ec378aee
      Nadav Har'El authored
      This patch allows a guest to use the VMXON and VMXOFF instructions, and
      emulates them accordingly. Basically this amounts to checking some
      prerequisites, and then remembering whether the guest has enabled or disabled
      VMX operation.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: nVMX: Add "nested" module option to kvm_intel · 801d3424
      Nadav Har'El authored
      This patch adds to kvm_intel a module option "nested". This option controls
      whether the guest can use VMX instructions, i.e., whether we allow nested
      virtualization. A similar, but separate, option already exists for the
      SVM module.
      
      This option currently defaults to 0, meaning that nested VMX must be
      explicitly enabled by giving nested=1. When nested VMX matures, the default
      should probably be changed to enable nested VMX by default - just like
      nested SVM is currently enabled by default.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: x86 emulator: Avoid clearing the whole decode_cache · b5c9ff73
      Takuya Yoshikawa authored
      During tracing the emulator, we noticed that init_emulate_ctxt()
      sometimes took a bit longer than we expected.
      
      This patch is for mitigating the problem by some degree.
      
      By looking into the function, we soon notice that it clears the whole
      decode_cache whose size is about 2.5K bytes now.  Furthermore, most of
      the bytes are taken for the two read_cache arrays, which are used only
      by a few instructions.
      
      Considering the fact that we are not assuming the cache arrays have
      been cleared when we store actual data, we do not need to clear the
      arrays: 2K bytes elimination.  In addition, we can avoid clearing the
      fetch_cache and regs arrays.
      
      This patch changes the initialization not to clear the arrays.
      
      On our 64-bit host, init_emulate_ctxt() becomes 0.3 to 0.5us faster with
      this patch applied.
      Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: x86 emulator: Clean up init_emulate_ctxt() · adf52235
      Takuya Yoshikawa authored
      Use a local pointer to the emulate_ctxt for simplicity.  Then, arrange
      the hard-to-read mode selection lines neatly.
      Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Clean up error handling during VCPU creation · d780592b
      Jan Kiszka authored
      So far kvm_arch_vcpu_setup is responsible for freeing the vcpu struct if
      it fails. Move this confusing responsibility back into the hands of
      kvm_vm_ioctl_create_vcpu. Only kvm_arch_vcpu_setup of x86 is affected,
      all other archs cannot fail.
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: VMX: Keep list of loaded VMCSs, instead of vcpus · d462b819
      Nadav Har'El authored
      In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
      because (at least in theory) the processor might not have written all of its
      content back to memory. Since a patch from June 26, 2008, this is done using
      a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.
      
      The problem is that with nested VMX, we no longer have the concept of a
      vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
      L2s), and each of those may have been last loaded on a different cpu.
      
      So instead of linking the vcpus, we link the VMCSs, using a new structure
      loaded_vmcs. This structure contains the VMCS, and the information pertaining
      to its loading on a specific cpu (namely, the cpu number, and whether it
      was already launched on this cpu once). In nested VMX we will also use the same
      structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the
      currently active VMCS.
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Acked-by: Kevin Tian <kevin.tian@intel.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Sanitize cpuid · 24c82e57
      Avi Kivity authored
      Instead of blacklisting known-unsupported cpuid leaves, whitelist known-
      supported leaves.  This is more conservative and prevents us from reporting
      features we don't support.  Also whitelist a few more leaves while at it.
      Signed-off-by: Avi Kivity <avi@redhat.com>
      Acked-by: Joerg Roedel <joerg.roedel@amd.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: MMU: cleanup for dropping parent pte · bcdd9a93
      Xiao Guangrong authored
      Introduce drop_parent_pte to remove the rmap of parent pte and
      clear parent pte
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: MMU: cleanup for kvm_mmu_page_unlink_children · 38e3b2b2
      Xiao Guangrong authored
      Cleanup the same operation between kvm_mmu_page_unlink_children and
      mmu_pte_write_zap_pte
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: MMU: remove the arithmetic of parent pte rmap · 67052b35
      Xiao Guangrong authored
      Parent pte rmap and page rmap are very similar, so use the same arithmetic
      for them
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: MMU: abstract the operation of rmap · 53c07b18
      Xiao Guangrong authored
      Abstract the operation of rmap to spte_list, then we can use it for the
      reverse mapping of parent pte in the later patch
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: fix uninitialized warning · 1249b96e
      Xiao Guangrong authored
      Fix:
      
       warning: ‘cs_sel’ may be used uninitialized in this function
       warning: ‘ss_sel’ may be used uninitialized in this function
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: use __copy_to_user/__clear_user to write guest page · 8b0cedff
      Xiao Guangrong authored
      Simply use __copy_to_user/__clear_user to write guest page since we have
      already verified the user address when the memslot is set
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: MMU: optimize pte write path if don't have protected sp · 332b207d
      Xiao Guangrong authored
      Simply return from kvm_mmu_pte_write path if no shadow page is
      write-protected, then we can avoid walking all shadow pages and holding
      mmu-lock
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: VMX: always_inline VMREADs · 96304217
      Avi Kivity authored
      vmcs_readl() and friends are really short, but gcc thinks they are long because of
      the out-of-line exception handlers.  Mark them always_inline to clear the
      misunderstanding.
      Signed-off-by: Avi Kivity <avi@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>