1. 28 May, 2020 5 commits
    • Paolo Bonzini's avatar
      KVM: SVM: always update CR3 in VMCB · 978ce583
      Paolo Bonzini authored
      svm_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as
      an optimization, but this is only correct before the nested vmentry.
      If userspace is modifying CR3 with KVM_SET_SREGS after the VM has
      already been put in guest mode, the value of CR3 will not be updated.
      Remove the optimization, which almost never triggers anyway.
      This was was added in commit 689f3bf2 ("KVM: x86: unify callbacks
      to load paging root", 2020-03-16) just to keep the two vendor-specific
      modules closer, but we'll fix VMX too.
      
      Fixes: 689f3bf2 ("KVM: x86: unify callbacks to load paging root")
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      978ce583
    • Paolo Bonzini's avatar
      KVM: nSVM: correctly inject INIT vmexits · 5b672408
      Paolo Bonzini authored
      The usual drill at this point, except there is no code to remove because this
      case was not handled at all.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5b672408
    • Paolo Bonzini's avatar
      KVM: nSVM: remove exit_required · bd279629
      Paolo Bonzini authored
      All events now inject vmexits before vmentry rather than after vmexit.  Therefore,
      exit_required is not set anymore and we can remove it.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bd279629
    • Paolo Bonzini's avatar
      KVM: nSVM: inject exceptions via svm_check_nested_events · 7c86663b
      Paolo Bonzini authored
      This allows exceptions injected by the emulator to be properly delivered
      as vmexits.  The code also becomes simpler, because we can just let all
      L0-intercepted exceptions go through the usual path.  In particular, our
      emulation of the VMX #DB exit qualification is very much simplified,
      because the vmexit injection path can use kvm_deliver_exception_payload
      to update DR6.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7c86663b
    • Paolo Bonzini's avatar
      KVM: x86: enable event window in inject_pending_event · c9d40913
      Paolo Bonzini authored
      In case an interrupt arrives after nested.check_events but before the
      call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt
      window even if the interrupt is actually going to be a vmexit.  This is
      useless rather than harmful, but it really complicates reasoning about
      SVM's handling of the VINTR intercept.  We'd like to never bother with
      the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in
      that case there is no interrupt window and we can just exit the nested
      guest whenever we want.
      
      This patch moves the opening of the interrupt window inside
      inject_pending_event.  This consolidates the check for pending
      interrupt/NMI/SMI in one place, and makes KVM's usage of immediate
      exits more consistent, extending it beyond just nested virtualization.
      
      There are two functional changes here.  They only affect corner cases,
      but overall they simplify the inject_pending_event.
      
      - re-injection of still-pending events will also use req_immediate_exit
      instead of using interrupt-window intercepts.  This should have no impact
      on performance on Intel since it simply replaces an interrupt-window
      or NMI-window exit for a preemption-timer exit.  On AMD, which has no
      equivalent of the preemption time, it may incur some overhead but an
      actual effect on performance should only be visible in pathological cases.
      
      - kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true
      if an interrupt, NMI or SMI is blocked by nested_run_pending.  This
      makes sense because entering the VM will allow it to make progress
      and deliver the event.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c9d40913
  2. 27 May, 2020 17 commits
  3. 20 May, 2020 2 commits
  4. 19 May, 2020 8 commits
    • Thomas Gleixner's avatar
      x86/kvm: Restrict ASYNC_PF to user space · 3a7c8faf
      Thomas Gleixner authored
      The async page fault injection into kernel space creates more problems than
      it solves. The host has absolutely no knowledge about the state of the
      guest if the fault happens in CPL0. The only restriction for the host is
      interrupt disabled state. If interrupts are enabled in the guest then the
      exception can hit arbitrary code. The HALT based wait in non-preemotible
      code is a hacky replacement for a proper hypercall.
      
      For the ongoing work to restrict instrumentation and make the RCU idle
      interaction well defined the required extra work for supporting async
      pagefault in CPL0 is just not justified and creates complexity for a
      dubious benefit.
      
      The CPL3 injection is well defined and does not cause any issues as it is
      more or less the same as a regular page fault from CPL3.
      Suggested-by: default avatarAndy Lutomirski <luto@kernel.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAlexandre Chartre <alexandre.chartre@oracle.com>
      Acked-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200505134059.369802541@linutronix.de
      
      3a7c8faf
    • Thomas Gleixner's avatar
      x86/kvm: Sanitize kvm_async_pf_task_wait() · 6bca69ad
      Thomas Gleixner authored
      While working on the entry consolidation I stumbled over the KVM async page
      fault handler and kvm_async_pf_task_wait() in particular. It took me a
      while to realize that the randomly sprinkled around rcu_irq_enter()/exit()
      invocations are just cargo cult programming. Several patches "fixed" RCU
      splats by curing the symptoms without noticing that the code is flawed 
      from a design perspective.
      
      The main problem is that this async injection is not based on a proper
      handshake mechanism and only respects the minimal requirement, i.e. the
      guest is not in a state where it has interrupts disabled.
      
      Aside of that the actual code is a convoluted one fits it all swiss army
      knife. It is invoked from different places with different RCU constraints:
      
        1) Host side:
      
           vcpu_enter_guest()
             kvm_x86_ops->handle_exit()
               kvm_handle_page_fault()
                 kvm_async_pf_task_wait()
      
           The invocation happens from fully preemptible context.
      
        2) Guest side:
      
           The async page fault interrupted:
      
               a) user space
      
      	 b) preemptible kernel code which is not in a RCU read side
      	    critical section
      
           	 c) non-preemtible kernel code or a RCU read side critical section
      	    or kernel code with CONFIG_PREEMPTION=n which allows not to
      	    differentiate between #2b and #2c.
      
      RCU is watching for:
      
        #1  The vCPU exited and current is definitely not the idle task
      
        #2a The #PF entry code on the guest went through enter_from_user_mode()
            which reactivates RCU
      
        #2b There is no preemptible, interrupts enabled code in the kernel
            which can run with RCU looking away. (The idle task is always
            non preemptible).
      
      I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU
      voodoo at all.
      
      In #2c RCU is eventually not watching, but as that state cannot schedule
      anyway there is no point to worry about it so it has to invoke
      rcu_irq_enter() before running that code. This can be optimized, but this
      will be done as an extra step in course of the entry code consolidation
      work.
      
      So the proper solution for this is to:
      
        - Split kvm_async_pf_task_wait() into schedule and halt based waiting
          interfaces which share the enqueueing code.
      
        - Add comments (condensed form of this changelog) to spare others the
          time waste and pain of reverse engineering all of this with the help of
          uncomprehensible changelogs and code history.
      
        - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(),
          user mode and schedulable kernel side async page faults (#1, #2a, #2b)
      
        - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel
          case (#2c).
      
          For this case also remove the rcu_irq_exit()/enter() pair around the
          halt as it is just a pointless exercise:
      
             - vCPUs can VMEXIT at any random point and can be scheduled out for
               an arbitrary amount of time by the host and this is not any
               different except that it voluntary triggers the exit via halt.
      
             - The interrupted context could have RCU watching already. So the
      	 rcu_irq_exit() before the halt is not gaining anything aside of
      	 confusing the reader. Claiming that this might prevent RCU stalls
      	 is just an illusion.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAlexandre Chartre <alexandre.chartre@oracle.com>
      Acked-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200505134059.262701431@linutronix.de
      
      6bca69ad
    • Andy Lutomirski's avatar
      x86/kvm: Handle async page faults directly through do_page_fault() · ef68017e
      Andy Lutomirski authored
      KVM overloads #PF to indicate two types of not-actually-page-fault
      events.  Right now, the KVM guest code intercepts them by modifying
      the IDT and hooking the #PF vector.  This makes the already fragile
      fault code even harder to understand, and it also pollutes call
      traces with async_page_fault and do_async_page_fault for normal page
      faults.
      
      Clean it up by moving the logic into do_page_fault() using a static
      branch.  This gets rid of the platform trap_init override mechanism
      completely.
      
      [ tglx: Fixed up 32bit, removed error code from the async functions and
        	massaged coding style ]
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAlexandre Chartre <alexandre.chartre@oracle.com>
      Acked-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200505134059.169270470@linutronix.de
      
      ef68017e
    • Thomas Gleixner's avatar
      context_tracking: Make guest_enter/exit() .noinstr ready · af1e56b7
      Thomas Gleixner authored
      Force inlining of the helpers and mark the instrumentable parts
      accordingly.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAlexandre Chartre <alexandre.chartre@oracle.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200505134341.672545766@linutronix.de
      
      af1e56b7
    • Peter Zijlstra's avatar
      lockdep: Prepare for noinstr sections · c86e9b98
      Peter Zijlstra authored
      Force inlining and prevent instrumentation of all sorts by marking the
      functions which are invoked from low level entry code with 'noinstr'.
      
      Split the irqflags tracking into two parts. One which does the heavy
      lifting while RCU is watching and the final one which can be invoked after
      RCU is turned off.
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAlexandre Chartre <alexandre.chartre@oracle.com>
      Link: https://lkml.kernel.org/r/20200505134100.484532537@linutronix.de
      
      c86e9b98
    • Thomas Gleixner's avatar
      tracing: Provide lockdep less trace_hardirqs_on/off() variants · 0995a5df
      Thomas Gleixner authored
      trace_hardirqs_on/off() is only partially safe vs. RCU idle. The tracer
      core itself is safe, but the resulting tracepoints can be utilized by
      e.g. BPF which is unsafe.
      
      Provide variants which do not contain the lockdep invocation so the lockdep
      and tracer invocations can be split at the call site and placed
      properly. This is required because lockdep needs to be aware of the state
      before switching away from RCU idle and after switching to RCU idle because
      these transitions can take locks.
      
      As these code pathes are going to be non-instrumentable the tracer can be
      invoked after RCU is turned on and before the switch to RCU idle. So for
      these new variants there is no need to invoke the rcuidle aware tracer
      functions.
      
      Name them so they match the lockdep counterparts.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAlexandre Chartre <alexandre.chartre@oracle.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200505134100.270771162@linutronix.de
      
      0995a5df
    • Thomas Gleixner's avatar
      vmlinux.lds.h: Create section for protection against instrumentation · 65538966
      Thomas Gleixner authored
      Some code pathes, especially the low level entry code, must be protected
      against instrumentation for various reasons:
      
       - Low level entry code can be a fragile beast, especially on x86.
      
       - With NO_HZ_FULL RCU state needs to be established before using it.
      
      Having a dedicated section for such code allows to validate with tooling
      that no unsafe functions are invoked.
      
      Add the .noinstr.text section and the noinstr attribute to mark
      functions. noinstr implies notrace. Kprobes will gain a section check
      later.
      
      Provide also a set of markers: instrumentation_begin()/end()
      
      These are used to mark code inside a noinstr function which calls
      into regular instrumentable text section as safe.
      
      The instrumentation markers are only active when CONFIG_DEBUG_ENTRY is
      enabled as the end marker emits a NOP to prevent the compiler from merging
      the annotation points. This means the objtool verification requires a
      kernel compiled with this option.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAlexandre Chartre <alexandre.chartre@oracle.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200505134100.075416272@linutronix.de
      65538966
    • Paolo Bonzini's avatar
      KVM: x86: only do L1TF workaround on affected processors · d43e2675
      Paolo Bonzini authored
      KVM stores the gfn in MMIO SPTEs as a caching optimization.  These are split
      in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits
      in an L1TF attack.  This works as long as there are 5 free bits between
      MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO
      access triggers a reserved-bit-set page fault.
      
      The bit positions however were computed wrongly for AMD processors that have
      encryption support.  In this case, x86_phys_bits is reduced (for example
      from 48 to 43, to account for the C bit at position 47 and four bits used
      internally to store the SEV ASID and other stuff) while x86_cache_bits in
      would remain set to 48, and _all_ bits between the reduced MAXPHYADDR
      and bit 51 are set.  Then low_phys_bits would also cover some of the
      bits that are set in the shadow_mmio_value, terribly confusing the gfn
      caching mechanism.
      
      To fix this, avoid splitting gfns as long as the processor does not have
      the L1TF bug (which includes all AMD processors).  When there is no
      splitting, low_phys_bits can be set to the reduced MAXPHYADDR removing
      the overlap.  This fixes "npt=0" operation on EPYC processors.
      
      Thanks to Maxim Levitsky for bisecting this bug.
      
      Cc: stable@vger.kernel.org
      Fixes: 52918ed5 ("KVM: SVM: Override default MMIO mask if memory encryption is enabled")
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d43e2675
  5. 15 May, 2020 8 commits