    KVM: x86/mmu: Process atomically-zapped SPTEs after TLB flush · aca48556
    David Matlack authored
    When zapping TDP MMU SPTEs under read-lock, process zapped SPTEs *after*
    flushing TLBs and after replacing the special REMOVED_SPTE with '0'.
    When zapping an SPTE that points to a page table, processing SPTEs after
    flushing TLBs minimizes contention on the child SPTEs (e.g. vCPUs won't
    hit write-protection faults via stale, read-only child SPTEs), and
    processing after replacing REMOVED_SPTE with '0' minimizes the amount of
    time vCPUs will be blocked by the REMOVED_SPTE.
    
    Processing SPTEs after setting the SPTE to '0', i.e. in parallel with the
    SPTE potentially being replaced with a new SPTE, is safe because KVM does
    not depend on completing the processing before a new SPTE is installed, and
    the processing is done on a subset of the page tables that is disconnected
    from the root, and thus unreachable by other tasks (after the TLB flush).
    KVM already relies on similar logic, as kvm_mmu_zap_all_fast() can result
    in KVM processing all SPTEs in a given root after vCPUs create mappings in
    a new root.
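
    The ordering described above can be sketched as a small user-space
    model.  This is an illustration only, not the kernel code: every
    model_* name, the trace structure, and the MODEL_REMOVED_SPTE value are
    hypothetical stand-ins, and the TLB flush and SPTE processing are
    reduced to recording that they ran.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the special "frozen" value; not KVM's encoding. */
#define MODEL_REMOVED_SPTE UINT64_C(0x5a0)

enum model_step { STEP_FREEZE, STEP_FLUSH, STEP_CLEAR, STEP_PROCESS };

/* Records the order in which the zap steps ran. */
struct model_trace {
    enum model_step steps[4];
    size_t n;
};

static void model_record(struct model_trace *t, enum model_step s)
{
    if (t->n < 4)
        t->steps[t->n++] = s;
}

/* Placeholder for the remote TLB flush. */
static void model_flush_tlbs(struct model_trace *t)
{
    model_record(t, STEP_FLUSH);
}

/* Placeholder for the (potentially slow) processing of the zapped SPTE. */
static void model_process_zapped(struct model_trace *t, uint64_t old_spte)
{
    (void)old_spte;
    model_record(t, STEP_PROCESS);
}

/*
 * Zap in the order the changelog describes: freeze, flush, clear, and
 * only then process the old SPTE.  Returns -1 if the entry changed
 * underneath the caller, 0 on success.
 */
static int model_zap_spte_atomic(_Atomic uint64_t *sptep, uint64_t old_spte,
                                 struct model_trace *t)
{
    uint64_t expected = old_spte;

    assert(sptep != NULL && t != NULL);

    if (!atomic_compare_exchange_strong(sptep, &expected, MODEL_REMOVED_SPTE))
        return -1;
    model_record(t, STEP_FREEZE);

    model_flush_tlbs(t);

    /* Unfreeze before the slow part so other tasks are no longer blocked. */
    atomic_store(sptep, 0);
    model_record(t, STEP_CLEAR);

    model_process_zapped(t, old_spte);
    return 0;
}

/*
 * Model of a task trying to install a new SPTE: it must back off and
 * retry while the entry holds the frozen value.
 */
static int model_try_install(_Atomic uint64_t *sptep, uint64_t new_spte)
{
    uint64_t cur = atomic_load(sptep);

    if (cur == MODEL_REMOVED_SPTE)
        return -1;  /* frozen: caller must retry */
    return atomic_compare_exchange_strong(sptep, &cur, new_spte) ? 0 : -1;
}
```

    In this model, a task calling model_try_install() fails only while the
    entry holds the frozen value, so moving the processing step after the
    clear shrinks that window to the freeze and the flush.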
    
    In VMs with a large (400+) number of vCPUs, it can take KVM multiple
    seconds to process a 1GiB region mapped with 4KiB entries, e.g. when
    disabling dirty logging in a VM backed by 1GiB HugeTLB.  During those
    seconds, if a vCPU accesses the 1GiB region being zapped, it will be
    stalled until KVM finishes processing the SPTE and replaces the
    REMOVED_SPTE with 0.
    
    Re-ordering the processing does speed up the atomic-zaps somewhat, but
    the main benefit is avoiding blocking vCPU threads.
    
    Before:
    
     $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e
     ...
     Disabling dirty logging time: 509.765146313s
    
     $ ./funclatency -m tdp_mmu_zap_spte_atomic
    
         msec                : count    distribution
             0 -> 1          : 0        |                                        |
             2 -> 3          : 0        |                                        |
             4 -> 7          : 0        |                                        |
             8 -> 15         : 0        |                                        |
            16 -> 31         : 0        |                                        |
            32 -> 63         : 0        |                                        |
            64 -> 127        : 0        |                                        |
           128 -> 255        : 8        |**                                      |
           256 -> 511        : 68       |******************                      |
           512 -> 1023       : 129      |**********************************      |
          1024 -> 2047       : 151      |****************************************|
          2048 -> 4095       : 60       |***************                         |
    
    After:
    
     $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e
     ...
     Disabling dirty logging time: 336.516838548s
    
     $ ./funclatency -m tdp_mmu_zap_spte_atomic
    
         msec                : count    distribution
             0 -> 1          : 0        |                                        |
             2 -> 3          : 0        |                                        |
             4 -> 7          : 0        |                                        |
             8 -> 15         : 0        |                                        |
            16 -> 31         : 0        |                                        |
            32 -> 63         : 0        |                                        |
            64 -> 127        : 0        |                                        |
           128 -> 255        : 12       |**                                      |
           256 -> 511        : 166      |****************************************|
           512 -> 1023       : 101      |************************                |
          1024 -> 2047       : 137      |*********************************       |
    
    Note, KVM's processing of collapsible SPTEs is still extremely slow and
    can be improved.  For example, a significant amount of time is spent
    calling kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, even
    when processing SPTEs that all map the same folio.  But avoiding blocking
    vCPUs and contending SPTEs is valuable regardless of how fast KVM can
    process collapsible SPTEs.
    
    Link: https://lore.kernel.org/all/20240320005024.3216282-1-seanjc@google.com
    
    
    Cc: Vipin Sharma <vipinsh@google.com>
    Suggested-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: David Matlack <dmatlack@google.com>
    Reviewed-by: Vipin Sharma <vipinsh@google.com>
    Link: https://lore.kernel.org/r/20240307194059.1357377-1-dmatlack@google.com
    
    
    [sean: massage changelog]
    Signed-off-by: Sean Christopherson <seanjc@google.com>
tdp_mmu.c 56.7 KB