• Ankit Agrawal's avatar
    KVM: arm64: Introduce new flag for non-cacheable IO memory · c034ec84
    Ankit Agrawal authored
    Currently, KVM for ARM64 maps at stage 2 memory that is considered device
    (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting
    overrides (as per the ARM architecture [1]) any device MMIO mapping
    present at stage 1, resulting in a set-up whereby a guest operating
    system cannot determine device MMIO mapping memory attributes on its
    own but it is always overridden by the KVM stage 2 default.
    
    This set-up does not allow guest operating systems to select device
    memory attributes independently from KVM stage-2 mappings
    (refer to [1], "Combining stage 1 and stage 2 memory type attributes"),
    which turns out to be an issue in that guest operating systems
    (e.g. Linux) may request to map devices MMIO regions with memory
    attributes that guarantee better performance (e.g. gathering
    attribute - that for some devices can generate larger PCIe memory
    writes TLPs) and specific operations (e.g. unaligned transactions)
    such as the NormalNC memory type.
    
    The default device stage 2 mapping was chosen in KVM for ARM64 since
    it was considered safer (i.e. it would not allow guests to trigger
    uncontained failures ultimately crashing the machine) but this
    turned out to be asynchronous (SError) defeating the purpose.
    
    Failures containability is a property of the platform and is independent
    from the memory type used for MMIO device memory mappings.
    
    Actually, DEVICE_nGnRE memory type is even more problematic than
    Normal-NC memory type in terms of faults containability in that e.g.
    aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally,
    synchronous (i.e. that would imply that the processor should issue at
    most 1 load transaction at a time - it cannot pipeline them - otherwise
    the synchronous abort semantics would break the no-speculation attribute
    attached to DEVICE_XXX memory).
    
    This means that regardless of the combined stage1+stage2 mappings a
    platform is safe if and only if device transactions cannot trigger
    uncontained failures and that in turn relies on platform capabilities
    and the device type being assigned (i.e. PCIe AER/DPC error containment
    and RAS architecture[3]); therefore the default KVM device stage 2
    memory attributes play no role in making device assignment safer
    for a given platform (if the platform design adheres to design
    guidelines outlined in [3]) and therefore can be relaxed.
    
    For all these reasons, relax the KVM stage 2 device memory attributes
    from DEVICE_nGnRE to Normal-NC.
    
    The NormalNC was chosen over a different Normal memory type default
    at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping.
    
    Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
    trigger any issue on guest device reclaim use cases either (i.e. device
    MMIO unmap followed by a device reset) at least for PCIe devices, in that
    in PCIe a device reset is architected and carried out through PCI config
    space transactions that are naturally ordered with respect to MMIO
    transactions according to the PCI ordering rules.
    
    Having Normal-NC S2 default puts guests in control (thanks to
    stage1+stage2 combined memory attributes rules [1]) of device MMIO
    regions memory mappings, according to the rules described in [1]
    and summarized here ([(S1) - stage1], [(S2) - stage 2]):
    
    S1           |  S2           | Result
    NORMAL-WB    |  NORMAL-NC    | NORMAL-NC
    NORMAL-WT    |  NORMAL-NC    | NORMAL-NC
    NORMAL-NC    |  NORMAL-NC    | NORMAL-NC
    DEVICE<attr> |  NORMAL-NC    | DEVICE<attr>
    
    It is worth noting that currently, to map devices MMIO space to user
    space in a device pass-through use case the VFIO framework applies memory
    attributes derived from pgprot_noncached() settings applied to VMAs, which
    result in device-nGnRnE memory attributes for the stage-1 VMM mappings.
    
    This means that a userspace mapping for device MMIO space carried
    out with the current VFIO framework and a guest OS mapping for the same
    MMIO space may result in a mismatched alias as described in [2].
    
    Defaulting KVM device stage-2 mappings to Normal-NC attributes does not
    change anything in this respect, in that the mismatched aliases would
    only affect (refer to [2] for a detailed explanation) ordering between
    the userspace and GuestOS mappings resulting stream of transactions
    (i.e. it does not cause loss of property for either stream of
    transactions on its own), which is harmless given that the userspace
    and GuestOS access to the device is carried out through independent
    transactions streams.
    
    A Normal-NC flag is not present today. So add a new kvm_pgtable_prot
    (KVM_PGTABLE_PROT_NORMAL_NC) flag for it, along with its
    corresponding PTE value 0x5 (0b101) determined from [1].
    
    Lastly, adapt the stage2 PTE property setter function
    (stage2_set_prot_attr) to handle the NormalNC attribute.
    
    The entire discussion leading to this patch series may be followed through
    the following links.
    Link: https://lore.kernel.org/all/20230907181459.18145-3-ankita@nvidia.com
    Link: https://lore.kernel.org/r/20231205033015.10044-1-ankita@nvidia.com
    
    [1] section D8.5.5 - DDI0487J_a_a-profile_architecture_reference_manual.pdf
    [2] section B2.8 - DDI0487J_a_a-profile_architecture_reference_manual.pdf
    [3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf
    Suggested-by: default avatarJason Gunthorpe <jgg@nvidia.com>
    Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
    Acked-by: default avatarWill Deacon <will@kernel.org>
    Reviewed-by: default avatarMarc Zyngier <maz@kernel.org>
    Signed-off-by: default avatarAnkit Agrawal <ankita@nvidia.com>
    Link: https://lore.kernel.org/r/20240224150546.368-2-ankita@nvidia.comSigned-off-by: default avatarOliver Upton <oliver.upton@linux.dev>
    c034ec84
memory.h 13.9 KB