    drm/xe: Emit SVG state on RCS during driver load on DG2 and MTL · 72ac3047
    Matt Roper authored
    When recording the default LRC, the expectation is that the hardware's
    original state settings (both register and instruction) will be written
    out to the LRC upon first context switch.  For many 3DSTATE_* state
    instructions that don't truly have "default" values, this translates to
    a simple instruction header (opcodes + dword length) being written to
    the LRC, followed by an appropriate number of blank dwords as a
    placeholder.  When userspace creates a context (which starts as a copy of the
    default LRC), they'll generally emit real 3DSTATE_* as part of their
    initialization to select the settings they desire.  If they don't emit
    one of the 3DSTATE instructions, then the zeroed dwords that remain in
    their LRC image generally translate to various state remaining disabled.
    This will either be what userspace wants or will lead to very
    reproducible and easily-debugged problems (rendering glitches, engine
    hangs).
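
    Roughly, a normally recorded slot looks like the sketch below.  This is
    illustrative C only, not the driver's code, and the header field
    positions are simplified assumptions: one real header dword carrying
    the opcode fields and dword length, followed by zeroed body dwords.

    /*
     * Illustrative sketch of a normally recorded 3DSTATE_* slot in the
     * default LRC: one real header dword, then zeroed placeholder body.
     */
    #include <stddef.h>
    #include <stdint.h>

    #define GFXPIPE_CMD_TYPE 0x3u  /* command type bits 31:29 (simplified) */

    /* Build a 3DSTATE_* header dword: opcode fields plus (num_dw - 2). */
    static uint32_t gfxpipe_header(uint32_t pipeline, uint32_t opcode,
                                   uint32_t subopcode, uint32_t num_dw)
    {
        return (GFXPIPE_CMD_TYPE << 29) | (pipeline << 27) |
               (opcode << 24) | (subopcode << 16) | (num_dw - 2);
    }

    /* Record one state-instruction slot: real header, zeroed body dwords. */
    static size_t record_state_slot(uint32_t *lrc, uint32_t header,
                                    uint32_t num_dw)
    {
        size_t i;

        lrc[0] = header;
        for (i = 1; i < num_dw; i++)
            lrc[i] = 0;  /* left for userspace's own 3DSTATE_* programming */

        return num_dw;
    }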
    
    It turns out that for a subset of the 3DSTATE instructions, specifically
    those belonging to the SVG (State Variable - Global) unit, the hardware
    records 0's not only for the instruction's "body" dwords but also for
    the instruction header dword if no specific state has been explicitly
    set before context switch.  This means that when the hardware switches to a
    context that hasn't explicitly provided an appropriate state setting,
    the hardware will just see a sequence of NOOPs in the spot reserved for
    that 3DSTATE instruction while executing the LRC, and the actual
    hardware state setting will unintentionally inherit the configuration
    used by the previously running context.  Now when userspace makes a
    mistake and forgets to emit an important state instruction they no
    longer get consistent, easily-reproducible corruption/hangs, but rather
    erratic behavior where the presence/absence of a problem depends on what
    other workloads are running on the system and what order the contexts
    are scheduled on the engine.
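
    The sketch below (again illustrative only, not driver code) shows why
    the all-zero case is worse: a dword of 0 decodes as MI_NOOP, so
    replaying such a slot programs nothing at all and the hardware keeps
    whatever setting the previous context left behind.

    /*
     * Rough model of what replaying an LRC slot amounts to for state
     * programming: a run of zero dwords is just NOOPs to the hardware.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MI_NOOP 0x00000000u  /* an all-zero dword is a no-op */

    static bool slot_programs_state(const uint32_t *slot, size_t num_dw)
    {
        size_t i;

        for (i = 0; i < num_dw; i++)
            if (slot[i] != MI_NOOP)
                return true;  /* a real instruction header is present */

        return false;  /* nothing but NOOPs: stale state carries over */
    }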
    
    A specific example of this that came up recently involved mesh shading.
    The OpenGL driver was not specifically emitting a 3DSTATE_MESH_CONTROL
    to disable mesh shading at context init, so on context switch, mesh
    shading would either be on or off depending on what the previous context
    had been doing.  Vulkan apps _were_ enabling mesh shading, so running a
    Vulkan app and then context switching to an OpenGL app resulted in mesh
    shading still unexpectedly being enabled during OpenGL operation, and
    since other mesh-related state was not properly initialized for that
    context, a GPU hang was seen.  Due to the specific ordering requirements
    (Vulkan app runs first, followed by OpenGL app), it took additional
    debug effort to track down the cause of the problem.
    
    There are various workarounds related to this behavior, with current
    implementations handled in the userspace drivers.  E.g., Wa_14019789679
    and Wa_22018402687.  However, it's been suggested that the kernel driver
    can help simplify things here by emitting zeroed SVG state with proper
    instruction headers as part of our default context creation (i.e., at
    the same point we apply LRC workarounds).  This will help ensure that
    any future cases where a userspace driver does not emit an important
    state setting will result in consistent behavior.
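
    A minimal sketch of that idea follows.  The helper and the table type
    here are hypothetical, not the exact xe_lrc.c implementation, and the
    real instruction list would come from Bspec 46261: for each SVG
    3DSTATE_* instruction, write a proper header plus an all-zero body so
    the slot never degenerates into a run of NOOPs.

    #include <stddef.h>
    #include <stdint.h>

    struct svg_state_instr {
        uint32_t header;  /* 3DSTATE_* opcode fields, per Bspec */
        uint32_t num_dw;  /* total instruction length in dwords */
    };

    /* Emit every listed SVG instruction with a real header and zeroed body. */
    static uint32_t *emit_zeroed_svg_state(uint32_t *cs,
                                           const struct svg_state_instr *table,
                                           size_t count)
    {
        size_t i, dw;

        for (i = 0; i < count; i++) {
            *cs++ = table[i].header | (table[i].num_dw - 2);
            for (dw = 1; dw < table[i].num_dw; dw++)
                *cs++ = 0;  /* explicit zeroed payload, not NOOPs */
        }

        return cs;
    }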
    
    Bspec: 46261
    Reviewed-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
    Link: https://lore.kernel.org/r/20231025151732.3461842-7-matthew.d.roper@intel.com
    Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
    Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>