    drm/i915/guc: Fix error capture for virtual engines · e4730ae4
    John Harrison authored
    GuC-based register dumps in error capture logs were broken for
    virtual engines. This can be seen in igt@gem_exec_balancer@hang:
      [IGT] gem_exec_balancer: starting subtest hang
      [drm] GPU HANG: ecode 12:4:e1524110, in gem_exec_balanc [6388]
      [drm] GT0: GUC: No register capture node found for 0x1005 / 0xFEDC311D
      [drm] GPU HANG: ecode 12:4:00000000, in gem_exec_balanc [6388]
      [IGT] gem_exec_balancer: exiting, ret=0
    
    The test causes a hang on both engines of a virtual engine context.
    The engine instance zero hang gets a valid error capture but the
    non-instance-zero hang does not.
    
    Fix that by scanning through the list of pending register captures
    when a hang notification for a virtual engine is received. That way,
    the hang can be assigned to the correct physical engine prior to
    starting the error capture process. When the error capture handler
    later looks up the engine register list, it searches on the correct
    engine.
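
The scan described above can be sketched in standalone C. This is only an illustration, not the driver code: `struct capture_node`, `struct phys_engine`, `capture_matches_engine()` and `resolve_hung_engine()` are hypothetical names, and matching on a context ID plus a context address is an assumption about which keys identify a pending capture.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one pending register capture. */
struct capture_node {
	uint32_t ctx_id;          /* assumed key: ID of the hung context */
	uint32_t ctx_addr;        /* assumed key: address of the hung context */
	uint8_t engine_class;     /* class of the engine the capture came from */
	uint8_t engine_instance;  /* instance of that engine */
	struct capture_node *next;
};

/* Hypothetical physical engine that backs a virtual engine. */
struct phys_engine {
	uint8_t engine_class;
	uint8_t engine_instance;
};

/* Return true if a pending capture for this context names this engine. */
static bool capture_matches_engine(const struct capture_node *pending,
				   uint32_t ctx_id, uint32_t ctx_addr,
				   const struct phys_engine *engine)
{
	const struct capture_node *n;

	for (n = pending; n; n = n->next) {
		if (n->ctx_id == ctx_id && n->ctx_addr == ctx_addr &&
		    n->engine_class == engine->engine_class &&
		    n->engine_instance == engine->engine_instance)
			return true;
	}

	return false;
}

/*
 * Walk the siblings of a virtual engine and return the one that has a
 * pending capture for the hung context, or NULL if none matches.
 */
static const struct phys_engine *
resolve_hung_engine(const struct capture_node *pending,
		    uint32_t ctx_id, uint32_t ctx_addr,
		    const struct phys_engine *siblings, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (capture_matches_engine(pending, ctx_id, ctx_addr,
					   &siblings[i]))
			return &siblings[i];
	}

	return NULL;
}
```

Resolving the physical engine before the capture starts means the later register list lookup uses the real class and instance rather than defaulting to instance zero. A NULL result leaves the caller free to choose its own fallback.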
    
    Also, sneak in a missing blank line before a comment in the node
    search code.
    
    v2: Fix null pointer deref on non-GuC platforms.
    Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
    Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20230428185636.457407-5-John.C.Harrison@Intel.com
intel_guc_capture.h 1.27 KB