    drm/i915: Decouple hang detection from hangcheck period · 3fe3b030
    Mika Kuoppala authored
    Hangcheck state accumulation has gained more steps
    over the years, such as head movement and, more recently,
    the subunit inactivity check. As the subunit sampling is only
    done if the previous state check showed inactivity, we
    have added more stages (and time) to reach a hang verdict.
    
    Asymmetric engine states led to a different actual weight for
    'one hangcheck unit'. Some hangs demonstrated that, due to the
    difference in stages, simpler engines were falsely accused of
    a hang because their score accumulated above the hang
    threshold much more quickly.
    
    To completely decouple the hangcheck guilty score
    from the hangcheck period, convert the hangcheck score to a
    rough measurement of the period of inactivity. As these are
    tracked as jiffies, they are also meaningful across
    reset boundaries. This makes finding a guilty engine
    more accurate in multi-engine activity scenarios,
    especially across asymmetric engines.
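
    As a rough illustration of the idea (not the code in this patch),
    the sketch below tracks when an engine last made progress and
    declares a hang only once the inactivity exceeds a timeout,
    independent of how often the check runs. All names here
    (ticks_t, ticks_after, engine_sample, engine_observe,
    engine_is_hung) are hypothetical; ticks_after() mimics the
    wraparound-safe comparison of the kernel's time_after() macro.

```c
#include <stdbool.h>

/* Hypothetical stand-in for jiffies; not the driver's actual types. */
typedef unsigned long ticks_t;

/* Wraparound-safe "a is after b", in the spirit of time_after(). */
static bool ticks_after(ticks_t a, ticks_t b)
{
	return (long)(b - a) < 0;
}

/* Hypothetical per-engine state: no global state is needed. */
struct engine_sample {
	unsigned int seqno;        /* last observed progress marker */
	ticks_t action_timestamp;  /* when progress was last observed */
};

/* Refresh the timestamp only if the engine has moved since last check. */
static void engine_observe(struct engine_sample *e, unsigned int seqno,
			   ticks_t now)
{
	if (seqno != e->seqno) {
		e->seqno = seqno;
		e->action_timestamp = now;
	}
}

/* Hung means: inactive for longer than 'timeout' ticks. */
static bool engine_is_hung(const struct engine_sample *e, ticks_t now,
			   ticks_t timeout)
{
	return ticks_after(now, e->action_timestamp + timeout);
}
```

    With this shape, an engine that is sampled through more stages
    than another is not penalised differently: only the elapsed
    time since its last observed progress counts.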
    
    We lose the ability to detect malicious cross-batch attempts
    to hinder progress. The plan is to move this functionality
    into context banning, which is a more natural fit,
    later in the series.
    
    v2: use time_before macros (Chris)
        reinstate the pardoning of moving engine after hc (Chris)
    v3: avoid global state for per engine stall detection (Chris)
    v4: take timeline last retirement into account (Chris)
    v5: do debug print on pardoning, split out retirement timestamp (Chris)
    
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
intel_hangcheck.c 13.4 KB