• Janusz Krzysztofik's avatar
    drm/i915: Fix premature release of request's reusable memory · a337b64f
    Janusz Krzysztofik authored
    Infinite waits for completion of GPU activity have been observed in CI,
    mostly inside __i915_active_wait(), triggered by igt@gem_barrier_race or
    igt@perf@stress-open-close.  Root cause analysis, based of ftrace dumps
    generated with a lot of extra trace_printk() calls added to the code,
    revealed loops of request dependencies being accidentally built,
    preventing the requests from being processed, each waiting for completion
    of another one's activity.
    
    After we substitute a new request for a last active one tracked on a
    timeline, we set up a dependency of our new request to wait on completion
    of current activity of that previous one.  While doing that, we must take
    care of keeping the old request still in memory until we use its
    attributes for setting up that await dependency, or we can happen to set
    up the await dependency on an unrelated request that already reuses the
    memory previously allocated to the old one, already released.  Combined
    with perf adding consecutive kernel context remote requests to different
    user context timelines, unresolvable loops of await dependencies can be
    built, leading do infinite waits.
    
    We obtain a pointer to the previous request to wait upon when we
    substitute it with a pointer to our new request in an active tracker,
    e.g. in intel_timeline.last_request.  In some processing paths we protect
    that old request from being freed before we use it by getting a reference
    to it under RCU protection, but in others, e.g.  __i915_request_commit()
    -> __i915_request_add_to_timeline() -> __i915_request_ensure_ordering(),
    we don't.  But anyway, since the requests' memory is SLAB_FAILSAFE_BY_RCU,
    that RCU protection is not sufficient against reuse of memory.
    
    We could protect i915_request's memory from being prematurely reused by
    calling its release function via call_rcu() and using rcu_read_lock()
    consequently, as proposed in v1.  However, that approach leads to
    significant (up to 10 times) increase of SLAB utilization by i915_request
    SLAB cache.  Another potential approach is to take a reference to the
    previous active fence.
    
    When updating an active fence tracker, we first lock the new fence,
    substitute a pointer of the current active fence with the new one, then we
    lock the substituted fence.  With this approach, there is a time window
    after the substitution and before the lock when the request can be
    concurrently released by an interrupt handler and its memory reused, then
    we may happen to lock and return a new, unrelated request.
    
    Always get a reference to the current active fence first, before
    replacing it with a new one.  Having it protected from premature release
    and reuse, lock it and then replace with the new one but only if not
    yet signalled via a potential concurrent interrupt nor replaced with
    another one by a potential concurrent thread, otherwise retry, starting
    from getting a reference to the new current one.  Adjust users to not
    get a reference to the previous active fence themselves and always put the
    reference got by __i915_active_fence_set() when no longer needed.
    
    v3: Fix lockdep splat reports and other issues caused by incorrect use of
        try_cmpxchg() (use (cmpxchg() != prev) instead)
    v2: Protect request's memory by getting a reference to it in favor of
        delegating its release to call_rcu() (Chris)
    
    Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8211
    Fixes: df9f85d8 ("drm/i915: Serialise i915_active_fence_set() with itself")
    Suggested-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: default avatarJanusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
    Cc: <stable@vger.kernel.org> # v5.6+
    Reviewed-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
    Signed-off-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20230720093543.832147-2-janusz.krzysztofik@linux.intel.com
    (cherry picked from commit 946e047a)
    Signed-off-by: default avatarTvrtko Ursulin <tvrtko.ursulin@intel.com>
    a337b64f
i915_active.c 30 KB