• Daniel Vetter's avatar
    drm/i915: fix wait_for_pending_flips vs gpu hang deadlock · 17e1df07
    Daniel Vetter authored
    My g33 here seems to be shockingly good at hitting them all. This time
    around kms_flip/flip-vs-panning-vs-hang blows up:
    
    intel_crtc_wait_for_pending_flips correctly checks for gpu hangs and
    if a gpu hang is pending aborts the wait for outstanding flips so that
    the setcrtc call will succeed and release the crtc mutex. And the gpu
    hang handler needs that lock in intel_display_handle_reset to be able
    to complete outstanding flips.
    
    The problem is that we can race in two ways:
    - Waiters on the dev_priv->pending_flip_queue aren't woken up after
      we've the reset as pending, but before we actually start the reset
      work. This means that the waiter doesn't notice the pending reset
      and hence will keep on hogging the locks.
    
      Like with dev->struct_mutex and the ring->irq_queue wait queues we
      there need to wake up everyone that potentially holds a lock which
      the reset handler needs.
    
    - intel_display_handle_reset was called _after_ we've already
      signalled the completion of the reset work. Which means a waiter
      could sneak in, grab the lock and never release it (since the
      pageflips won't ever get released).
    
      Similar to resetting the gem state all the reset work must complete
      before we update the reset counter. Contrary to the gem reset we
      don't need to have a second explicit wake up call since that will
      have happened already when completing the pageflips. We also don't
      have any issues that the completion happens while the reset state is
      still pending - wait_for_pending_flips is only there to ensure we
      display the right frame. After a gpu hang&reset events such
      guarantees are out the window anyway. This is in contrast to the gem
      code where too-early wake-up would result in unnecessary restarting
      of ioctls.
    
    Also, since we've gotten these various deadlocks and ordering
    constraints wrong so often throw copious amounts of comments at the
    code.
    
    This deadlock regression has been introduced in the commit which added
    the pageflip reset logic to the gpu hang work:
    
    commit 96a02917
    Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Date:   Mon Feb 18 19:08:49 2013 +0200
    
        drm/i915: Finish page flips and update primary planes after a GPU reset
    
    v2:
    - Add comments to explain how the wake_up serves as memory barriers
      for the atomic_t reset counter.
    - Improve the comments a bit as suggested by Chris Wilson.
    - Extract the wake_up calls before/after the reset into a little
      i915_error_wake_up and unconditionally wake up the
      pending_flip_queue waiters, again as suggested by Chris Wilson.
    
    v3: Throw copious amounts of comments at i915_error_wake_up as
    suggested by Chris Wilson.
    
    Cc: stable@vger.kernel.org
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: default avatarDaniel Vetter <daniel.vetter@ffwll.ch>
    17e1df07
i915_irq.c 93.9 KB