1. 30 Mar, 2024 2 commits
  2. 28 Mar, 2024 2 commits
  3. 26 Mar, 2024 4 commits
    • Chris Wilson's avatar
      drm/i915/gt: Reset queue_priority_hint on parking · 98850e96
      Chris Wilson authored
      Originally, with strict in order execution, we could complete execution
      only when the queue was empty. Preempt-to-busy allows replacement of an
      active request that may complete before the preemption is processed by
      HW. If that happens, the request is retired from the queue, but the
      queue_priority_hint remains set, preventing direct submission until
      after the next CS interrupt is processed.
      
      This preempt-to-busy race can be triggered by the heartbeat, which will
      also act as the power-management barrier and upon completion allow us to
      idle the HW. We may process the completion of the heartbeat, and begin
      parking the engine before the CS event that restores the
      queue_priority_hint, causing us to fail the assertion that it is MIN.
      
      <3>[  166.210729] __engine_park:283 GEM_BUG_ON(engine->sched_engine->queue_priority_hint != (-((int)(~0U >> 1)) - 1))
      <0>[  166.210781] Dumping ftrace buffer:
      <0>[  166.210795] ---------------------------------
      ...
      <0>[  167.302811] drm_fdin-1097      2..s1. 165741070us : trace_ports: 0000:00:02.0 rcs0: promote { ccid:20 1217:2 prio 0 }
      <0>[  167.302861] drm_fdin-1097      2d.s2. 165741072us : execlists_submission_tasklet: 0000:00:02.0 rcs0: preempting last=1217:2, prio=0, hint=2147483646
      <0>[  167.302928] drm_fdin-1097      2d.s2. 165741072us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence 1217:2, current 0
      <0>[  167.302992] drm_fdin-1097      2d.s2. 165741073us : __i915_request_submit: 0000:00:02.0 rcs0: fence 3:4660, current 4659
      <0>[  167.303044] drm_fdin-1097      2d.s1. 165741076us : execlists_submission_tasklet: 0000:00:02.0 rcs0: context:3 schedule-in, ccid:40
      <0>[  167.303095] drm_fdin-1097      2d.s1. 165741077us : trace_ports: 0000:00:02.0 rcs0: submit { ccid:40 3:4660* prio 2147483646 }
      <0>[  167.303159] kworker/-89       11..... 165741139us : i915_request_retire.part.0: 0000:00:02.0 rcs0: fence c90:2, current 2
      <0>[  167.303208] kworker/-89       11..... 165741148us : __intel_context_do_unpin: 0000:00:02.0 rcs0: context:c90 unpin
      <0>[  167.303272] kworker/-89       11..... 165741159us : i915_request_retire.part.0: 0000:00:02.0 rcs0: fence 1217:2, current 2
      <0>[  167.303321] kworker/-89       11..... 165741166us : __intel_context_do_unpin: 0000:00:02.0 rcs0: context:1217 unpin
      <0>[  167.303384] kworker/-89       11..... 165741170us : i915_request_retire.part.0: 0000:00:02.0 rcs0: fence 3:4660, current 4660
      <0>[  167.303434] kworker/-89       11d..1. 165741172us : __intel_context_retire: 0000:00:02.0 rcs0: context:1216 retire runtime: { total:56028ns, avg:56028ns }
      <0>[  167.303484] kworker/-89       11..... 165741198us : __engine_park: 0000:00:02.0 rcs0: parked
      <0>[  167.303534]   <idle>-0         5d.H3. 165741207us : execlists_irq_handler: 0000:00:02.0 rcs0: semaphore yield: 00000040
      <0>[  167.303583] kworker/-89       11..... 165741397us : __intel_context_retire: 0000:00:02.0 rcs0: context:1217 retire runtime: { total:325575ns, avg:0ns }
      <0>[  167.303756] kworker/-89       11..... 165741777us : __intel_context_retire: 0000:00:02.0 rcs0: context:c90 retire runtime: { total:0ns, avg:0ns }
      <0>[  167.303806] kworker/-89       11..... 165742017us : __engine_park: __engine_park:283 GEM_BUG_ON(engine->sched_engine->queue_priority_hint != (-((int)(~0U >> 1)) - 1))
      <0>[  167.303811] ---------------------------------
      <4>[  167.304722] ------------[ cut here ]------------
      <2>[  167.304725] kernel BUG at drivers/gpu/drm/i915/gt/intel_engine_pm.c:283!
      <4>[  167.304731] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      <4>[  167.304734] CPU: 11 PID: 89 Comm: kworker/11:1 Tainted: G        W          6.8.0-rc2-CI_DRM_14193-gc655e0fd2804+ #1
      <4>[  167.304736] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022
      <4>[  167.304738] Workqueue: i915-unordered retire_work_handler [i915]
      <4>[  167.304839] RIP: 0010:__engine_park+0x3fd/0x680 [i915]
      <4>[  167.304937] Code: 00 48 c7 c2 b0 e5 86 a0 48 8d 3d 00 00 00 00 e8 79 48 d4 e0 bf 01 00 00 00 e8 ef 0a d4 e0 31 f6 bf 09 00 00 00 e8 03 49 c0 e0 <0f> 0b 0f 0b be 01 00 00 00 e8 f5 61 fd ff 31 c0 e9 34 fd ff ff 48
      <4>[  167.304940] RSP: 0018:ffffc9000059fce0 EFLAGS: 00010246
      <4>[  167.304942] RAX: 0000000000000200 RBX: 0000000000000000 RCX: 0000000000000006
      <4>[  167.304944] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
      <4>[  167.304946] RBP: ffff8881330ca1b0 R08: 0000000000000001 R09: 0000000000000001
      <4>[  167.304947] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8881330ca000
      <4>[  167.304948] R13: ffff888110f02aa0 R14: ffff88812d1d0205 R15: ffff88811277d4f0
      <4>[  167.304950] FS:  0000000000000000(0000) GS:ffff88844f780000(0000) knlGS:0000000000000000
      <4>[  167.304952] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      <4>[  167.304953] CR2: 00007fc362200c40 CR3: 000000013306e003 CR4: 0000000000770ef0
      <4>[  167.304955] PKRU: 55555554
      <4>[  167.304957] Call Trace:
      <4>[  167.304958]  <TASK>
      <4>[  167.305573]  ____intel_wakeref_put_last+0x1d/0x80 [i915]
      <4>[  167.305685]  i915_request_retire.part.0+0x34f/0x600 [i915]
      <4>[  167.305800]  retire_requests+0x51/0x80 [i915]
      <4>[  167.305892]  intel_gt_retire_requests_timeout+0x27f/0x700 [i915]
      <4>[  167.305985]  process_scheduled_works+0x2db/0x530
      <4>[  167.305990]  worker_thread+0x18c/0x350
      <4>[  167.305993]  kthread+0xfe/0x130
      <4>[  167.305997]  ret_from_fork+0x2c/0x50
      <4>[  167.306001]  ret_from_fork_asm+0x1b/0x30
      <4>[  167.306004]  </TASK>
      
      It is necessary for the queue_priority_hint to be lower than the next
      request submission upon waking up, as we rely on the hint to decide when
      to kick the tasklet to submit that first request.
      
      Fixes: 22b7a426 ("drm/i915/execlists: Preempt-to-busy")
      Closes: https://gitlab.freedesktop.org/drm/intel/issues/10154Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: default avatarJanusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
      Cc: <stable@vger.kernel.org> # v5.4+
      Reviewed-by: default avatarRodrigo Vivi <rodrigo.vivi@intel.com>
      Reviewed-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Signed-off-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240318135906.716055-2-janusz.krzysztofik@linux.intel.com
      98850e96
    • Janusz Krzysztofik's avatar
      Revert "drm/i915: Wait for active retire before i915_active_fini()" · d666a494
      Janusz Krzysztofik authored
      This reverts commit 7a2280e8, obsoleted
      by "drm/i915/vma: Fix UAF on destroy against retire race".
      Signed-off-by: default avatarJanusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
      Cc: Nirmoy Das <nirmoy.das@intel.com>
      Reviewed-by: default avatarNirmoy Das <nirmoy.das@intel.com>
      Signed-off-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240305143747.335367-8-janusz.krzysztofik@linux.intel.com
      d666a494
    • Janusz Krzysztofik's avatar
      drm/i915: Remove extra multi-gt pm-references · 1f33dc0c
      Janusz Krzysztofik authored
      There was an attempt to fix an issue of illegal attempts to free a still
      active i915 VMA object when parking a GT believed to be idle, reported by
      CI on 2-GT Meteor Lake.  As a solution, an extra wakeref for a Primary GT
      was acquired from i915_gem_do_execbuffer() -- see commit f56fe3e9
      ("drm/i915: Fix a VMA UAF for multi-gt platform").
      
      However, that fix occurred insufficient -- the issue was still reported by
      CI.  That wakeref was released on exit from i915_gem_do_execbuffer(), then
      potentially before completion of the request and deactivation of its
      associated VMAs.  Moreover, CI reports indicated that single-GT platforms
      also suffered sporadically from the same race.
      
      Since the issue has now been fixed by a preceding patch "drm/i915/vma: Fix
      UAF on destroy against retire race", drop the no longer useful changes
      introduced by that insufficient fix.
      
      v3: Also drop the no longer used .wakeref_gt0 field from struct
          i915_execbuffer.
      v2: Avoid the word "revert" in commit message (Rodrigo),
        - update commit description reusing relevant chunks dropped from the
          description of the proper fix (Rodrigo).
      Signed-off-by: default avatarJanusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
      Cc: Nirmoy Das <nirmoy.das@intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Reviewed-by: default avatarNirmoy Das <nirmoy.das@intel.com>
      Signed-off-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240305143747.335367-7-janusz.krzysztofik@linux.intel.com
      1f33dc0c
    • Janusz Krzysztofik's avatar
      drm/i915/vma: Fix UAF on destroy against retire race · f3c71b2d
      Janusz Krzysztofik authored
      Object debugging tools were sporadically reporting illegal attempts to
      free a still active i915 VMA object when parking a GT believed to be idle.
      
      [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915]
      [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0
      ...
      [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1
      [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022
      [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915]
      [161.360592] RIP: 0010:debug_print_object+0x80/0xb0
      ...
      [161.361347] debug_object_free+0xeb/0x110
      [161.361362] i915_active_fini+0x14/0x130 [i915]
      [161.361866] release_references+0xfe/0x1f0 [i915]
      [161.362543] i915_vma_parked+0x1db/0x380 [i915]
      [161.363129] __gt_park+0x121/0x230 [i915]
      [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915]
      
      That has been tracked down to be happening when another thread is
      deactivating the VMA inside __active_retire() helper, after the VMA's
      active counter has been already decremented to 0, but before deactivation
      of the VMA's object is reported to the object debugging tool.
      
      We could prevent from that race by serializing i915_active_fini() with
      __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from
      being used, e.g. from __i915_vma_retire() called at the end of
      __active_retire(), after that VMA has been already freed by a concurrent
      i915_vma_destroy() on return from the i915_active_fini().  Then, we should
      rather fix the issue at the VMA level, not in i915_active.
      
      Since __i915_vma_parked() is called from __gt_park() on last put of the
      GT's wakeref, the issue could be addressed by holding the GT wakeref long
      enough for __active_retire() to complete before that wakeref is released
      and the GT parked.
      
      I believe the issue was introduced by commit d9393973 ("drm/i915:
      Remove the vma refcount") which moved a call to i915_active_fini() from
      a dropped i915_vma_release(), called on last put of the removed VMA kref,
      to i915_vma_parked() processing path called on last put of a GT wakeref.
      However, its visibility to the object debugging tool was suppressed by a
      bug in i915_active that was fixed two weeks later with commit e92eb246
      ("drm/i915/active: Fix missing debug object activation").
      
      A VMA associated with a request doesn't acquire a GT wakeref by itself.
      Instead, it depends on a wakeref held directly by the request's active
      intel_context for a GT associated with its VM, and indirectly on that
      intel_context's engine wakeref if the engine belongs to the same GT as the
      VMA's VM.  Those wakerefs are released asynchronously to VMA deactivation.
      
      Fix the issue by getting a wakeref for the VMA's GT when activating it,
      and putting that wakeref only after the VMA is deactivated.  However,
      exclude global GTT from that processing path, otherwise the GPU never goes
      idle.  Since __i915_vma_retire() may be called from atomic contexts, use
      async variant of wakeref put.  Also, to avoid circular locking dependency,
      take care of acquiring the wakeref before VM mutex when both are needed.
      
      v7: Add inline comments with justifications for:
          - using untracked variants of intel_gt_pm_get/put() (Nirmoy),
          - using async variant of _put(),
          - not getting the wakeref in case of a global GTT,
          - always getting the first wakeref outside vm->mutex.
      v6: Since __i915_vma_active/retire() callbacks are not serialized, storing
          a wakeref tracking handle inside struct i915_vma is not safe, and
          there is no other good place for that.  Use untracked variants of
          intel_gt_pm_get/put_async().
      v5: Replace "tile" with "GT" across commit description (Rodrigo),
        - avoid mentioning multi-GT case in commit description (Rodrigo),
        - explain why we need to take a temporary wakeref unconditionally inside
          i915_vma_pin_ww() (Rodrigo).
      v4: Refresh on top of commit 5e4e06e4 ("drm/i915: Track gt pm
          wakerefs") (Andi),
        - for more easy backporting, split out removal of former insufficient
          workarounds and move them to separate patches (Nirmoy).
        - clean up commit message and description a bit.
      v3: Identify root cause more precisely, and a commit to blame,
        - identify and drop former workarounds,
        - update commit message and description.
      v2: Get the wakeref before VM mutex to avoid circular locking dependency,
        - drop questionable Fixes: tag.
      
      Fixes: d9393973 ("drm/i915: Remove the vma refcount")
      Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875Signed-off-by: default avatarJanusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Nirmoy Das <nirmoy.das@intel.com>
      Cc: Andi Shyti <andi.shyti@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: stable@vger.kernel.org # v5.19+
      Reviewed-by: default avatarNirmoy Das <nirmoy.das@intel.com>
      Signed-off-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240305143747.335367-6-janusz.krzysztofik@linux.intel.com
      f3c71b2d
  4. 20 Mar, 2024 1 commit
  5. 13 Mar, 2024 1 commit
  6. 11 Mar, 2024 1 commit
  7. 07 Mar, 2024 5 commits
  8. 06 Mar, 2024 1 commit
  9. 05 Mar, 2024 2 commits
    • Janusz Krzysztofik's avatar
      drm/i915/selftests: Fix dependency of some timeouts on HZ · 6ee3f54b
      Janusz Krzysztofik authored
      Third argument of i915_request_wait() accepts a timeout value in jiffies.
      Most users pass either a simple HZ based expression, or a result of
      msecs_to_jiffies(), or MAX_SCHEDULE_TIMEOUT, or a very small number not
      exceeding 4 if applicable as that value.  However, there is one user --
      intel_selftest_wait_for_rq() -- that passes a WAIT_FOR_RESET_TIME symbol,
      defined as a large constant value that most probably represents a desired
      timeout in ms.  While that usage results in the intended value of timeout
      on usual x86_64 kernel configurations, it is not portable across different
      architectures and custom kernel configs.
      
      Rename the symbol to clearly indicate intended units and convert it to
      jiffies before use.
      
      Fixes: 3a4bfa09 ("drm/i915/selftest: Fix workarounds selftest for GuC submission")
      Signed-off-by: default avatarJanusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
      Cc: Rahul Kumar Singh <rahul.kumar.singh@intel.com>
      Cc: John Harrison <John.C.Harrison@Intel.com>
      Cc: Matthew Brost <matthew.brost@intel.com>
      Reviewed-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Signed-off-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240222113347.648945-2-janusz.krzysztofik@linux.intel.com
      6ee3f54b
    • Janusz Krzysztofik's avatar
      drm/i915/selftest_hangcheck: Check sanity with more patience · 6616e048
      Janusz Krzysztofik authored
      While trying to reproduce some other issues reported by CI for i915
      hangcheck live selftest, I found them hidden behind timeout failures
      reported by igt_hang_sanitycheck -- the very first hangcheck test case
      executed.
      
      Feb 22 19:49:06 DUT1394ACMR kernel: calling  mei_gsc_driver_init+0x0/0xff0 [mei_gsc] @ 121074
      Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] DRM_I915_DEBUG enabled
      Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
      Feb 22 19:49:06 DUT1394ACMR kernel: probe of i915.mei-gsc.768 returned 0 after 1475 usecs
      Feb 22 19:49:06 DUT1394ACMR kernel: probe of i915.mei-gscfi.768 returned 0 after 1441 usecs
      Feb 22 19:49:06 DUT1394ACMR kernel: initcall mei_gsc_driver_init+0x0/0xff0 [mei_gsc] returned 0 after 3010 usecs
      Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] DRM_I915_DEBUG_GEM enabled
      Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] DRM_I915_DEBUG_RUNTIME_PM enabled
      Feb 22 19:49:06 DUT1394ACMR kernel: i915: Performing live selftests with st_random_seed=0x4c26c048 st_timeout=500
      Feb 22 19:49:07 DUT1394ACMR kernel: i915: Running hangcheck
      Feb 22 19:49:07 DUT1394ACMR kernel: calling  mei_hdcp_driver_init+0x0/0xff0 [mei_hdcp] @ 121074
      Feb 22 19:49:07 DUT1394ACMR kernel: i915: Running intel_hangcheck_live_selftests/igt_hang_sanitycheck
      Feb 22 19:49:07 DUT1394ACMR kernel: probe of 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04 returned 0 after 1398 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: probe of i915.mei-gsc.768-b638ab7e-94e2-4ea2-a552-d1c54b627f04 returned 0 after 97 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: initcall mei_hdcp_driver_init+0x0/0xff0 [mei_hdcp] returned 0 after 101960 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: calling  mei_pxp_driver_init+0x0/0xff0 [mei_pxp] @ 121094
      Feb 22 19:49:07 DUT1394ACMR kernel: probe of 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1 returned 0 after 435 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
      Feb 22 19:49:07 DUT1394ACMR kernel: 100ms wait for request failed on rcs0, err=-62
      Feb 22 19:49:07 DUT1394ACMR kernel: probe of i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1 returned 0 after 158425 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: initcall mei_pxp_driver_init+0x0/0xff0 [mei_pxp] returned 0 after 224159 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: i915/intel_hangcheck_live_selftests: igt_hang_sanitycheck failed with error -5
      Feb 22 19:49:07 DUT1394ACMR kernel: i915: probe of 0000:03:00.0 failed with error -5
      
      Those request waits, once timed out after 100ms, have never been
      confirmed to still persist over another 100ms, always being able to
      complete within the originally requested wait time doubled.
      
      Taking into account potentially significant additional concurrent workload
      generated by new auxiliary drivers that didn't exist before and now are
      loaded in parallel with the i915 module also when loaded in selftest mode,
      relax our expectations on time consumed by the sanity check request before
      it completes.
      Signed-off-by: default avatarJanusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
      Reviewed-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Signed-off-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240228152500.38267-2-janusz.krzysztofik@linux.intel.com
      6616e048
  10. 04 Mar, 2024 1 commit
  11. 01 Mar, 2024 2 commits
  12. 28 Feb, 2024 1 commit
  13. 20 Feb, 2024 1 commit
  14. 15 Feb, 2024 1 commit
  15. 14 Feb, 2024 1 commit
  16. 13 Feb, 2024 1 commit
  17. 12 Feb, 2024 1 commit
  18. 24 Jan, 2024 2 commits
  19. 18 Jan, 2024 2 commits
  20. 10 Jan, 2024 1 commit
  21. 09 Jan, 2024 3 commits
    • John Harrison's avatar
      drm/i915/guc: Avoid circular locking issue on busyness flush · 0e00a881
      John Harrison authored
      Avoid the following lockdep complaint:
      <4> [298.856498] ======================================================
      <4> [298.856500] WARNING: possible circular locking dependency detected
      <4> [298.856503] 6.7.0-rc5-CI_DRM_14017-g58ac4ffc75b6+ #1 Tainted: G
          N
      <4> [298.856505] ------------------------------------------------------
      <4> [298.856507] kworker/4:1H/190 is trying to acquire lock:
      <4> [298.856509] ffff8881103e9978 (&gt->reset.backoff_srcu){++++}-{0:0}, at:
      _intel_gt_reset_lock+0x35/0x380 [i915]
      <4> [298.856661]
      but task is already holding lock:
      <4> [298.856663] ffffc900013f7e58
      ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at:
      process_scheduled_works+0x264/0x530
      <4> [298.856671]
      which lock already depends on the new lock.
      
      The complaint is not actually valid. The busyness worker thread does
      indeed hold the worker lock and then attempt to acquire the reset lock
      (which may have happened in reverse order elsewhere). However, it does
      so with a trylock that exits if the reset lock is not available
      (specifically to prevent this and other similar deadlocks).
      Unfortunately, lockdep does not understand the trylock semantics (the
      lock is an i915 specific custom implementation for resets).
      
      Not doing a synchronous flush of the worker thread when a reset is in
      progress resolves the lockdep splat by never even attempting to grab
      the lock in this particular scenario.
      
      There are situatons where a synchronous cancel is required, however.
      So, always do the synchronous cancel if not in reset. And add an extra
      synchronous cancel to the end of the reset flow to account for when a
      reset is occurring at driver shutdown and the cancel is no longer
      synchronous but could lead to unallocated memory accesses if the
      worker is not stopped.
      Signed-off-by: default avatarZhanjun Dong <zhanjun.dong@intel.com>
      Signed-off-by: default avatarJohn Harrison <John.C.Harrison@Intel.com>
      Cc: Andi Shyti <andi.shyti@linux.intel.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Reviewed-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Acked-by: default avatarDaniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20231219195957.212600-1-John.C.Harrison@Intel.com
      0e00a881
    • Alan Previn's avatar
      drm/i915/guc: Close deregister-context race against CT-loss · 2f2cc53b
      Alan Previn authored
      If we are at the end of suspend or very early in resume
      its possible an async fence signal (via rcu_call) is triggered
      to free_engines which could lead us to the execution of
      the context destruction worker (after a prior worker flush).
      
      Thus, when suspending, insert rcu_barriers at the start
      of i915_gem_suspend (part of driver's suspend prepare) and
      again in i915_gem_suspend_late so that all such cases have
      completed and context destruction list isn't missing anything.
      
      In destroyed_worker_func, close the race against CT-loss
      by checking that CT is enabled before calling into
      deregister_destroyed_contexts.
      
      Based on testing, guc_lrc_desc_unpin may still race and fail
      as we traverse the GuC's context-destroy list because the
      CT could be disabled right before calling GuC's CT send function.
      
      We've witnessed this race condition once every ~6000-8000
      suspend-resume cycles while ensuring workloads that render
      something onscreen is continuously started just before
      we suspend (and the workload is small enough to complete
      and trigger the queued engine/context free-up either very
      late in suspend or very early in resume).
      
      In such a case, we need to unroll the entire process because
      guc-lrc-unpin takes a gt wakeref which only gets released in
      the G2H IRQ reply that never comes through in this corner
      case. Without the unroll, the taken wakeref is leaked and will
      cascade into a kernel hang later at the tail end of suspend in
      this function:
      
         intel_wakeref_wait_for_idle(&gt->wakeref)
         (called by) - intel_gt_pm_wait_for_idle
         (called by) - wait_for_suspend
      
      Thus, do an unroll in guc_lrc_desc_unpin and deregister_destroyed_-
      contexts if guc_lrc_desc_unpin fails due to CT send falure.
      When unrolling, keep the context in the GuC's destroy-list so
      it can get picked up on the next destroy worker invocation
      (if suspend aborted) or get fully purged as part of a GuC
      sanitization (end of suspend) or a reset flow.
      Signed-off-by: default avatarAlan Previn <alan.previn.teres.alexis@intel.com>
      Signed-off-by: default avatarAnshuman Gupta <anshuman.gupta@intel.com>
      Tested-by: default avatarMousumi Jana <mousumi.jana@intel.com>
      Acked-by: default avatarDaniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
      Reviewed-by: default avatarRodrigo Vivi <rodrigo.vivi@intel.com>
      Signed-off-by: default avatarMatt Roper <matthew.d.roper@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20231229215143.581619-1-alan.previn.teres.alexis@intel.com
      2f2cc53b
    • Alan Previn's avatar
      drm/i915/guc: Flush context destruction worker at suspend · 5e83c060
      Alan Previn authored
      When suspending, flush the context-guc-id
      deregistration worker at the final stages of
      intel_gt_suspend_late when we finally call gt_sanitize
      that eventually leads down to __uc_sanitize so that
      the deregistration worker doesn't fire off later as
      we reset the GuC microcontroller.
      Signed-off-by: default avatarAlan Previn <alan.previn.teres.alexis@intel.com>
      Reviewed-by: default avatarRodrigo Vivi <rodrigo.vivi@intel.com>
      Tested-by: default avatarMousumi Jana <mousumi.jana@intel.com>
      Signed-off-by: default avatarMatt Roper <matthew.d.roper@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20231228045558.536585-2-alan.previn.teres.alexis@intel.com
      5e83c060
  22. 06 Jan, 2024 1 commit
  23. 05 Jan, 2024 2 commits
  24. 02 Jan, 2024 1 commit