- 19 Apr, 2024 1 commit
-
-
Daniele Ceraolo Spurio authored
On DG2, submissions to VCS engines tied to a gem context are blocked until the HuC is loaded. Since some selftests do use a gem context, wait for the HuC load to complete before running the tests to avoid contamination. Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10564Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240410201505.894594-1-daniele.ceraolospurio@intel.com
-
- 14 Apr, 2024 1 commit
-
-
Christophe JAILLET authored
ida_alloc() and ida_free() should be preferred to the deprecated ida_simple_get() and ida_simple_remove(). Note that the upper limit of ida_simple_get() is exclusive, but the one of ida_alloc_range() is inclusive. So a -1 has been added when needed. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/7108c1871c6cb08d403c4fa6534bc7e6de4cb23d.1705245316.git.christophe.jaillet@wanadoo.frSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
-
- 05 Apr, 2024 1 commit
-
-
John Harrison authored
The previous fix for the circlular lock splat about the busyness worker wasn't quite complete. Even though the reset-in-progress flag is cleared at the start of intel_uc_reset_finish, the entire function is still inside the reset mutex lock. Not sure why the patch appeared to fix the issue both locally and in CI. However, it is now back again. There is a further complication that the wedge code path within intel_gt_reset() jumps around so much that it results in nested reset_prepare/_finish calls. That is, the call sequence is: intel_gt_reset | reset_prepare | __intel_gt_set_wedged | | reset_prepare | | reset_finish | reset_finish The nested finish means that even if the clear of the in-progress flag was moved to the end of _finish, it would still be clear for the entire second call. Surprisingly, this does not seem to be causing any other problems at present. As an aside, a wedge on fini does not call the finish functions at all. The reset_in_progress flag is left set (twice). So instead of trying to cancel the worker anywhere at all in the reset path, just add a cancel to intel_guc_submission_fini instead. Note that it is not a problem if the worker is still active during a reset. Either it will run before the reset path starts locking things and will simply block the reset code for a tiny amount of time. Or it will run after the locks have been acquired and will early exit due to the try-lock. Also, do not use the reset-in-progress flag to decide whether a synchronous cancel is safe (from a lockdep perspective) or not. Instead, use the actual reset mutex state (both the genuine one and the custom rolled BACKOFF one). Fixes: 0e00a881 ("drm/i915/guc: Avoid circular locking issue on busyness flush") Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Cc: Zhanjun Dong <zhanjun.dong@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Cc: Andrzej Hajda <andrzej.hajda@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Jonathan Cavitt <jonathan.cavitt@intel.com> Cc: Prathap Kumar Valsan <prathap.kumar.valsan@intel.com> Cc: Alan Previn <alan.previn.teres.alexis@intel.com> Cc: Madhumitha Tolakanahalli Pradeep <madhumitha.tolakanahalli.pradeep@intel.com> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dnyaneshwar Bhadane <dnyaneshwar.bhadane@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240329235306.1559639-1-John.C.Harrison@Intel.com
-
- 03 Apr, 2024 1 commit
-
-
Rodrigo Vivi authored
This null check is bogus because we are already using 'ce' stuff in many places before this function is called. Having this here is useless and confuses static analyzer tools that can see: struct intel_engine_cs *engine = ce->engine; before this check, in the same function. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202403101225.7AheJhZJ-lkp@intel.com/ Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240328213107.90632-1-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
-
- 30 Mar, 2024 3 commits
-
-
Andi Shyti authored
Enable only one CCS engine by default with all the compute sices allocated to it. While generating the list of UABI engines to be exposed to the user, exclude any additional CCS engines beyond the first instance. This change can be tested with igt i915_query. Fixes: d2eae8e9 ("drm/i915/dg2: Drop force_probe requirement") Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Cc: Chris Wilson <chris.p.wilson@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: <stable@vger.kernel.org> # v6.2+ Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Acked-by: Michal Mrozek <michal.mrozek@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240328073409.674098-4-andi.shyti@linux.intel.com
-
Andi Shyti authored
We want a fixed load CCS balancing consisting in all slices sharing one single user engine. For this reason do not create the intel_engine_cs structure with its dedicated command streamer for CCS slices beyond the first. Fixes: d2eae8e9 ("drm/i915/dg2: Drop force_probe requirement") Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Cc: Chris Wilson <chris.p.wilson@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: <stable@vger.kernel.org> # v6.2+ Acked-by: Michal Mrozek <michal.mrozek@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240328073409.674098-3-andi.shyti@linux.intel.com
-
Andi Shyti authored
The hardware should not dynamically balance the load between CCS engines. Wa_14019159160 recommends disabling it across all platforms. Fixes: d2eae8e9 ("drm/i915/dg2: Drop force_probe requirement") Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Cc: Chris Wilson <chris.p.wilson@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: <stable@vger.kernel.org> # v6.2+ Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Acked-by: Michal Mrozek <michal.mrozek@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240328073409.674098-2-andi.shyti@linux.intel.com
-
- 28 Mar, 2024 2 commits
-
-
Andi Shyti authored
Anyone using 'dev_priv' instead of 'i915' in a cleaned-up area should be fined and required to do community service for a few days. Using 'i915' instead of 'dev_priv' has been the preferred practice over the past years and some effort has been spent to replace 'dev_priv' with 'i915'. Therefore, 'dev_priv' should almost never be used (unless it breaks some defines which are dependent on the naming). I thought I had cleaned up the 'gem/' directory in the past, but still, old aficionados of the 'dev_priv' name keep sneaking it in. Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Reviewed-by: Tvrtko Ursulin <tursulin@ursulin.net> Link: https://patchwork.freedesktop.org/patch/msgid/20240328071833.664001-1-andi.shyti@linux.intel.com
-
Andi Shyti authored
Commit 9bb66c17 ("drm/i915: Reserve some kernel space per vm") reduces the available VM space of one page in order to apply Wa_16018031267 and Wa_16018063123. This page was reserved indiscrimitely in all platforms even when not needed. Limit it to DG2 onwards. Fixes: 9bb66c17 ("drm/i915: Reserve some kernel space per vm") Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Cc: Andrzej Hajda <andrzej.hajda@intel.com> Cc: Chris Wilson <chris.p.wilson@linux.intel.com> Cc: Jonathan Cavitt <jonathan.cavitt@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Acked-by: Michal Mrozek <michal.mrozek@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240327200546.640108-1-andi.shyti@linux.intel.com
-
- 26 Mar, 2024 4 commits
-
-
Chris Wilson authored
Originally, with strict in order execution, we could complete execution only when the queue was empty. Preempt-to-busy allows replacement of an active request that may complete before the preemption is processed by HW. If that happens, the request is retired from the queue, but the queue_priority_hint remains set, preventing direct submission until after the next CS interrupt is processed. This preempt-to-busy race can be triggered by the heartbeat, which will also act as the power-management barrier and upon completion allow us to idle the HW. We may process the completion of the heartbeat, and begin parking the engine before the CS event that restores the queue_priority_hint, causing us to fail the assertion that it is MIN. <3>[ 166.210729] __engine_park:283 GEM_BUG_ON(engine->sched_engine->queue_priority_hint != (-((int)(~0U >> 1)) - 1)) <0>[ 166.210781] Dumping ftrace buffer: <0>[ 166.210795] --------------------------------- ... <0>[ 167.302811] drm_fdin-1097 2..s1. 165741070us : trace_ports: 0000:00:02.0 rcs0: promote { ccid:20 1217:2 prio 0 } <0>[ 167.302861] drm_fdin-1097 2d.s2. 165741072us : execlists_submission_tasklet: 0000:00:02.0 rcs0: preempting last=1217:2, prio=0, hint=2147483646 <0>[ 167.302928] drm_fdin-1097 2d.s2. 165741072us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence 1217:2, current 0 <0>[ 167.302992] drm_fdin-1097 2d.s2. 165741073us : __i915_request_submit: 0000:00:02.0 rcs0: fence 3:4660, current 4659 <0>[ 167.303044] drm_fdin-1097 2d.s1. 165741076us : execlists_submission_tasklet: 0000:00:02.0 rcs0: context:3 schedule-in, ccid:40 <0>[ 167.303095] drm_fdin-1097 2d.s1. 165741077us : trace_ports: 0000:00:02.0 rcs0: submit { ccid:40 3:4660* prio 2147483646 } <0>[ 167.303159] kworker/-89 11..... 165741139us : i915_request_retire.part.0: 0000:00:02.0 rcs0: fence c90:2, current 2 <0>[ 167.303208] kworker/-89 11..... 165741148us : __intel_context_do_unpin: 0000:00:02.0 rcs0: context:c90 unpin <0>[ 167.303272] kworker/-89 11..... 165741159us : i915_request_retire.part.0: 0000:00:02.0 rcs0: fence 1217:2, current 2 <0>[ 167.303321] kworker/-89 11..... 165741166us : __intel_context_do_unpin: 0000:00:02.0 rcs0: context:1217 unpin <0>[ 167.303384] kworker/-89 11..... 165741170us : i915_request_retire.part.0: 0000:00:02.0 rcs0: fence 3:4660, current 4660 <0>[ 167.303434] kworker/-89 11d..1. 165741172us : __intel_context_retire: 0000:00:02.0 rcs0: context:1216 retire runtime: { total:56028ns, avg:56028ns } <0>[ 167.303484] kworker/-89 11..... 165741198us : __engine_park: 0000:00:02.0 rcs0: parked <0>[ 167.303534] <idle>-0 5d.H3. 165741207us : execlists_irq_handler: 0000:00:02.0 rcs0: semaphore yield: 00000040 <0>[ 167.303583] kworker/-89 11..... 165741397us : __intel_context_retire: 0000:00:02.0 rcs0: context:1217 retire runtime: { total:325575ns, avg:0ns } <0>[ 167.303756] kworker/-89 11..... 165741777us : __intel_context_retire: 0000:00:02.0 rcs0: context:c90 retire runtime: { total:0ns, avg:0ns } <0>[ 167.303806] kworker/-89 11..... 165742017us : __engine_park: __engine_park:283 GEM_BUG_ON(engine->sched_engine->queue_priority_hint != (-((int)(~0U >> 1)) - 1)) <0>[ 167.303811] --------------------------------- <4>[ 167.304722] ------------[ cut here ]------------ <2>[ 167.304725] kernel BUG at drivers/gpu/drm/i915/gt/intel_engine_pm.c:283! <4>[ 167.304731] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI <4>[ 167.304734] CPU: 11 PID: 89 Comm: kworker/11:1 Tainted: G W 6.8.0-rc2-CI_DRM_14193-gc655e0fd2804+ #1 <4>[ 167.304736] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 <4>[ 167.304738] Workqueue: i915-unordered retire_work_handler [i915] <4>[ 167.304839] RIP: 0010:__engine_park+0x3fd/0x680 [i915] <4>[ 167.304937] Code: 00 48 c7 c2 b0 e5 86 a0 48 8d 3d 00 00 00 00 e8 79 48 d4 e0 bf 01 00 00 00 e8 ef 0a d4 e0 31 f6 bf 09 00 00 00 e8 03 49 c0 e0 <0f> 0b 0f 0b be 01 00 00 00 e8 f5 61 fd ff 31 c0 e9 34 fd ff ff 48 <4>[ 167.304940] RSP: 0018:ffffc9000059fce0 EFLAGS: 00010246 <4>[ 167.304942] RAX: 0000000000000200 RBX: 0000000000000000 RCX: 0000000000000006 <4>[ 167.304944] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009 <4>[ 167.304946] RBP: ffff8881330ca1b0 R08: 0000000000000001 R09: 0000000000000001 <4>[ 167.304947] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8881330ca000 <4>[ 167.304948] R13: ffff888110f02aa0 R14: ffff88812d1d0205 R15: ffff88811277d4f0 <4>[ 167.304950] FS: 0000000000000000(0000) GS:ffff88844f780000(0000) knlGS:0000000000000000 <4>[ 167.304952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4>[ 167.304953] CR2: 00007fc362200c40 CR3: 000000013306e003 CR4: 0000000000770ef0 <4>[ 167.304955] PKRU: 55555554 <4>[ 167.304957] Call Trace: <4>[ 167.304958] <TASK> <4>[ 167.305573] ____intel_wakeref_put_last+0x1d/0x80 [i915] <4>[ 167.305685] i915_request_retire.part.0+0x34f/0x600 [i915] <4>[ 167.305800] retire_requests+0x51/0x80 [i915] <4>[ 167.305892] intel_gt_retire_requests_timeout+0x27f/0x700 [i915] <4>[ 167.305985] process_scheduled_works+0x2db/0x530 <4>[ 167.305990] worker_thread+0x18c/0x350 <4>[ 167.305993] kthread+0xfe/0x130 <4>[ 167.305997] ret_from_fork+0x2c/0x50 <4>[ 167.306001] ret_from_fork_asm+0x1b/0x30 <4>[ 167.306004] </TASK> It is necessary for the queue_priority_hint to be lower than the next request submission upon waking up, as we rely on the hint to decide when to kick the tasklet to submit that first request. Fixes: 22b7a426 ("drm/i915/execlists: Preempt-to-busy") Closes: https://gitlab.freedesktop.org/drm/intel/issues/10154Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: <stable@vger.kernel.org> # v5.4+ Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240318135906.716055-2-janusz.krzysztofik@linux.intel.com
-
Janusz Krzysztofik authored
This reverts commit 7a2280e8, obsoleted by "drm/i915/vma: Fix UAF on destroy against retire race". Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240305143747.335367-8-janusz.krzysztofik@linux.intel.com
-
Janusz Krzysztofik authored
There was an attempt to fix an issue of illegal attempts to free a still active i915 VMA object when parking a GT believed to be idle, reported by CI on 2-GT Meteor Lake. As a solution, an extra wakeref for a Primary GT was acquired from i915_gem_do_execbuffer() -- see commit f56fe3e9 ("drm/i915: Fix a VMA UAF for multi-gt platform"). However, that fix occurred insufficient -- the issue was still reported by CI. That wakeref was released on exit from i915_gem_do_execbuffer(), then potentially before completion of the request and deactivation of its associated VMAs. Moreover, CI reports indicated that single-GT platforms also suffered sporadically from the same race. Since the issue has now been fixed by a preceding patch "drm/i915/vma: Fix UAF on destroy against retire race", drop the no longer useful changes introduced by that insufficient fix. v3: Also drop the no longer used .wakeref_gt0 field from struct i915_execbuffer. v2: Avoid the word "revert" in commit message (Rodrigo), - update commit description reusing relevant chunks dropped from the description of the proper fix (Rodrigo). Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240305143747.335367-7-janusz.krzysztofik@linux.intel.com
-
Janusz Krzysztofik authored
Object debugging tools were sporadically reporting illegal attempts to free a still active i915 VMA object when parking a GT believed to be idle. [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 ... [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 ... [161.361347] debug_object_free+0xeb/0x110 [161.361362] i915_active_fini+0x14/0x130 [i915] [161.361866] release_references+0xfe/0x1f0 [i915] [161.362543] i915_vma_parked+0x1db/0x380 [i915] [161.363129] __gt_park+0x121/0x230 [i915] [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915] That has been tracked down to be happening when another thread is deactivating the VMA inside __active_retire() helper, after the VMA's active counter has been already decremented to 0, but before deactivation of the VMA's object is reported to the object debugging tool. We could prevent from that race by serializing i915_active_fini() with __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from being used, e.g. from __i915_vma_retire() called at the end of __active_retire(), after that VMA has been already freed by a concurrent i915_vma_destroy() on return from the i915_active_fini(). Then, we should rather fix the issue at the VMA level, not in i915_active. Since __i915_vma_parked() is called from __gt_park() on last put of the GT's wakeref, the issue could be addressed by holding the GT wakeref long enough for __active_retire() to complete before that wakeref is released and the GT parked. I believe the issue was introduced by commit d9393973 ("drm/i915: Remove the vma refcount") which moved a call to i915_active_fini() from a dropped i915_vma_release(), called on last put of the removed VMA kref, to i915_vma_parked() processing path called on last put of a GT wakeref. However, its visibility to the object debugging tool was suppressed by a bug in i915_active that was fixed two weeks later with commit e92eb246 ("drm/i915/active: Fix missing debug object activation"). A VMA associated with a request doesn't acquire a GT wakeref by itself. Instead, it depends on a wakeref held directly by the request's active intel_context for a GT associated with its VM, and indirectly on that intel_context's engine wakeref if the engine belongs to the same GT as the VMA's VM. Those wakerefs are released asynchronously to VMA deactivation. Fix the issue by getting a wakeref for the VMA's GT when activating it, and putting that wakeref only after the VMA is deactivated. However, exclude global GTT from that processing path, otherwise the GPU never goes idle. Since __i915_vma_retire() may be called from atomic contexts, use async variant of wakeref put. Also, to avoid circular locking dependency, take care of acquiring the wakeref before VM mutex when both are needed. v7: Add inline comments with justifications for: - using untracked variants of intel_gt_pm_get/put() (Nirmoy), - using async variant of _put(), - not getting the wakeref in case of a global GTT, - always getting the first wakeref outside vm->mutex. v6: Since __i915_vma_active/retire() callbacks are not serialized, storing a wakeref tracking handle inside struct i915_vma is not safe, and there is no other good place for that. Use untracked variants of intel_gt_pm_get/put_async(). v5: Replace "tile" with "GT" across commit description (Rodrigo), - avoid mentioning multi-GT case in commit description (Rodrigo), - explain why we need to take a temporary wakeref unconditionally inside i915_vma_pin_ww() (Rodrigo). v4: Refresh on top of commit 5e4e06e4 ("drm/i915: Track gt pm wakerefs") (Andi), - for more easy backporting, split out removal of former insufficient workarounds and move them to separate patches (Nirmoy). - clean up commit message and description a bit. v3: Identify root cause more precisely, and a commit to blame, - identify and drop former workarounds, - update commit message and description. v2: Get the wakeref before VM mutex to avoid circular locking dependency, - drop questionable Fixes: tag. Fixes: d9393973 ("drm/i915: Remove the vma refcount") Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: stable@vger.kernel.org # v5.19+ Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240305143747.335367-6-janusz.krzysztofik@linux.intel.com
-
- 20 Mar, 2024 1 commit
-
-
Radhakrishna Sripada authored
Disable clockgating for TDL SVHS fub. v2: Implement in general render/compute wa's(MattR) Bspec: 46045 Cc: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Radhakrishna Sripada <radhakrishna.sripada@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240318210025.562698-1-radhakrishna.sripada@intel.com
-
- 13 Mar, 2024 1 commit
-
-
Nirmoy Das authored
Caching mode is HW dependent so pick a correct one using intel_gt_coherent_map_type(). Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: Jonathan Cavitt <jonathan.cavitt@intel.com> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10249Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> Acked-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240312111815.18083-1-nirmoy.das@intel.com
-
- 11 Mar, 2024 1 commit
-
-
Lucas De Marchi authored
With dynamic load-balancing disabled on the compute side, there's no reason left to enable WA 16015675438. Drop it from both PVC and DG2. Note that this can be done because now the driver always set a fixed partition of EUs during initialization via the ccs_mode configuration. The flag to GuC is still needed because of 18020744125, so update the comment accordingly. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Acked-by: Mateusz Jablonski <mateusz.jablonski@intel.com> Acked-by: Michal Mrozek <michal.mrozek@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240306144723.1826977-1-lucas.demarchi@intel.comSigned-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
-
- 07 Mar, 2024 5 commits
-
-
John Harrison authored
Use the new w/a KLV support to enable a MTL w/a. Note, this w/a is a super-set of Wa_16019325821, so requires turning that one as well as setting the new flag for Wa_14019159160 itself. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240223205632.1621019-4-John.C.Harrison@Intel.com
-
John Harrison authored
To prevent running out of bits, new w/a enable flags are being added via a KLV system instead of a 32 bit flags word. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240223205632.1621019-3-John.C.Harrison@Intel.com
-
John Harrison authored
Some platforms require holding RCS context switches until CCS is idle (the reverse w/a of Wa_14014475959). Some platforms require both versions. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240223205632.1621019-2-John.C.Harrison@Intel.com
-
Vinay Belgaumkar authored
Allow user to provide a low latency context hint. When set, KMD sends a hint to GuC which results in special handling for this context. SLPC will ramp the GT frequency aggressively every time it switches to this context. The down freq threshold will also be lower so GuC will ramp down the GT freq for this context more slowly. We also disable waitboost for this context as that will interfere with the strategy. We need to enable the use of SLPC Compute strategy during init, but it will apply only to contexts that set this bit during context creation. Userland can check whether this feature is supported using a new param- I915_PARAM_HAS_CONTEXT_FREQ_HINT. This flag is true for all guc submission enabled platforms as they use SLPC for frequency management. The Mesa usage model for this flag is here - https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint v2: Rename flags as per review suggestions (Rodrigo, Tvrtko). Also, use flag bits in intel_context as it allows finer control for toggling per engine if needed (Tvrtko). v3: Minor review comments (Tvrtko) v4: Update comment (Sushma) Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Acked-by: Ivan Briano <ivan.briano@intel.com> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240306012759.204938-1-vinay.belgaumkar@intel.com
-
Tejas Upadhyay authored
Applying WA 14018575942 only on Compute engine has impact on some apps like chrome. Updating this WA to apply on Render engine as well as it is helping with performance on Chrome. Note: There is no concern from media team thus not applying WA on media engines. We will revisit if any issues reported from media team. V2(Matt): - Use correct WA number Fixes: 668f37e1 ("drm/i915/mtl: Update workaround 14018778641") Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240228103738.2018458-1-tejas.upadhyay@intel.com
-
- 06 Mar, 2024 1 commit
-
-
John Harrison authored
The above w/a is required for every platform that the i915 driver supports. It is fixed on the latest platforms but they are only supported by Xe instead of i915. So just remove the platform check completely and keep the code simple. v2: Add extra comment (review feedback from Rodrigo). Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240223202846.1532176-1-John.C.Harrison@Intel.com
-
- 05 Mar, 2024 2 commits
-
-
Janusz Krzysztofik authored
Third argument of i915_request_wait() accepts a timeout value in jiffies. Most users pass either a simple HZ based expression, or a result of msecs_to_jiffies(), or MAX_SCHEDULE_TIMEOUT, or a very small number not exceeding 4 if applicable as that value. However, there is one user -- intel_selftest_wait_for_rq() -- that passes a WAIT_FOR_RESET_TIME symbol, defined as a large constant value that most probably represents a desired timeout in ms. While that usage results in the intended value of timeout on usual x86_64 kernel configurations, it is not portable across different architectures and custom kernel configs. Rename the symbol to clearly indicate intended units and convert it to jiffies before use. Fixes: 3a4bfa09 ("drm/i915/selftest: Fix workarounds selftest for GuC submission") Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: Rahul Kumar Singh <rahul.kumar.singh@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240222113347.648945-2-janusz.krzysztofik@linux.intel.com
-
Janusz Krzysztofik authored
While trying to reproduce some other issues reported by CI for i915 hangcheck live selftest, I found them hidden behind timeout failures reported by igt_hang_sanitycheck -- the very first hangcheck test case executed. Feb 22 19:49:06 DUT1394ACMR kernel: calling mei_gsc_driver_init+0x0/0xff0 [mei_gsc] @ 121074 Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] DRM_I915_DEBUG enabled Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes Feb 22 19:49:06 DUT1394ACMR kernel: probe of i915.mei-gsc.768 returned 0 after 1475 usecs Feb 22 19:49:06 DUT1394ACMR kernel: probe of i915.mei-gscfi.768 returned 0 after 1441 usecs Feb 22 19:49:06 DUT1394ACMR kernel: initcall mei_gsc_driver_init+0x0/0xff0 [mei_gsc] returned 0 after 3010 usecs Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] DRM_I915_DEBUG_GEM enabled Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] DRM_I915_DEBUG_RUNTIME_PM enabled Feb 22 19:49:06 DUT1394ACMR kernel: i915: Performing live selftests with st_random_seed=0x4c26c048 st_timeout=500 Feb 22 19:49:07 DUT1394ACMR kernel: i915: Running hangcheck Feb 22 19:49:07 DUT1394ACMR kernel: calling mei_hdcp_driver_init+0x0/0xff0 [mei_hdcp] @ 121074 Feb 22 19:49:07 DUT1394ACMR kernel: i915: Running intel_hangcheck_live_selftests/igt_hang_sanitycheck Feb 22 19:49:07 DUT1394ACMR kernel: probe of 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04 returned 0 after 1398 usecs Feb 22 19:49:07 DUT1394ACMR kernel: probe of i915.mei-gsc.768-b638ab7e-94e2-4ea2-a552-d1c54b627f04 returned 0 after 97 usecs Feb 22 19:49:07 DUT1394ACMR kernel: initcall mei_hdcp_driver_init+0x0/0xff0 [mei_hdcp] returned 0 after 101960 usecs Feb 22 19:49:07 DUT1394ACMR kernel: calling mei_pxp_driver_init+0x0/0xff0 [mei_pxp] @ 121094 Feb 22 19:49:07 DUT1394ACMR kernel: probe of 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1 returned 0 after 435 usecs Feb 22 19:49:07 DUT1394ACMR kernel: mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915]) Feb 22 19:49:07 DUT1394ACMR kernel: 100ms wait for request failed on rcs0, err=-62 Feb 22 19:49:07 DUT1394ACMR kernel: probe of i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1 returned 0 after 158425 usecs Feb 22 19:49:07 DUT1394ACMR kernel: initcall mei_pxp_driver_init+0x0/0xff0 [mei_pxp] returned 0 after 224159 usecs Feb 22 19:49:07 DUT1394ACMR kernel: i915/intel_hangcheck_live_selftests: igt_hang_sanitycheck failed with error -5 Feb 22 19:49:07 DUT1394ACMR kernel: i915: probe of 0000:03:00.0 failed with error -5 Those request waits, once timed out after 100ms, have never been confirmed to still persist over another 100ms, always being able to complete within the originally requested wait time doubled. Taking into account potentially significant additional concurrent workload generated by new auxiliary drivers that didn't exist before and now are loaded in parallel with the i915 module also when loaded in selftest mode, relax our expectations on time consumed by the sanity check request before it completes. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240228152500.38267-2-janusz.krzysztofik@linux.intel.com
-
- 04 Mar, 2024 1 commit
-
-
John Harrison authored
The EIR register (0x20B0) was being included in the engine class list for render and compute as the absolute register address. However, it is actually a ring register available on all engines at an offset of (base) + 0xB0. As it was included as an RCS engine but with the absolute address, GuC was adding on another 0x2000 and coming out at an invalid location. Thus it would reject the register and complain about only managing a partial capture. So update the list to use the RING_EIR version of the register and include it for all engines. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240223203204.1533410-1-John.C.Harrison@Intel.com
-
- 01 Mar, 2024 2 commits
-
-
Andi Shyti authored
Get the guc reference from the gt using the gt_to_guc() helper. Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231229102734.674362-3-andi.shyti@linux.intel.com
-
Andi Shyti authored
We already have guc_to_gt() and getting to guc from the GT it requires some mental effort. Add the gt_to_guc(). Given the reference to the "gt", the gt_to_guc() will return the pinter to the "guc". Update all the files under the gt/ directory. Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231229102734.674362-2-andi.shyti@linux.intel.com
-
- 28 Feb, 2024 1 commit
-
-
Nirmoy Das authored
Error in mmu_interval_notifier_insert() can leave a NULL notifier.mm pointer. Catch that and return early. Fixes: ed29c269 ("drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.") Cc: <stable@vger.kernel.org> # v5.13+ [tursulin: Added Fixes and cc stable.] Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Shawn Lee <shawn.c.lee@intel.com> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240219125047.28906-1-nirmoy.das@intel.comSigned-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
-
- 20 Feb, 2024 1 commit
-
-
Tvrtko Ursulin authored
Tooling appears very strict so lets pacify it by adding some comments, even if fields are completely self-explanatory. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Fixes: b1123648 ("drm/i915: Add GuC submission interface version query") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Jose Souza <jose.souza@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240219132517.1868604-1-tvrtko.ursulin@linux.intel.com
-
- 15 Feb, 2024 1 commit
-
-
John Harrison authored
The context persistence code does things like send super high priority heartbeat pulses to ensure any leaked context can still be pre-empted and thus isn't a total denial of service but only a minor denial of service. Unfortunately, it wasn't bothering to restart the heartbeat worker with a fresh timeout. Thus, if a persistent context happened to be closed just before the heartbeat was going to go ping anyway then the forced pulse would get a negligble execution time. And as the forced pulse is super high priority, the worker thread's next step is a reset. Which means a potentially innocent system randomly goes boom when attempting to close a context. So, force a re-schedule of the worker thread with the appropriate timeout. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240110210216.4125092-1-John.C.Harrison@Intel.com
-
- 14 Feb, 2024 1 commit
-
-
Tvrtko Ursulin authored
Add a new query to the GuC submission interface version. Mesa intends to use this information to check for old firmware versions with a known bug where using the render and compute command streamers simultaneously can cause GPU hangs due issues in firmware scheduling. Based on patches from Vivaik and Joonas. Compile tested only. v2: * Added branch version. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Kenneth Graunke <kenneth@whitecape.org> Cc: Jose Souza <jose.souza@intel.com> Cc: Sagar Ghuge <sagar.ghuge@intel.com> Cc: Paulo Zanoni <paulo.r.zanoni@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Vivaik Balasubrawmanian <vivaik.balasubrawmanian@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Tested-by: José Roberto de Souza <jose.souza@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240208082510.1363268-1-tvrtko.ursulin@linux.intel.com
-
- 13 Feb, 2024 1 commit
-
-
Anirban Sk authored
Sometimes gt_pm live_rc6_manual selftest fails due to no power being measured for the rc6 disabled period. Therefore increasing the rc6 disable period from 250ms to 1000ms to rule out such sporadic failure. v3: - More descriptive and improved commit message (Anshuman) Signed-off-by: Anirban Sk <sk.anirban@intel.com> Reviewed-by: Anshuman Gupta <anshuman.gupta@intel.com> Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240212050738.1162198-1-sk.anirban@intel.com
-
- 12 Feb, 2024 1 commit
-
-
Erick Archer authored
The "struct i915_syncmap" uses a dynamically sized set of trailing elements. It can use an "u32" array or a "struct i915_syncmap *" array. So, use the preferred way in the kernel declaring flexible arrays [1]. Because there are two possibilities for the trailing arrays, it is necessary to declare a union and use the DECLARE_FLEX_ARRAY macro. The comment can be removed as the union is now clear enough. Also, avoid the open-coded arithmetic in the memory allocator functions [2] using the "struct_size" macro. Moreover, refactor the "__sync_seqno" and "__sync_child" functions due to now it is possible to use the union members added to the structure. This way, it is also possible to avoid the open-coded arithmetic in pointers. Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays [1] Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2] Signed-off-by: Erick Archer <erick.archer@gmx.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240208181318.4259-1-erick.archer@gmx.com
-
- 24 Jan, 2024 2 commits
-
-
Juan Escamilla authored
The sysfs file is named 'enabled', thus users might want to know the true state of the RC6 instead of only the indication if the RC6 should be enabled. Let's use rc6.enable directly instead of rc6.supported. Signed-off-by: Juan Escamilla <jcescami@wasd.net> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240116172922.3460695-1-jcescami@wasd.net
-
Vinay Belgaumkar authored
Instead of waiting until the interrupt reaches GuC, we can grab a forcewake while triggering the H2G interrupt. GEN11_GUC_HOST_INTERRUPT is inside sgunit and is not affected by forcewakes. However, there could be some delays when platform is entering/exiting some higher level platform sleep states and a H2G is triggered. A forcewake ensures those sleep states have been fully exited and further processing occurs as expected. The hysteresis timers for C6 and higher sleep states will ensure there is no unwanted race between the wake and processing of the interrupts by GuC. This will have an official WA soon so adding a FIXME in the comments. v2: Make the new ranges watertight to address BAT failures and update commit message (Matt R). Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240119193513.221730-1-vinay.belgaumkar@intel.com
-
- 18 Jan, 2024 2 commits
-
-
Matt Roper authored
Some of our existing Xe_LPG workarounds and tuning are also applicable to the version 12.74 variant. Extend the condition bounds accordingly. Also fix the comment on Wa_14018575942 while we're at it. v2: Extend some more workarounds (Harish) Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Haridhar Kalvala <haridhar.kalvala@intel.com> Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240108122738.14399-4-haridhar.kalvala@intel.com
-
Harish Chegondi authored
Xe_LPG+ (IP version 12.74) should take the same general code paths as Xe_LPG (versions 12.70 and 12.71). Xe_LPG+'s workaround list will be handled by the next patch. Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Haridhar Kalvala <haridhar.kalvala@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240108122738.14399-3-haridhar.kalvala@intel.com
-
- 10 Jan, 2024 1 commit
-
-
Juan Escamilla authored
Currently if rc6 is supported, it gets enabled and the sysfs files for rc6_enable_show and rc6_enable_dev_show uses masks to check information from drm_i915_private. However rc6_support functions take more variables and conditions into consideration and thus these masks are not enough for most of the modern hardware and it is simpley lyting to the user. Let's fix it by at least use the rc6.supported flag from intel_gt information. Signed-off-by: Juan Escamilla <jcescami@wasd.net> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240110010302.553597-1-jcescami@wasd.net
-
- 09 Jan, 2024 2 commits
-
-
John Harrison authored
Avoid the following lockdep complaint: <4> [298.856498] ====================================================== <4> [298.856500] WARNING: possible circular locking dependency detected <4> [298.856503] 6.7.0-rc5-CI_DRM_14017-g58ac4ffc75b6+ #1 Tainted: G N <4> [298.856505] ------------------------------------------------------ <4> [298.856507] kworker/4:1H/190 is trying to acquire lock: <4> [298.856509] ffff8881103e9978 (>->reset.backoff_srcu){++++}-{0:0}, at: _intel_gt_reset_lock+0x35/0x380 [i915] <4> [298.856661] but task is already holding lock: <4> [298.856663] ffffc900013f7e58 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x264/0x530 <4> [298.856671] which lock already depends on the new lock. The complaint is not actually valid. The busyness worker thread does indeed hold the worker lock and then attempt to acquire the reset lock (which may have happened in reverse order elsewhere). However, it does so with a trylock that exits if the reset lock is not available (specifically to prevent this and other similar deadlocks). Unfortunately, lockdep does not understand the trylock semantics (the lock is an i915 specific custom implementation for resets). Not doing a synchronous flush of the worker thread when a reset is in progress resolves the lockdep splat by never even attempting to grab the lock in this particular scenario. There are situatons where a synchronous cancel is required, however. So, always do the synchronous cancel if not in reset. And add an extra synchronous cancel to the end of the reset flow to account for when a reset is occurring at driver shutdown and the cancel is no longer synchronous but could lead to unallocated memory accesses if the worker is not stopped. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Daniel Vetter <daniel@ffwll.ch> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20231219195957.212600-1-John.C.Harrison@Intel.com
-
Alan Previn authored
If we are at the end of suspend or very early in resume its possible an async fence signal (via rcu_call) is triggered to free_engines which could lead us to the execution of the context destruction worker (after a prior worker flush). Thus, when suspending, insert rcu_barriers at the start of i915_gem_suspend (part of driver's suspend prepare) and again in i915_gem_suspend_late so that all such cases have completed and context destruction list isn't missing anything. In destroyed_worker_func, close the race against CT-loss by checking that CT is enabled before calling into deregister_destroyed_contexts. Based on testing, guc_lrc_desc_unpin may still race and fail as we traverse the GuC's context-destroy list because the CT could be disabled right before calling GuC's CT send function. We've witnessed this race condition once every ~6000-8000 suspend-resume cycles while ensuring workloads that render something onscreen is continuously started just before we suspend (and the workload is small enough to complete and trigger the queued engine/context free-up either very late in suspend or very early in resume). In such a case, we need to unroll the entire process because guc-lrc-unpin takes a gt wakeref which only gets released in the G2H IRQ reply that never comes through in this corner case. Without the unroll, the taken wakeref is leaked and will cascade into a kernel hang later at the tail end of suspend in this function: intel_wakeref_wait_for_idle(>->wakeref) (called by) - intel_gt_pm_wait_for_idle (called by) - wait_for_suspend Thus, do an unroll in guc_lrc_desc_unpin and deregister_destroyed_- contexts if guc_lrc_desc_unpin fails due to CT send falure. When unrolling, keep the context in the GuC's destroy-list so it can get picked up on the next destroy worker invocation (if suspend aborted) or get fully purged as part of a GuC sanitization (end of suspend) or a reset flow. Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> Tested-by: Mousumi Jana <mousumi.jana@intel.com> Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231229215143.581619-1-alan.previn.teres.alexis@intel.com
-