Commits · a323782567812ee925e9b7926445532c7afe331b · Kirill Smelkov / linux

23 Aug, 2024 4 commits

drm/xe: Drop warn on xe_guc_pc_gucrc_disable in guc pc fini · a3237825

Matthew Brost authored Aug 20, 2024

Not a big deal if CT is down as driver is unloading, no need to warn.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jagmeet Randhawa <jagmeet.randhawa@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240820172958.1095143-4-matthew.brost@intel.com

a3237825

drm/xe: Set firmware state to loadable before registering guc_fini_hw · b5de6a5c

Matthew Brost authored Aug 20, 2024

The guc_fini_hw registered calls __xe_uc_fw_status which is only
expected to be called after initializing fw state. Move this before
registering guc_fini_hw.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240820172958.1095143-3-matthew.brost@intel.com

b5de6a5c

drm/xe: Move ggtt_fini to devm managed · 5b993d00

Matthew Brost authored Aug 20, 2024

ggtt->scratch is destroyed via devm, ggtt_fini sets ggtt->scratch to
NULL, ggtt->scratch in GGTT clears, so ensure ggtt->scratch is set NULL
before the BO is destroyed.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240820172958.1095143-2-matthew.brost@intel.com

5b993d00

Revert "drm/xe: Invalidate media_gt TLBs in PT code" · 25ebe10e

Matthew Brost authored Aug 23, 2024

This reverts commit 40520283.

We can't install dma-fence-chain in timeline sync objs.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240823162207.2168887-1-matthew.brost@intel.com

25ebe10e

22 Aug, 2024 14 commits

drm/xe: Fix missing runtime outer protection for ggtt_remove_node · 919bb54e

Rodrigo Vivi authored Aug 21, 2024

Defer the ggtt node removal to a thread if runtime_pm is not active.

The ggtt node removal can be called from multiple places, including
places where we cannot protect with outer callers and places we are
within other locks. So, try to grab the runtime reference if the
device is already active, otherwise defer the removal to a separate
thread from where we are sure we can wake the device up.

v2: - use xe wq instead of system wq (Matt and CI)
    - Avoid GFP_KERNEL to be future proof since this removal can
    be called from outside our drivers and we don't want to block
    if atomic is needed. (Brost)
v3: amend forgot chunk declaring xe_device.
v4: Use a xe_ggtt_region to encapsulate the node and remova info,
    wihtout the need for any memory allocation at runtime.
v5: Actually fill the delayed_removal.invalidate (Brost)
v6: - Ensure that ggtt_region is not freed before work finishes (Auld)
    - Own wq to ensures that the queued works are flushed before
      ggtt_fini (Brost)
v7: also free ggtt_region on early !bound return (Auld)
v8: Address the null deref (CI)
v9: Based on the new xe_ggtt_node for the proper care of the lifetime
    of the object.
v10: Redo the lost v5 change. (Brost)
v11: Simplify the invalidate_on_remove (Lucas)

Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821193842.352557-12-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

919bb54e

drm/xe: Make xe_ggtt_node struct independent · 34e80422

Rodrigo Vivi authored Aug 21, 2024

In some rare cases, the drm_mm node cannot be removed synchronously
due to runtime PM conditions. In this situation, the node removal will
be delegated to a workqueue that will be able to wake up the device
before removing the node.

However, in this situation, the lifetime of the xe_ggtt_node cannot
be restricted to the lifetime of the parent object. So, this patch
introduces the infrastructure so the xe_ggtt_node struct can be
allocated in advance and freed when needed.

By having the ggtt backpointer, it also ensure that the init function
is always called before any attempt to insert or reserve the node
in the GGTT.

v2: s/xe_ggtt_node_force_fini/xe_ggtt_node_fini and use it
    internaly (Brost)
v3: - Use GF_NOFS for node allocation (CI)
    - Avoid ggtt argument, now that we have it inside the node (Lucas)
    - Fix some missed fini cases (CI)
v4: - Fix SRIOV critical case where config->ggtt_region was
      lost (Michal)
    - Avoid ggtt argument also on removal (missed case on v3) (Michal)
    - Remove useless checks (Michal)
    - Return 0 instead of negative errno on a u32 addr. (Michal)
    - s/xe_ggtt_assign/xe_ggtt_node_assign for coherence, while we
      are touching it (Michal)
v5: - Fix VFs' ggtt_balloon

Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821193842.352557-11-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

34e80422

drm/xe: Refactor xe_ggtt balloon functions to make the node clear · 15ca0949

Rodrigo Vivi authored Aug 21, 2024

These operations are related to node. Convert them to the
new appropriate name space xe_ggtt_node.

v2: Also move arguments around for consistency (Lucas).
v3: s/node_balloon/node_insert_balloon and
s/node_deballoon/node_remove_balloon (Michal).
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821193842.352557-10-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

15ca0949

drm/xe: Introduce xe_ggtt_print_holes · 13636729

Rodrigo Vivi authored Aug 21, 2024

Introduce a new xe_ggtt_print_holes helper that attends the SRIOV
demand and finishes the goal of limiting drm_mm access to xe_ggtt.

Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821193842.352557-9-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

13636729

drm/xe: Introduce xe_ggtt_largest_hole · 1144e0df

Rodrigo Vivi authored Aug 21, 2024

Introduce a new xe_ggtt_largest_hole helper that attends the SRIOV
demand and continue with the goal of limiting drm_mm access to xe_ggtt.

v2: Fix a typo (Michal)

Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821193842.352557-8-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

1144e0df

drm/xe: Limit drm_mm_node_allocated access to xe_ggtt_node · 8b5ccc97

Rodrigo Vivi authored Aug 21, 2024

Continue with the encapsulation of drm_mm_node inside xe_ggtt.

Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821193842.352557-7-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

8b5ccc97

drm/xe: Rename xe_ggtt_node related functions · 0567f18e

Rodrigo Vivi authored Aug 21, 2024

Bring some consistency and prepare for more xe_ggtt_node related
functions to be introduced.
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821193842.352557-6-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

0567f18e

drm/xe: Encapsulate drm_mm_node inside xe_ggtt_node · 6062ea93

Rodrigo Vivi authored Aug 21, 2024

The xe_ggtt component uses drm_mm to manage the GGTT.
The drm_mm_node is just a node inside drm_mm, but in Xe we use that
only in the GGTT context. So, this patch encapsulates the drm_mm_node
into a xe_ggtt's new struct.

This is the first step towards limiting all the drm_mm access
through xe_ggtt. The ultimate goal is to have a better control of
the node insertion and removal, so the removal can be delegated
to a delayed workqueue.

v2: Fix includes and typos (Michal and Brost)
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821193842.352557-5-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

6062ea93

drm/{i915, xe}: Avoid direct inspection of dpt_vma from outside dpt · 6dbd43dc

Rodrigo Vivi authored Aug 21, 2024

DPT code is so dependent on i915 vma implementation and it is not
ported yet to Xe.

This patch limits inspection to DPT's VMA struct to intel_dpt
component only, so the Xe GGTT code can evolve.

Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Juha-Pekka Heikkila <juhapekka.heikkila@gmail.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821193842.352557-4-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

6dbd43dc

drm/xe: Remove unnecessary drm_mm.h includes · df99acc7

Rodrigo Vivi authored Aug 21, 2024

These includes are no longer necessary, and where appropriate
are replaced by the linux/types.h one.
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821193842.352557-3-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

df99acc7

drm/xe: Introduce GGTT documentation · 244fe166

Rodrigo Vivi authored Aug 21, 2024

Document xe_ggtt and ensure it is part of the built kernel docs.

v2: - Accepted all Michal's suggestions
    - Rebased on top of new set_pte per platform/wa function pointer
v3: - Typos and other acronym fixes (Michal)

Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> #v1
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821193842.352557-2-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

244fe166

drm/xe: Removed unused xe_ggtt_printk · 69f0925c

Rodrigo Vivi authored Aug 21, 2024

Apparently this was only useful when enabling ggtt support
for the very first time and never used again.
It is also not useful now that we have the ggtt_dump available
through debugfs.
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821193842.352557-1-rodrigo.vivi@intel.comSigned-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

69f0925c

drm/xe: fixup xe_alloc_pf_queue · 321d6b4b

Matthew Auld authored Aug 21, 2024

kzalloc expects number of bytes, therefore we should convert the number
of dw into bytes, otherwise we are likely just accessing beyond the
array causing all kinds of carnage. Also fixup the error handling while
we are here.

v2:
 - Prefer kcalloc (dim)

Fixes: 3338e4f9 ("drm/xe: Use topology to determine page fault queue size")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Stuart Summers <stuart.summers@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240821171917.417386-2-matthew.auld@intel.com

321d6b4b

drm/xe: Invalidate media_gt TLBs in PT code · 40520283

Matthew Brost authored Aug 20, 2024

Testing on LNL has shown media GT's TLBs need to be invalidated via the
GuC, update PT code appropriately.

v2:
 - Do dma_fence_get before first call of invalidation_fence_init (Himal)
 - No need to check for valid chain fence (Himal)

Fixes: 33303615 ("drm/xe/lnl: Add LNL platform definition")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240820161632.987369-1-matthew.brost@intel.com

40520283

21 Aug, 2024 2 commits

drm/xe: Invalidate media_gt TLBs · 77cc3f6c

Matthew Brost authored Aug 20, 2024

Testing on LNL has shown media TLBs need to be invalidated via the GuC,
update xe_vm_invalidate_vma appropriately.

v2: Fix 2 tile case
v3: Include missing local change

Fixes: 33303615 ("drm/xe/lnl: Add LNL platform definition")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240820160129.986889-1-matthew.brost@intel.com

77cc3f6c

drm/xe: Free job before xe_exec_queue_put · 32a42c93

Matthew Brost authored Aug 20, 2024

Free job depends on job->vm being valid, the last xe_exec_queue_put can
destroy the VM. Prevent UAF by freeing job before xe_exec_queue_put.

Fixes: dd08ebf6 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Jagmeet Randhawa <jagmeet.randhawa@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240820202309.1260755-1-matthew.brost@intel.com

32a42c93

20 Aug, 2024 4 commits

drm/xe: Drop HW fence pointer to HW fence ctx · 60db6f54

Matthew Brost authored Aug 15, 2024

The HW fence ctx objects are not ref counted rather tied to the life of
an LRC object. HW fences reference the HW fence ctx, HW fences can
outlive LRCs thus resulting in UAF. Drop the  HW fence pointer to HW
fence ctx rather just store what is needed directly in HW fence.

v2:
 - Fix typo in commit (Ashutosh)
 - Use snprintf (Ashutosh)

Fixes: dd08ebf6 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240815193522.16008-1-matthew.brost@intel.com

60db6f54

drm/xe/guc: Bump the G2H queue size to account for page faults · df2dbc92

Stuart Summers authored Aug 17, 2024

With the increase in the size of the recoverable page fault
queue, we want to ensure the initial messages from GuC in
the G2H buffer have space while we transfer those out to the
actual pf_queue. Bump the G2H queue size to account for this
increase in the pf_queue size.
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/4c2b6974801bcffd8a010d838c8733fa4092573d.1723862633.git.stuart.summers@intel.com

df2dbc92

drm/xe: Use topology to determine page fault queue size · 3338e4f9

Stuart Summers authored Aug 17, 2024

Currently the page fault queue size is hard coded. However
the hardware supports faulting for each EU and each CS.
For some applications running on hardware with a large
number of EUs and CSs, this can result in an overflow of
the page fault queue.

Add a small calculation to determine the page fault queue
size based on the number of EUs and CSs in the platform as
detmined by fuses.
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/24d582a3b48c97793b8b6a402f34b4b469471636.1723862633.git.stuart.summers@intel.com

3338e4f9

drm/xe: Fix missing workqueue destroy in xe_gt_pagefault · 7586fc52

Stuart Summers authored Aug 17, 2024

On driver reload we never free up the memory for the pagefault and
access counter workqueues. Add those destroy calls here.

Fixes: dd08ebf6 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/c9a951505271dc3a7aee76de7656679f69c11518.1723862633.git.stuart.summers@intel.com

7586fc52

19 Aug, 2024 7 commits

drm/xe/lnl: Offload system clear page activity to GPU · 23683061

Nirmoy Das authored Aug 16, 2024

On LNL because of flat CCS, driver creates migrates job to clear
CCS meta data. Extend that to also clear system pages using GPU.
Inform TTM to allocate pages without __GFP_ZERO to avoid double page
clearing by clearing out TTM_TT_FLAG_ZERO_ALLOC flag and set
TTM_TT_FLAG_CLEARED_ON_FREE while freeing to skip ttm pool's clear
on free as XE now takes care of clearing pages. If a bo is in system
placement such as BO created with  DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING
and there is a cpu map then for such BO gpu clear will be avoided as
there is no dma mapping for such BO at that moment to create migration
jobs.

Tested this patch api_overhead_benchmark_l0 from
https://github.com/intel/compute-benchmarks

Without the patch:
api_overhead_benchmark_l0 --testFilter=UsmMemoryAllocation:
UsmMemoryAllocation(api=l0 type=Host size=4KB) 84.206 us
UsmMemoryAllocation(api=l0 type=Host size=1GB) 105775.56 us
erf tool top 5 entries:
71.44% api_overhead_be  [kernel.kallsyms]   [k] clear_page_erms
6.34%  api_overhead_be  [kernel.kallsyms]   [k] __pageblock_pfn_to_page
2.24%  api_overhead_be  [kernel.kallsyms]   [k] cpa_flush
2.15%  api_overhead_be  [kernel.kallsyms]   [k] pages_are_mergeable
1.94%  api_overhead_be  [kernel.kallsyms]   [k] find_next_iomem_res

With the patch:
api_overhead_benchmark_l0 --testFilter=UsmMemoryAllocation:
UsmMemoryAllocation(api=l0 type=Host size=4KB) 79.439 us
UsmMemoryAllocation(api=l0 type=Host size=1GB) 98677.75 us
Perf tool top 5 entries:
11.16% api_overhead_be  [kernel.kallsyms]   [k] __pageblock_pfn_to_page
7.85%  api_overhead_be  [kernel.kallsyms]   [k] cpa_flush
7.59%  api_overhead_be  [kernel.kallsyms]   [k] find_next_iomem_res
7.24%  api_overhead_be  [kernel.kallsyms]   [k] pages_are_mergeable
5.53%  api_overhead_be  [kernel.kallsyms]   [k] lookup_address_in_pgd_attr

Without this patch clear_page_erms() dominates execution time which is
also not pipelined with migration jobs. With this patch page clearing
will get pipelined with migration job and will free CPU for more work.

v2: Handle regression on dgfx(Himal)
    Update commit message as no ttm API changes needed.
v3: Fix Kunit test.
v4: handle data leak on cpu mmap(Thomas)
v5: s/gpu_page_clear/gpu_page_clear_sys and move setting
    it to xe_ttm_sys_mgr_init() and other nits (Matt Auld)
v6: Disable it when init_on_alloc and/or init_on_free is active(Matt)
    Use compute-benchmarks as reporter used it to report this
    allocation latency issue also a proper test application than mime.
    In v5, the test showed significant reduction in alloc latency but
    that is not the case any more, I think this was mostly because
    previous test was done on IFWI which had low mem BW from CPU.

Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240816135154.19678-2-nirmoy.das@intel.comSigned-off-by: Nirmoy Das <nirmoy.das@intel.com>

23683061

drm/ttm: Add a flag to allow drivers to skip clear-on-free · decbfaf0

Nirmoy Das authored Aug 16, 2024

Add TTM_TT_FLAG_CLEARED_ON_FREE, which DRM drivers can set before
releasing backing stores if they want to skip clear-on-free.

Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240816135154.19678-1-nirmoy.das@intel.comSigned-off-by: Nirmoy Das <nirmoy.das@intel.com>

decbfaf0

drm/xe/oa: Use vma_pages() helper function in xe_oa_mmap() · 6841a26e

Thorsten Blum authored Aug 19, 2024

Use the vma_pages() helper function and remove the following
Coccinelle/coccicheck warning reported by vma_pages.cocci:

WARNING: Consider using vma_pages helper on vma
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240819095751.539645-2-thorsten.blum@toblux.com

6841a26e

drm/xe/display: Make display suspend/resume work on discrete · cb8f81c1

Maarten Lankhorst authored Aug 06, 2024

We should unpin before evicting all memory, and repin after GT resume.
This way, we preserve the contents of the framebuffers, and won't hang
on resume due to migration engine not being restored yet.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Fixes: dd08ebf6 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org # v6.8+
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240806105044.596842-3-maarten.lankhorst@linux.intel.comSigned-off-by: Maarten Lankhorst,,, <maarten.lankhorst@linux.intel.com>

cb8f81c1

drm/xe/display: Match i915 driver suspend/resume sequences better · 492be2a0

Maarten Lankhorst authored Aug 06, 2024

Suspend fbdev sooner, and disable user access before suspending to
prevent some races. I've noticed this when comparing xe suspend to
i915's.

Matches the following commits from i915:
24b412b1 ("drm/i915: Disable intel HPD poll after DRM poll init/enable")
1ef28d86 ("drm/i915: Suspend the framebuffer console earlier during system suspend")
bd738d85 ("drm/i915: Prevent modesets during driver init/shutdown")

Thanks to Imre for pointing me to those commits.

Driver shutdown is currently missing, but I have some idea how to
implement it next.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Imre Deak <imre.deak@intel.com>
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240806105044.596842-2-maarten.lankhorst@linux.intel.comSigned-off-by: Maarten Lankhorst,,, <maarten.lankhorst@linux.intel.com>

492be2a0

drm/xe: prevent UAF around preempt fence · 7116c35a

Matthew Auld authored Aug 14, 2024

The fence lock is part of the queue, therefore in the current design
anything locking the fence should then also hold a ref to the queue to
prevent the queue from being freed.

However, currently it looks like we signal the fence and then drop the
queue ref, but if something is waiting on the fence, the waiter is
kicked to wake up at some later point, where upon waking up it first
grabs the lock before checking the fence state. But if we have already
dropped the queue ref, then the lock might already be freed as part of
the queue, leading to uaf.

To prevent this, move the fence lock into the fence itself so we don't
run into lifetime issues. Alternative might be to have device level
lock, or only release the queue in the fence release callback, however
that might require pushing to another worker to avoid locking issues.

Fixes: dd08ebf6 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2454
References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2342
References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2020Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240814110129.825847-2-matthew.auld@intel.com

7116c35a

drm/xe: Remove redundant param from xe_bo_create_user · 6b77dab5

Nirmoy Das authored Aug 16, 2024

BO from xe_bo_create_user() will always be of type,
ttm_bo_type_device. So remove that redundant parameter.

Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240816102248.25628-1-nirmoy.das@intel.comSigned-off-by: Nirmoy Das <nirmoy.das@intel.com>

6b77dab5

18 Aug, 2024 9 commits

drm/xe/device: Remove unused xe_device::usm::num_vm_in_* · 4099cfda

Francois Dugast authored Aug 09, 2024

Those counters were used to keep track of the numbers VMs in fault mode
and in non-fault mode, to determine if the whole device was in fault mode
or not. This is no longer needed so remove those variables and their
usages.
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240809155156.1955925-12-francois.dugast@intel.com

4099cfda

drm/xe/vm: Remove restriction that all VMs must be faulting if one is · 226d92e4

Francois Dugast authored Aug 09, 2024

With this restriction, all VMs on the device must be faulting VMs if there
is already one faulting VM, in which case the device is considered in
fault mode. This prevents for example an application from running 3D jobs
for the compositor while submitting a SVM compute job on the same device.

Now that mutual exclusion of faulting LR jobs and dma fence jobs is
ensured on the hw engine group, remove this restriction to allow running
faulting and non-faulting VMs on the same device.
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240809155156.1955925-11-francois.dugast@intel.com

226d92e4

drm/xe/exec: Switch hw engine group execution mode upon job submission · d16ef1a1

Francois Dugast authored Aug 09, 2024

If the job about to be submitted is a dma-fence job, update the current
execution mode of the hw engine group. This triggers an immediate suspend
of the exec queues running faulting long-running jobs.

If the job about to be submitted is a long-running job, kick a new worker
used to resume the exec queues running faulting long-running jobs once
the dma-fence jobs have completed.

v2: Kick the resume worker from exec IOCTL, switch to unordered workqueue,
destroy it after use (Matt Brost)

v3: Do not resume if no exec queue was suspended (Matt Brost)

v4: Squash commits (Matt Brost)

v5: Do not kick the worker when xe_vm_in_preempt_fence_mode (Matt Brost)
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240809155156.1955925-10-francois.dugast@intel.com

d16ef1a1

drm/xe/hw_engine_group: Ensure safe transition between execution modes · 770bd1d3

Francois Dugast authored Aug 09, 2024

Provide a way to safely transition execution modes of the hw engine group
ahead of the actual execution. When necessary, either wait for running
jobs to complete or preempt them, thus ensuring mutual exclusion between
execution modes.

Unlike a mutex, the rw_semaphore used in this context allows multiple
submissions in the same mode.

v2: Use lockdep_assert_held_write, add annotations (Matt Brost)

v3: Fix kernel doc, remove redundant code (Matt Brost)

v4: Now that xe_hw_engine_group_suspend_faulting_lr_jobs can fail,
    propagate the error to the caller (Matt Brost)
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240809155156.1955925-9-francois.dugast@intel.com

770bd1d3

drm/xe/hw_engine_group: Add helper to wait for dma fence jobs · 2750ff97

Francois Dugast authored Aug 09, 2024

This is a required feature for faulting long running jobs not to be
submitted while dma fence jobs are running on the hw engine group.

v2: Switch to lockdep_assert_held_write in worker, get a proper reference
for the last fence (Matt Brost)

v3: Directly call dma_fence_put with the fence ref (Matt Brost)
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240809155156.1955925-8-francois.dugast@intel.com

2750ff97

drm/xe/exec_queue: Prepare last fence for hw engine group resume context · 0d92cd89

Francois Dugast authored Aug 09, 2024

Ensure we can safely take a ref of the exec queue's last fence from the
context of resuming jobs from the hw engine group. The locking requirements
differ from the general case, hence the introduction of this new function.

v2: Add kernel doc, rework the code to prevent code duplication

v3: Fix kernel doc, remove now unnecessary lockdep variants (Matt Brost)

v4: Remove new put function (Matt Brost)
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240809155156.1955925-7-francois.dugast@intel.com

0d92cd89

drm/xe/exec_queue: Remove duplicated code · 7f0d7bee

Francois Dugast authored Aug 09, 2024

This code section is the same as the body of
xe_exec_queue_last_fence_put_unlocked() so call the function instead and
remove duplicated code to make maintenance easier.
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240809155156.1955925-6-francois.dugast@intel.com

7f0d7bee

drm/xe/hw_engine_group: Add helper to suspend faulting LR jobs · 53fdfa19

Francois Dugast authored Aug 09, 2024

This is a required feature for dma fence jobs to preempt faulting long
running jobs in order to ensure mutual exclusion on a given hw engine
group.

v2: Pipeline calls to suspend(q) and suspend_wait(q) to improve
    efficiency, switch to lockdep_assert_held_write (Matt Brost)

v3: Return error on suspend_wait failure to propagate on the call stack
    up to IOCTL (Matt Brost)
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240809155156.1955925-5-francois.dugast@intel.com

53fdfa19

'drm/xe/hw_engine_group: Register hw engine group's exec queues · 7970cb36

Francois Dugast authored Aug 09, 2024

Add helpers to safely add and delete the exec queues attached to a hw
engine group, and make use them at the time of creation and destruction of
the exec queues. Keeping track of them is required to control the
execution mode of the hw engine group.

v2: Improve error handling and robustness, suspend exec queues created in
fault mode if group in dma-fence mode, init queue link (Matt Brost)

v3: Delete queue from hw engine group when it is destroyed by the user,
also clean up at the time of closing the file in case the user did
not destroy the queue

v4: Use correct list when checking if empty, do not add the queue if VM
is in xe_vm_in_preempt_fence_mode (Matt Brost)

v5: Remove unrelated newline, add checks and asserts for group, unwind on
suspend failure (Matt Brost)
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240809155156.1955925-4-francois.dugast@intel.com

7970cb36