An error occurred fetching the project authors.
- 07 Jan, 2022 1 commit
-
-
Mario Limonciello authored
This codepath should be running in both s0ix and s3, but only does currently because s3 and s0ix are both set in the s0ix case. Signed-off-by:
Mario Limonciello <mario.limonciello@amd.com> Acked-by:
Evan Quan <evan.quan@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 30 Dec, 2021 3 commits
-
-
Alex Deucher authored
Chips with no display hardware should return false for DC support. v2: drop Arcturus and Aldebaran Fixes: f7f12b25 ("drm/amdgpu: default to true in amdgpu_device_asic_has_dc_support") Reviewed-by:
Evan Quan <evan.quan@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Reported-by:
Tareque Md.Hanif <tarequemd.hanif@yahoo.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Surbhi Kakarya authored
If the event guard is enabled and VF doesn't receive an ack from PF for full access, the guest driver load crashes. This is caused due to the call to ttm_device_clear_dma_mappings with non-initialized mman during driver tear down. This patch adds the necessary condition to check if the mman initialization passed or not and takes the path based on the condition output. Signed-off-by:
Surbhi Kakarya <Surbhi.Kakarya@amd.com> Acked-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Leslie Shi authored
drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure [Why] In amdgpu_driver_load_kms, when amdgpu_device_init returns error during driver modprobe, it will start the error handle path immediately and call into amdgpu_device_unmap_mmio as well to release mapped VRAM. However, in the following release callback, driver stills visits the unmapped memory like vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs. [How] call amdgpu_device_unmap_mmio() if device is unplugged to prevent invalid memory address in vcn_v3_0_sw_fini() when GPU initialization failure. Signed-off-by:
Leslie Shi <Yuliang.Shi@amd.com> Reviewed-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Acked-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 28 Dec, 2021 2 commits
-
-
sashank saye authored
For Aldebaran chip passthrough case we need to intimate SMU about special handling for SBR.On older chips we send LightSBR to SMU, enabling the same for Aldebaran. Slight difference, compared to previous chips, is on Aldebaran, SMU would do a heavy reset on SBR. Hence, the word Heavy instead of Light SBR is used for SMU to differentiate. Reviewed by: Shaoyun.liu <Shaoyun.liu@amd.com> Signed-off-by:
sashank saye <sashank.saye@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Victor Skvortsov authored
Driver needs to call get_xgmi_info() before ip_init to determine whether it needs to handle a pending hive reset. Signed-off-by:
Victor Skvortsov <victor.skvortsov@amd.com> Reviewed-by:
David Nieto <david.nieto@amd.com> Reviewed by: shaoyun.liu <Shaoyun.lui@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 16 Dec, 2021 2 commits
-
-
Victor Skvortsov authored
We want to be able to call virt data exchange conditionally after gmc sw init to reserve bad pages as early as possible. Since this is a conditional call, we will need to call it again unconditionally later in the init sequence. Refactor the data exchange function so it can be called multiple times without re-initializing the work item. v2: Cleaned up the code. Kept the original call to init_exchange_data() inside early init to initialize the work item, afterwards call exchange_data() when needed. Signed-off-by:
Victor Skvortsov <victor.skvortsov@amd.com> Reviewed By: Shaoyun.liu <Shaoyun.liu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Huang Rui authored
The job embedded fence donesn't initialize the flags at dma_fence_init(). Then we will go a wrong way in amdgpu_fence_get_timeline_name callback and trigger a null pointer panic once we enabled the trace event here. So introduce new amdgpu_fence object to indicate the job embedded fence. [ 156.131790] BUG: kernel NULL pointer dereference, address: 00000000000002a0 [ 156.131804] #PF: supervisor read access in kernel mode [ 156.131811] #PF: error_code(0x0000) - not-present page [ 156.131817] PGD 0 P4D 0 [ 156.131824] Oops: 0000 [#1] PREEMPT SMP PTI [ 156.131832] CPU: 6 PID: 1404 Comm: sdma0 Tainted: G OE 5.16.0-rc1-custom #1 [ 156.131842] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016 [ 156.131848] RIP: 0010:strlen+0x0/0x20 [ 156.131859] Code: 89 c0 c3 0f 1f 80 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31 [ 156.131872] RSP: 0018:ffff9bd0018dbcf8 EFLAGS: 00010206 [ 156.131880] RAX: 00000000000002a0 RBX: ffff8d0305ef01b0 RCX: 000000000000000b [ 156.131888] RDX: ffff8d03772ab924 RSI: ffff8d0305ef01b0 RDI: 00000000000002a0 [ 156.131895] RBP: ffff9bd0018dbd60 R08: ffff8d03002094d0 R09: 0000000000000000 [ 156.131901] R10: 000000000000005e R11: 0000000000000065 R12: ffff8d03002094d0 [ 156.131907] R13: 000000000000001f R14: 0000000000070018 R15: 0000000000000007 [ 156.131914] FS: 0000000000000000(0000) GS:ffff8d062ed80000(0000) knlGS:0000000000000000 [ 156.131923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 156.131929] CR2: 00000000000002a0 CR3: 000000001120a005 CR4: 00000000003706e0 [ 156.131937] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 156.131942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 156.131949] Call Trace: [ 156.131953] <TASK> [ 156.131957] ? trace_event_raw_event_dma_fence+0xcc/0x200 [ 156.131973] ? ring_buffer_unlock_commit+0x23/0x130 [ 156.131982] dma_fence_init+0x92/0xb0 [ 156.131993] amdgpu_fence_emit+0x10d/0x2b0 [amdgpu] [ 156.132302] amdgpu_ib_schedule+0x2f9/0x580 [amdgpu] [ 156.132586] amdgpu_job_run+0xed/0x220 [amdgpu] v2: fix mismatch warning between the prototype and function name (Ray, kernel test robot) Signed-off-by:
Huang Rui <ray.huang@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 14 Dec, 2021 2 commits
-
-
Yann Dirson authored
Those are not today pulled by the sphinx doc, but better be ready. Signed-off-by:
Yann Dirson <ydirson@free.fr> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Guchun Chen authored
Updated for consistency when accessing drm_device from amdgpu driver. Signed-off-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 13 Dec, 2021 6 commits
-
-
Philip Yang authored
If host and amdgpu IOMMU is not enabled or IOMMU is pass through mode, set adev->ram_is_direct_mapped flag which will be used to optimize memory usage for multi GPU mappings. Signed-off-by:
Philip Yang <Philip.Yang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Lang Yu authored
It is useful to maintain error context when debugging SW/FW issues. Introduce amdgpu_device_halt() for this purpose. It will bring hardware to a kind of halt state, so that no one can touch it any more. Compare to a simple hang, the system will keep stable at least for SSH access. Then it should be trivial to inspect the hardware state and see what's going on. v2: - Set adev->no_hw_access earlier to avoid potential crashes.(Christian) Suggested-by:
Christian Koenig <christian.koenig@amd.com> Suggested-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by:
Lang Yu <lang.yu@amd.com> Reviewed-by:
Christian Koenig <christian.koenig@amd.co> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Lang Yu authored
We found some headaches on ASICs don't need that, so remove that for them. Suggested-by:
Lijo Lazar <lijo.lazar@amd.com> Signed-off-by:
Lang Yu <lang.yu@amd.com> Reviewed-by:
Kevin Wang <kevinyang.wang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Isabella Basso authored
This fixes various warnings relating to erroneous docstring syntax, of which some are listed below: warning: Function parameter or member 'adev' not described in 'amdgpu_atomfirmware_ras_rom_addr' ... warning: expecting prototype for amdgpu_atpx_validate_functions(). Prototype was for amdgpu_atpx_validate() instead ... warning: Excess function parameter 'mem' description in 'amdgpu_preempt_mgr_new' ... warning: Cannot understand * @kfd_get_cu_occupancy - Collect number of waves in-flight on this device ... warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst Signed-off-by:
Isabella Basso <isabbasso@riseup.net> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Zhigang Luo authored
For SRIOV VF, the XGMI topology was not recovered after reset. This change added code to SRIOV VF reset function to update XGMI topology for SRIOV VF after reset. Signed-off-by:
Zhigang Luo <zhigang.luo@amd.com> Reviewed-by:
Shaoyun Liu <shaoyun.liu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Zhigang Luo authored
On SRIOV, host driver can support FLR(function level reset) on individual VF within the hive which might bring the individual device back to normal without the necessary to execute the hive reset. If the FLR failed , host driver will trigger the hive reset, each guest VF will get reset notification before the real hive reset been executed. The VF device can handle the reset request individually in it's reset work handler. This change updated gpu recover sequence to skip reset other device in the same hive for SRIOV VF. Signed-off-by:
Zhigang Luo <zhigang.luo@amd.com> Reviewed-by:
Shaoyun Liu <shaoyun.liu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 01 Dec, 2021 5 commits
-
-
shaoyunl authored
This change revert previous commits: 9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov") 271fd38c ("drm/amdgpu: move kfd post_reset out of reset_sriov function") This change moves the amdgpu_amdkfd_pre_reset to an earlier place in amdgpu_device_reset_sriov, presumably to address the sequence issue that the first patch was originally meant to fix. Some register access(GRBM_GFX_CNTL) only be allowed on full access mode. Move kfd_pre_reset and kfd_post_reset back inside reset_sriov function. Fixes: 9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov") Fixes: 271fd38c ("drm/amdgpu: move kfd post_reset out of reset_sriov function") Signed-off-by:
shaoyunl <shaoyun.liu@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Flora Cui authored
since vkms support atomic KMS interface Signed-off-by:
Flora Cui <flora.cui@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Acked-by:
Alex Deucher <aleander.deucher@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
shaoyunl authored
This change revert previous commits: 9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov") 271fd38c ("drm/amdgpu: move kfd post_reset out of reset_sriov function") This change moves the amdgpu_amdkfd_pre_reset to an earlier place in amdgpu_device_reset_sriov, presumably to address the sequence issue that the first patch was originally meant to fix. Some register access(GRBM_GFX_CNTL) only be allowed on full access mode. Move kfd_pre_reset and kfd_post_reset back inside reset_sriov function. Fixes: 9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov") Fixes: 271fd38c ("drm/amdgpu: move kfd post_reset out of reset_sriov function") Signed-off-by:
shaoyunl <shaoyun.liu@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Stanley.Yang authored
v2: still need call ras_disable_all_featrures to handle ras initilization failure case. Function amdgpu_device_fini_hw is called before amdgpu_device_fini_sw, so ras ta will unload before send ras disable command, ras dsiable operation must before hw fini. Signed-off-by:
Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Flora Cui authored
since vkms support atomic KMS interface Signed-off-by:
Flora Cui <flora.cui@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Acked-by:
Alex Deucher <aleander.deucher@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 24 Nov, 2021 2 commits
-
-
shaoyunl authored
Fixes: 9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov") For sriov XGMI configuration, the host driver will handle the hive reset, so in guest side, the reset_sriov only be called once on one device. This will make kfd post_reset unblanced with kfd pre_reset since kfd pre_reset already been moved out of reset_sriov function. Move kfd post_reset out of reset_sriov function to make them balance . Signed-off-by:
shaoyunl <shaoyun.liu@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
shaoyunl authored
Fixes: 9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov") For sriov XGMI configuration, the host driver will handle the hive reset, so in guest side, the reset_sriov only be called once on one device. This will make kfd post_reset unblanced with kfd pre_reset since kfd pre_reset already been moved out of reset_sriov function. Move kfd post_reset out of reset_sriov function to make them balance . Signed-off-by:
shaoyunl <shaoyun.liu@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 22 Nov, 2021 1 commit
-
-
Evan Quan authored
Just bail out if the target IP block is already in the desired powergate/ungate state. This can avoid some duplicate settings which sometimes may cause unexpected issues. Link: https://lore.kernel.org/all/YV81vidWQLWvATMM@zn.tnic/ Bug: https://bugzilla.kernel.org/show_bug.cgi?id=214921 Bug: https://bugzilla.kernel.org/show_bug.cgi?id=215025 Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1789Signed-off-by:
Evan Quan <evan.quan@amd.com> Tested-by:
Borislav Petkov <bp@suse.de> Reviewed-by:
Lijo Lazar <lijo.lazar@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 17 Nov, 2021 2 commits
-
-
Evan Quan authored
Just bail out if the target IP block is already in the desired powergate/ungate state. This can avoid some duplicate settings which sometimes may cause unexpected issues. Link: https://lore.kernel.org/all/YV81vidWQLWvATMM@zn.tnic/ Bug: https://bugzilla.kernel.org/show_bug.cgi?id=214921 Bug: https://bugzilla.kernel.org/show_bug.cgi?id=215025 Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1789 Fixes: bf756fb8 ("drm/amdgpu: add missing cleanups for Polaris12 UVD/VCE on suspend") Signed-off-by:
Evan Quan <evan.quan@amd.com> Tested-by:
Borislav Petkov <bp@suse.de> Reviewed-by:
Lijo Lazar <lijo.lazar@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
-
Evan Quan authored
With the shadow buffer support from generic framebuffer emulation, it's possible now to have runpm kicked when no update for console. Signed-off-by:
Evan Quan <evan.quan@amd.com> Acked-by:
Christian König <christian.koenig@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 09 Nov, 2021 1 commit
-
-
shaoyunl authored
The KFD pre_reset should be called before reset been executed, it will hold the lock to prevent other rocm process to sent the packlage to hiq during host execute the real reset on the HW Signed-off-by:
shaoyunl <shaoyun.liu@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 05 Nov, 2021 1 commit
-
-
Alex Deucher authored
Properly handle SI DC support when CONFIG_DRM_AMD_DC_SI is not set. Fixes: f7f12b25 ("drm/amdgpu: default to true in amdgpu_device_asic_has_dc_support") Reviewed-by:
Evan Quan <evan.quan@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 03 Nov, 2021 2 commits
-
-
James Zhu authored
Remove duplicated kfd_resume_iommu which already runs in mdgpu_amdkfd_device_init. Tested-By:
Ken Moffat <zarniwhoop@ntlworld.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
James Zhu <James.Zhu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Jingwen Chen authored
[Why] In advance tdr mode, the real bad job will be resubmitted twice, while in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job is put one more time than other jobs. [How] Adding dma_fence_get before resbumit job in amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs Signed-off-by:
Jingwen Chen <Jingwen.Chen2@amd.com> Reviewed-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 28 Oct, 2021 1 commit
-
-
Lang Yu authored
amdgpu_fence_driver_sw_fini() should be executed before amdgpu_device_ip_fini(), otherwise fence driver resource won't be properly freed as adev->rings have been tore down. Fixes: 72c8c97b ("drm/amdgpu: Split amdgpu_device_fini into early and late") Signed-off-by:
Lang Yu <lang.yu@amd.com> Reviewed-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 19 Oct, 2021 1 commit
-
-
YuBiao Wang authored
[Why] drm_irq_uninstall is called in irq_fini_hw so that irq is disabled in sw stage. SMU (and maybe other IP blocks) fini_hw will call irq_put for cleanup and the whole cleanup process will be skipped because of drm->irq_enable = false. [How] Move ip_fini_early before irq_fini_hw. Signed-off-by:
YuBiao Wang <YuBiao.Wang@amd.com> Reviewed-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 13 Oct, 2021 1 commit
-
-
Yifan Zhang authored
When IOMMU disabled in sbios and kfd in iommuv2 path, iommuv2 init will fail. But this failure should not block amdgpu driver init. Reported-by:
youling <youling257@gmail.com> Tested-by:
youling <youling257@gmail.com> Signed-off-by:
Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by:
James Zhu <James.Zhu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 08 Oct, 2021 1 commit
-
-
Guchun Chen authored
adev_to_drm is used everywhere, so improve recent changes when accessing drm_device pointer from amdgpu_device. Signed-off-by:
Guchun Chen <guchun.chen@amd.com> Acked-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 07 Oct, 2021 1 commit
-
-
Nirmoy Das authored
Unify BO evicting functionality for possible memory types in amdgpu_ttm.c. Signed-off-by:
Nirmoy Das <nirmoy.das@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 05 Oct, 2021 5 commits
-
-
Guchun Chen authored
In current code, when a PCI error state pci_channel_io_normal is detectd, it will report PCI_ERS_RESULT_CAN_RECOVER status to PCI driver, and PCI driver will continue the execution of PCI resume callback report_resume by pci_walk_bridge, and the callback will go into amdgpu_pci_resume finally, where write lock is releasd unconditionally without acquiring such lock first. In this case, a deadlock will happen when other threads start to acquire the read lock. To fix this, add a member in amdgpu_device strucutre to cache pci_channel_state, and only continue the execution in amdgpu_pci_resume when it's pci_channel_io_frozen. Fixes: c9a6b82f ("drm/amdgpu: Implement DPC recovery") Suggested-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Yifan Zhang authored
This patch is to fix clinfo failure in Raven/Picasso: Number of platforms: 1 Platform Profile: FULL_PROFILE Platform Version: OpenCL 2.2 AMD-APP (3364.0) Platform Name: AMD Accelerated Parallel Processing Platform Vendor: Advanced Micro Devices, Inc. Platform Extensions: cl_khr_icd cl_amd_event_callback Platform Name: AMD Accelerated Parallel Processing Number of devices: 0 Signed-off-by:
Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by:
James Zhu <James.Zhu@amd.com> Tested-by:
James Zhu <James.Zhu@amd.com> Acked-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Guchun Chen authored
In current code, when a PCI error state pci_channel_io_normal is detectd, it will report PCI_ERS_RESULT_CAN_RECOVER status to PCI driver, and PCI driver will continue the execution of PCI resume callback report_resume by pci_walk_bridge, and the callback will go into amdgpu_pci_resume finally, where write lock is releasd unconditionally without acquiring such lock first. In this case, a deadlock will happen when other threads start to acquire the read lock. To fix this, add a member in amdgpu_device strucutre to cache pci_channel_state, and only continue the execution in amdgpu_pci_resume when it's pci_channel_io_frozen. Fixes: c9a6b82f ("drm/amdgpu: Implement DPC recovery") Suggested-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Christian König authored
Make sure that we notice this in error reports. Signed-off-by:
Christian König <christian.koenig@amd.com> Acked-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Christian König authored
This reverts commit 728e7e0c. Further discussion reveals that this feature is severely broken and needs to be reverted ASAP. GPU reset can never be delayed by userspace even for debugging or otherwise we can run into in kernel deadlocks. Signed-off-by:
Christian König <christian.koenig@amd.com> Acked-by:
Alex Deucher <alexander.deucher@amd.com> Acked-by:
Nirmoy Das <nirmoy.das@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-