Commits · 12c17b9d62663c14a5343d6742682b3e67280754 · nexedi / linux

22 Apr, 2020 16 commits

drm/amdgpu: fix kernel page fault issue by ras recovery on sGPU · 12c17b9d

Guchun Chen authored Apr 16, 2020

When running ras uncorrectable error injection and triggering GPU
reset on sGPU, below issue is observed. It's caused by the list
uninitialized when accessing.

[   80.047227] BUG: unable to handle page fault for address: ffffffffc0f4f750
[   80.047300] #PF: supervisor write access in kernel mode
[   80.047351] #PF: error_code(0x0003) - permissions violation
[   80.047404] PGD 12c20e067 P4D 12c20e067 PUD 12c210067 PMD 41c4ee067 PTE 404316061
[   80.047477] Oops: 0003 [#1] SMP PTI
[   80.047516] CPU: 7 PID: 377 Comm: kworker/7:2 Tainted: G           OE     5.4.0-rc7-guchchen #1
[   80.047594] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
[   80.047888] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

12c17b9d

drm/amdgpu: Disable FRU read on Arcturus · 69d0c18d

Kent Russell authored Apr 16, 2020

Update the list with supported Arcturus chips, but disable for now until
final list is confirmed.

Ideally we can poll atombios for FRU support, instead of maintaining
this list of chips, but this will enable serial number reading for
supported ASICs for the time-being.
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

69d0c18d

drm/amd/powerplay: fix resume failed as smu table initialize early exit · 4e2fec33

Prike Liang authored Apr 15, 2020

When the amdgpu in the suspend/resume loop need notify the dpm disabled,
otherwise the smu table will be uninitialize and result in resume failed.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Tested-by: Mengbing Wang <Mengbing.Wang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4e2fec33

drm/amdgpu/gmc: Fix spelling mistake. · 53c9c89a

Rajneesh Bhardwaj authored Apr 05, 2020

Fixes a minor typo in the file.
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

53c9c89a

drm/amdgpu: cache smu fw version info · e57761c6

John Clements authored Apr 15, 2020

reduce cmd submission to smu by caching version info
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e57761c6

Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2" · fdd21e62

Kent Russell authored Apr 13, 2020

This reverts commit c12b84d6.

The original patch causes a RAS event and subsequent kernel hard-hang
when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
Arcturus

dmesg output at hang time:
[drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
amdgpu 0000:67:00.0: GPU reset begin!
Evicting PASID 0x8000 queues
Started evicting pasid 0x8000
qcm fence wait loop timeout expired
The cp might be in an unrecoverable state due to an unsuccessful queues preemption
Failed to evict process queues
Failed to suspend process 0x8000
Finished evicting pasid 0x8000
Started restoring pasid 0x8000
Finished restoring pasid 0x8000
[drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
amdgpu: [powerplay] Failed to send message 0x26, response 0x0
amdgpu: [powerplay] Failed to set soft min gfxclk !
amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
amdgpu: [powerplay] Failed to send message 0x7, response 0x0
amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

fdd21e62

drm/amdgpu/gfx9: add gfxoff quirk · 079c72ad

Alex Deucher authored Apr 09, 2020

Fix screen corruption with firefox.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=207171Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

079c72ad

drm/amdgpu: set mp1 state before reload · 7f70443f

John Clements authored Apr 14, 2020

Set MP1 state to prepare for unload before reloading SMU FW
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7f70443f

drm/amdgpu: update psp fw loading sequence · 40e611bd

John Clements authored Apr 14, 2020

Added dedicated function to check if particular fw should be skipped from loading.

Added dedicated function for SMU FW loading via PSP
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

40e611bd

drm/amd/powerplay: update Arcturus smu-driver if header · 47c11cff

Evan Quan authored Apr 13, 2020

To fit the latest PMFW.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

47c11cff

drm/amd/powerplay: properly set the dpm_enabled state · 774e335b

Evan Quan authored Apr 07, 2020

On the ASIC powered down(in baco or system suspend),
the dpm_enabled will be set as false. Then all access
(e.g. df state setting issued on RAS error event) to
SMU will be blocked.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

774e335b

drm/amd/powerplay: correct i2c eeprom init/fini sequence · 94e0805b

Evan Quan authored Apr 13, 2020

As data transfer may starts immediately after i2c eeprom init
completed. Thus i2c eeprom should be initialized after SMU
ready. And i2c data transfer should be prohibited when SMU
down. That is the i2c eeprom fini sequence needs to be
updated also.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

94e0805b

drm/amd/powerplay: bump the NAVI10 smu-driver if version · 56ddddaa

Evan Quan authored Mar 06, 2020

To fit the latest SMC firmware 42.53 and eliminate the
warning on driver loading.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

56ddddaa

drm/amd/powerplay: revise the way to retrieve the board parameters · 02c0bb4e

Evan Quan authored Mar 06, 2020

It can support different NV1x ASIC better. And this can guard
no member got missing.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

02c0bb4e

drm/amdgpu: fix the hw hang during perform system reboot and reset · ced1ba97

Prike Liang authored Apr 13, 2020

The system reboot failed as some IP blocks enter power gate before perform
hw resource destory. Meanwhile use unify interface to set device CGPG to ungate
state can simplify the amdgpu poweroff or reset ungate guard.

Fixes: 487eca11 ("drm/amdgpu: fix gfx hang during suspend with video playback (v2)")
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Tested-by: Mengbing Wang <Mengbing.Wang@amd.com>
Tested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ced1ba97

drm/amd/display: remove redundant assignment to variable dp_ref_clk_khz · 5edb7691

Colin Ian King authored Apr 10, 2020

The variable dp_ref_clk_khz is being initialized with a value that is
never read and it is being updated later with a new value.  The
initialization is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5edb7691

13 Apr, 2020 19 commits

drm/radeon: remove defined but not used variables in ci_dpm.c · 43ad9b39

Jason Yan authored Apr 13, 2020

Fix the following gcc warning:

drivers/gpu/drm/radeon/ci_dpm.c:82:36: warning: ‘defaults_saturn_pro’
defined but not used [-Wunused-const-variable=]
 static const struct ci_pt_defaults defaults_saturn_pro =
                                    ^~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/radeon/ci_dpm.c:68:36: warning: ‘defaults_bonaire_pro’
defined but not used [-Wunused-const-variable=]
 static const struct ci_pt_defaults defaults_bonaire_pro =
                                    ^~~~~~~~~~~~~~~~~~~~
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

43ad9b39

drm/radeon: remove defined but not used 'dte_data_tahiti_le' · 01e5e998

Jason Yan authored Apr 13, 2020

Fix the following gcc warning:

drivers/gpu/drm/radeon/si_dpm.c:255:33: warning: ‘dte_data_tahiti_le’
defined but not used [-Wunused-const-variable=]
 static const struct si_dte_data dte_data_tahiti_le =
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

01e5e998

drm/amdgpu: remove dead code in si_dpm.c · 8e2f8420

Jason Yan authored Apr 13, 2020

This code is dead, let's remove it.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8e2f8420

drm/amd/amdgpu: remove hardcoded module name in prints · dd4fa6c1

Aurabindo Pillai authored Apr 08, 2020

Let format prefixes take care of printing the module name
through pr_fmt and dev_fmt definitions.
Signed-off-by: Aurabindo Pillai <mail@aurabindo.in>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

dd4fa6c1

drm/amd/amdgpu: add print prefix for dev_* variants · 539489fc

Aurabindo Pillai authored Apr 08, 2020

Define dev_fmt macro for informative print messages
Signed-off-by: Aurabindo Pillai <mail@aurabindo.in>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

539489fc

drm/amd/amdgpu: add prefix for pr_* prints · d57229b1

Aurabindo Pillai authored Apr 08, 2020

amdgpu uses lots of pr_* calls for printing error messages.
With this prefix, errors shall be more obvious to the end
use regarding its origin, and may help debugging.

Prefix format:

[xxx.xxxxx] amdgpu: ...
Signed-off-by: Aurabindo Pillai <mail@aurabindo.in>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d57229b1

drm/amd/display: code clean up in dce80_hw_sequencer.c · bba8289b

Jason Yan authored Apr 13, 2020

Fix the following gcc warning:

drivers/gpu/drm/amd/amdgpu/../display/dc/dce80/dce80_hw_sequencer.c:43:46:
warning: ‘reg_offsets’ defined but not used [-Wunused-const-variable=]
 static const struct dce80_hw_seq_reg_offsets reg_offsets[] = {
                                              ^~~~~~~~~~~
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bba8289b

drm/amdgpu/ring: simplify scheduler setup logic · a4c24680

Alex Deucher authored Apr 09, 2020

Set up a GPU scheduler based on the ring flag rather
than the ring type.
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a4c24680

drm/amdgpu/kiq: add no_scheduler flag to KIQ · a783910d

Alex Deucher authored Apr 09, 2020

We don't want a GPU scheduler for this ring.
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a783910d

drm/amdgpu/ring: add no_scheduler flag · cb3d1085

Alex Deucher authored Apr 09, 2020

This allows IPs to flag whether a specific ring requires
a GPU scheduler or not.  E.g., sometimes instances of an
IP are asymmetric and have different capabilities.
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

cb3d1085

drm/amdgpu/powerplay: get SMC FW size to a flexible way · e8663832

Likun Gao authored Sep 16, 2019

Get SMC fw size before backdoor loading instead of giving an
certain value, as it may different for different ASIC.
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Kevin Wang <kevin1.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e8663832

drm/amdgpu: fix wrong vram lost counter increment V2 · dadce777

Evan Quan authored Apr 10, 2020

Vram lost counter is wrongly increased by two during baco reset.

V2: assumed vram lost for mode1 reset on all ASICs
Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

dadce777

drm/amdgpu: replace DRM prefix with PCI device info for GFX RAS · ed72aa21

Guchun Chen authored Apr 13, 2020

Prefix RAS message printing in GFX IP with PCI device info,
which assists the debug in multiple GPU case.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ed72aa21

drm/amdgpu: resume kiq access debugfs · d32709da

Yintian Tao authored Apr 13, 2020

If there is no GPU hang, user still can access
debugfs through kiq.
Signed-off-by: Yintian Tao <yttao@amd.com>
Reviewed-by: Monk Liu <Monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d32709da

drm/amdgpu: refine ras related message print · 6952e99c

Guchun Chen authored Apr 10, 2020

Prefix ras related kernel message logging with PCI
device info by replacing DRM_INFO/WARN/ERROR with
dev_info/warn/err. This can clearly tell user about
GPU device information where ras is. And add some
other ras message printing to make it more clear
and friendly as well.
Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

6952e99c

drm/amdgpu: add uncorrectable error count print in UMC ecc irq cb · 1f3ef0ef

Guchun Chen authored Apr 10, 2020

Uncorrectable error count printing is missed when issuing UMC
UE injection. When going to the error count log function in GPU
recover work thread, there is no chance to get correct error count
value by last error injection and print, because the error status
register is automatically cleared after reading in UMC ecc irq
callback. So add such message printing in UMC ecc irq cb to be
consistent with other RAS error interrupt cases.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

1f3ef0ef

drm/amd/powerplay: force the trim of the mclk dpm_levels if OD is enabled · 7adf5619

Sergei Lopatin authored Jun 26, 2019

Should prevent flicker if PP_OVERDRIVE_MASK is set.

bug: https://bugs.freedesktop.org/show_bug.cgi?id=102646
bug: https://bugs.freedesktop.org/show_bug.cgi?id=108941
bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1088
bug: https://gitlab.freedesktop.org/drm/amd/-/issues/628Signed-off-by: Sergei Lopatin <magist3r@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7adf5619

drm/amd/display: Change "error" to "dc_log" at amdgpu_dm dpcd reading stage · f9135b08

Zhan Liu authored Apr 09, 2020

[Why]
If reading dpcd happens ahead of hw initialization, then aconnector is NULL
at this point. This is expected, so there is no need to output an error (which will
spam dmesg.log)

[How]
Change type of message from "error" to "DC_LOG_DC".
Signed-off-by: Zhan Liu <zhan.liu@amd.com>
Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Acked-by: Zhan Liu <zhan.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f9135b08

drm/amdgpu: restrict debugfs register access under SR-IOV · 95a2f917

Yintian Tao authored Apr 07, 2020

Under bare metal, there is no more else to take
care of the GPU register access through MMIO.
Under Virtualization, to access GPU register is
implemented through KIQ during run-time due to
world-switch.

Therefore, under SR-IOV user can only access
debugfs to r/w GPU registers when meets all
three conditions below.
- amdgpu_gpu_recovery=0
- TDR happened
- in_gpu_reset=0

v2: merge amdgpu_virt_can_access_debugfs() into
    amdgpu_virt_enable_access_debugfs()

v3: drop ret variable in amdgpu_virt_enable_access_debugfs()
    and directly return result
Signed-off-by: Yintian Tao <yttao@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

95a2f917

09 Apr, 2020 5 commits

drm/amdgpu: increased atom cmd timeout · 9a785c7a

John Clements authored Apr 09, 2020

added macro to define timeout
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

9a785c7a

drm/amd/powerplay: unload mp1 for Arcturus RAS baco reset · 5aaa8fff

Evan Quan authored Mar 24, 2020

This sequence is recommended by PMFW team for the baco reset
with PMFW reloaded. And it seems able to address the random
failure seen on Arcturus.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5aaa8fff

amdgpu_kms: Remove unnecessary condition check · ad36d71b

Aurabindo Pillai authored Apr 07, 2020

Execution will only reach here if the asserted condition is true.
Hence there is no need for the additional check.
Signed-off-by: Aurabindo Pillai <mail@aurabindo.in>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ad36d71b

drm/amdgpu/display: fix warning when compiling without debugfs · ef91e8b5

Alex Deucher authored Apr 08, 2020

fixes unused variable warning.
Reported-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Mikita Lipski <mikita.lipski@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ef91e8b5

drm/amdgpu: unify fw_write_wait for new gfx9 asics · ba714a56

Aaron Liu authored Apr 07, 2020

Make the fw_write_wait default case true since presumably all new
gfx9 asics will have updated firmware. That is using unique WAIT_REG_MEM
packet with opration=1.
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Tested-by: Aaron Liu <aaron.liu@amd.com>
Tested-by: Yuxian Dai <Yuxian.Dai@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ba714a56