1. 22 Jan, 2024 34 commits
  2. 18 Jan, 2024 6 commits
    • drm/amd/pm: enable amdgpu smu send message log · 0cd2bc06
      Yang Wang authored
      v1:
      Enable the amdgpu SMU driver message log.
      
      v2:
      Add the SMU/PMFW response value to the debug log.
      Signed-off-by: Yang Wang <KevinYang.Wang@amd.com>
      Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
      Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: update error condition check for umc_v12_0_query_error_address · a9e4f61d
      Tao Zhou authored
      Deferred error is also taken into account.
      Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Skip do PCI error slot reset during RAS recovery · 601429cc
      Stanley.Yang authored
      Why:
          The PCI error slot reset may be triggered after injecting an uncorrectable
          error (UE) into the UMC multiple times, which caused a system hang.
          [  557.371857] amdgpu 0000:af:00.0: amdgpu: GPU reset succeeded, trying to resume
          [  557.373718] [drm] PCIE GART of 512M enabled.
          [  557.373722] [drm] PTB located at 0x0000031FED700000
          [  557.373788] [drm] VRAM is lost due to GPU reset!
          [  557.373789] [drm] PSP is resuming...
          [  557.547012] mlx5_core 0000:55:00.0: mlx5_pci_err_detected Device state = 1 pci_status: 0. Exit, result = 3, need reset
          [  557.547067] [drm] PCI error: detected callback, state(1)!!
          [  557.547069] [drm] No support for XGMI hive yet...
          [  557.548125] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 0. Enter
          [  557.607763] mlx5_core 0000:55:00.0: wait vital counter value 0x16b5b after 1 iterations
          [  557.607777] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 1. Exit, err = 0, result = 5, recovered
          [  557.610492] [drm] PCI error: slot reset callback!!
          ...
          [  560.689382] amdgpu 0000:3f:00.0: amdgpu: GPU reset(2) succeeded!
          [  560.689546] amdgpu 0000:5a:00.0: amdgpu: GPU reset(2) succeeded!
          [  560.689562] general protection fault, probably for non-canonical address 0x5f080b54534f611f: 0000 [#1] SMP NOPTI
          [  560.701008] CPU: 16 PID: 2361 Comm: kworker/u448:9 Tainted: G           OE     5.15.0-91-generic #101-Ubuntu
          [  560.712057] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C11.AG.1 11/08/2023
          [  560.720959] Workqueue: amdgpu-reset-hive amdgpu_ras_do_recovery [amdgpu]
          [  560.728887] RIP: 0010:amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
          [  560.736891] Code: ff 41 89 c6 e9 1b ff ff ff 44 0f b6 45 b0 e9 4f ff ff ff be 01 00 00 00 4c 89 e7 e8 76 c9 8b ff 44 0f b6 45 b0 e9 3c fd ff ff <48> 83 ba 18 02 00 00 00 0f 84 6a f8 ff ff 48 8d 7a 78 be 01 00 00
          [  560.757967] RSP: 0018:ffa0000032e53d80 EFLAGS: 00010202
          [  560.763848] RAX: ffa00000001dfd10 RBX: ffa0000000197090 RCX: ffa0000032e53db0
          [  560.771856] RDX: 5f080b54534f5f07 RSI: 0000000000000000 RDI: ff11000128100010
          [  560.779867] RBP: ffa0000032e53df0 R08: 0000000000000000 R09: ffffffffffe77f08
          [  560.787879] R10: 0000000000ffff0a R11: 0000000000000001 R12: 0000000000000000
          [  560.795889] R13: ffa0000032e53e00 R14: 0000000000000000 R15: 0000000000000000
          [  560.803889] FS:  0000000000000000(0000) GS:ff11007e7e800000(0000) knlGS:0000000000000000
          [  560.812973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [  560.819422] CR2: 000055a04c118e68 CR3: 0000000007410005 CR4: 0000000000771ee0
          [  560.827433] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          [  560.835433] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
          [  560.843444] PKRU: 55555554
          [  560.846480] Call Trace:
          [  560.849225]  <TASK>
          [  560.851580]  ? show_trace_log_lvl+0x1d6/0x2ea
          [  560.856488]  ? show_trace_log_lvl+0x1d6/0x2ea
          [  560.861379]  ? amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
          [  560.867778]  ? show_regs.part.0+0x23/0x29
          [  560.872293]  ? __die_body.cold+0x8/0xd
          [  560.876502]  ? die_addr+0x3e/0x60
          [  560.880238]  ? exc_general_protection+0x1c5/0x410
          [  560.885532]  ? asm_exc_general_protection+0x27/0x30
          [  560.891025]  ? amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
          [  560.898323]  amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
          [  560.904520]  process_one_work+0x228/0x3d0
      How:
          In RAS recovery, a mode-1 reset is issued from RAS fatal error handling and is
          expected to reset all the nodes in a hive, so there is no need to issue another
          mode-1 reset during this procedure.
      Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
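
The "How" above amounts to a guard at the top of the slot-reset path: if RAS recovery is already resetting the hive, skip the extra reset. The following is a minimal standalone sketch under stated assumptions. All names (`demo_device`, `ras_in_recovery`, `demo_slot_reset`) are hypothetical stand-ins, not the amdgpu API, and the actual patch may be structured differently.

```c
#include <stdbool.h>

/* Hypothetical result code, loosely modeled on PCI error-recovery results. */
enum demo_ers_result {
	DEMO_ERS_RECOVERED
};

/* Hypothetical device state, reduced to what the guard needs. */
struct demo_device {
	bool ras_in_recovery; /* set while RAS fatal-error handling is running */
	int mode1_resets;     /* counts resets issued here, for illustration */
};

/*
 * Slot-reset callback sketch: when RAS recovery is in progress, a mode-1
 * reset has already been issued for the whole hive, so report success
 * without issuing a second one.
 */
static enum demo_ers_result demo_slot_reset(struct demo_device *dev)
{
	if (dev->ras_in_recovery)
		return DEMO_ERS_RECOVERED;

	dev->mode1_resets++; /* stands in for the real reset sequence */
	return DEMO_ERS_RECOVERED;
}
```

With `ras_in_recovery` set, calling `demo_slot_reset()` leaves `mode1_resets` unchanged; with it cleared, the counter is incremented once per call.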
    • drm/amdgpu: Show deferred error count for UMC · 2c7a1560
      Stanley.Yang authored
      Show the deferred error count in the UMC sysfs node.
      Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
      Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Enable GFXOFF for Compute on GFX11 · 776b0953
      Ori Messinger authored
      On GFX version 11, GFXOFF was disabled due to a MES KIQ firmware
      issue, which has since been fixed after version 64.
      This patch only re-enables GFXOFF for GFX version 11 if the GPU's
      MES KIQ firmware version is newer than version 64.
      
      V2: Keep GFXOFF disabled on GFX11 if MES KIQ is below version 64.
      V3: Add parentheses to avoid GCC warning for parentheses:
      "suggest parentheses around comparison in operand of ‘&’"
      V4: Remove "V3" from commit title
      V5: Change commit description and insert 'Acked-by'
      Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
      Acked-by: Alex Deucher <alexander.deucher@amd.com>
      Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
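
The version gate described in this commit, and the precedence issue behind the V3 note, can be illustrated with a small standalone sketch. This is not the driver code: `DEMO_FW_VERSION_MASK`, its value, and `demo_gfxoff_allowed()` are hypothetical, and treating exactly version 64 as "not new enough" is an assumption that follows the "newer than version 64" wording.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask selecting the version field of a firmware version word. */
#define DEMO_FW_VERSION_MASK 0xfffu
/* Per the commit text, the firmware issue is fixed after version 64. */
#define DEMO_LAST_BROKEN_VERSION 64u

/*
 * Return true if GFXOFF may be re-enabled for compute.
 *
 * The parentheses around the masking are required: in C, relational
 * operators bind tighter than '&', so writing `v & MASK > 64` would be
 * parsed as `v & (MASK > 64)`. GCC warns about this with
 * "suggest parentheses around comparison in operand of '&'".
 */
static bool demo_gfxoff_allowed(uint32_t kiq_fw_version)
{
	return (kiq_fw_version & DEMO_FW_VERSION_MASK) > DEMO_LAST_BROKEN_VERSION;
}
```

Under these assumptions, a version word of 65 passes the gate and 64 does not; bits above the mask do not affect the result.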
    • drm/amdgpu: fix UBSAN array-index-out-of-bounds for ras_block_string[] · 7ed97155
      Yang Wang authored
      Fix an array-index-out-of-bounds issue in the ras_block_string[] array.
      
      Fixes: 30df05fb ("drm/amdgpu: Align ras block enum with firmware")
      Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
      Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
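
The commit text does not say how the fix is implemented. As a general illustration of this class of bug, the sketch below shows an enum that has grown past its name table and a lookup that checks the index before use. All identifiers (`demo_ras_block`, `demo_ras_block_string`, `demo_ras_block_name`) are hypothetical and are not the driver's definitions.

```c
#include <stddef.h>

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical block IDs; one entry was added without updating the table. */
enum demo_ras_block {
	DEMO_RAS_BLOCK_UMC = 0,
	DEMO_RAS_BLOCK_SDMA,
	DEMO_RAS_BLOCK_GFX,
	DEMO_RAS_BLOCK_NEW, /* present in the enum, missing from the table */
	DEMO_RAS_BLOCK_LAST
};

/* Name table indexed by enum value; shorter than the enum on purpose. */
static const char *const demo_ras_block_string[] = {
	"umc",
	"sdma",
	"gfx",
};

/* Look up a block name, returning a placeholder for out-of-range IDs. */
static const char *demo_ras_block_name(unsigned int block)
{
	if (block >= DEMO_ARRAY_SIZE(demo_ras_block_string))
		return "unknown";
	return demo_ras_block_string[block];
}
```

Indexing `demo_ras_block_string[DEMO_RAS_BLOCK_NEW]` directly would read past the end of the array, which is the kind of access UBSAN reports; the checked lookup returns the placeholder instead.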