1. 02 Jul, 2020 32 commits
  2. 01 Jul, 2020 8 commits
      drm/amd/powerplay: Fix NULL dereference in lock_bus() on Vega20 w/o RAS · 78083631
      Ivan Mironov authored
      I updated my system with Radeon VII from kernel 5.6 to kernel 5.7, and
      the following started to happen on each boot:
      
      	...
      	BUG: kernel NULL pointer dereference, address: 0000000000000128
      	...
      	CPU: 9 PID: 1940 Comm: modprobe Tainted: G            E     5.7.2-200.im0.fc32.x86_64 #1
      	Hardware name: System manufacturer System Product Name/PRIME X570-P, BIOS 1407 04/02/2020
      	RIP: 0010:lock_bus+0x42/0x60 [amdgpu]
      	...
      	Call Trace:
      	 i2c_smbus_xfer+0x3d/0xf0
      	 i2c_default_probe+0xf3/0x130
      	 i2c_detect.isra.0+0xfe/0x2b0
      	 ? kfree+0xa3/0x200
      	 ? kobject_uevent_env+0x11f/0x6a0
      	 ? i2c_detect.isra.0+0x2b0/0x2b0
      	 __process_new_driver+0x1b/0x20
      	 bus_for_each_dev+0x64/0x90
      	 ? 0xffffffffc0f34000
      	 i2c_register_driver+0x73/0xc0
      	 do_one_initcall+0x46/0x200
      	 ? _cond_resched+0x16/0x40
      	 ? kmem_cache_alloc_trace+0x167/0x220
      	 ? do_init_module+0x23/0x260
      	 do_init_module+0x5c/0x260
      	 __do_sys_init_module+0x14f/0x170
      	 do_syscall_64+0x5b/0xf0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      	...
      
      The error appears when an i2c device driver tries to probe for devices
      using the adapter registered by `smu_v11_0_i2c_eeprom_control_init()`.
      The code supporting this adapter requires `adev->psp.ras.ras` to be
      non-NULL, which is true only when `amdgpu_ras_init()` detects HW support
      by calling `amdgpu_ras_check_supported()`.
      
      Before 9015d60c, the adapter was registered by
      
      	-> amdgpu_device_ip_init()
      	  -> amdgpu_ras_recovery_init()
      	    -> amdgpu_ras_eeprom_init()
      	      -> smu_v11_0_i2c_eeprom_control_init()
      
      after verifying that `adev->psp.ras.ras` is not NULL in
      `amdgpu_ras_recovery_init()`. Currently it is registered
      unconditionally by
      
      	-> amdgpu_device_ip_init()
      	  -> pp_sw_init()
      	    -> hwmgr_sw_init()
      	      -> vega20_smu_init()
      	        -> smu_v11_0_i2c_eeprom_control_init()
      
      The fix simply adds a HW support check (ras == NULL => no support) before
      calling `smu_v11_0_i2c_eeprom_control_{init,fini}()`.
      
      Please note that there is a chance that a similar fix is also required for
      CHIP_ARCTURUS. I do not know whether any actual Arcturus hardware without
      RAS exists, and whether calling `smu_i2c_eeprom_init()` makes any sense
      when there is no HW support.
      
      Cc: stable@vger.kernel.org
      Fixes: 9015d60c ("drm/amdgpu: Move EEPROM I2C adapter to amdgpu_device")
      Signed-off-by: Ivan Mironov <mironov.ivan@gmail.com>
      Tested-by: Bjorn Nostvold <bjorn.nostvold@gmail.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
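
      The guard described above can be sketched in plain C. This is only an
      illustration under stated assumptions: `struct device_ctx`,
      `eeprom_i2c_init()`, `eeprom_i2c_fini()` and the `*_sketch()` functions
      are hypothetical stand-ins, not the amdgpu code or its real structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real layout. */
struct ras_ctx { int unused; };

struct device_ctx {
    struct ras_ctx *ras;   /* NULL when the hardware has no RAS support */
    bool i2c_registered;   /* tracks whether the adapter was registered */
};

/* Stand-in for smu_v11_0_i2c_eeprom_control_init(). */
static int eeprom_i2c_init(struct device_ctx *dev)
{
    dev->i2c_registered = true;
    return 0;
}

/* Stand-in for smu_v11_0_i2c_eeprom_control_fini(). */
static void eeprom_i2c_fini(struct device_ctx *dev)
{
    dev->i2c_registered = false;
}

/* Register the adapter only when RAS is present (ras == NULL => no support). */
static int smu_init_sketch(struct device_ctx *dev)
{
    if (dev->ras == NULL)
        return 0; /* nothing to register; not treated as an error */
    return eeprom_i2c_init(dev);
}

/* Mirror the same check on teardown so init and fini stay symmetric. */
static void smu_fini_sketch(struct device_ctx *dev)
{
    if (dev->ras != NULL)
        eeprom_i2c_fini(dev);
}
```

      With `ras` left NULL, no adapter is registered, so no i2c client driver
      can later probe through it and reach the code that dereferences `ras`.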
      drm/amdgpu: enable runtime pm on vega10 when noretry=0 · cd527780
      Alex Deucher authored
      The failures with ROCm only happen with noretry=1, so
      enable runtime pm when noretry=0 (the current default).
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      drm/amdgpu: rework runtime pm enablement for BACO · b38c6968
      Alex Deucher authored
      Add a switch statement to simplify asic checks.  Note
      that BACO is not supported on APUs, so there is no
      need to check them.
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
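
      A switch-based check of this kind might look roughly like the sketch
      below. The enum values and the choice of which entries are excluded are
      made up for illustration; the commit message does not list the ASICs
      involved, and `runpm_baco_allowed()` is a hypothetical name.

```c
#include <stdbool.h>

/* Hypothetical ASIC identifiers; the real driver has its own enum. */
enum asic_kind { ASIC_EXAMPLE_A, ASIC_EXAMPLE_B, ASIC_EXAMPLE_C };

/*
 * Decide whether runtime PM via BACO should be enabled.
 * APUs report no BACO support, so they are rejected by the first
 * test and need no case labels of their own.
 */
static bool runpm_baco_allowed(enum asic_kind asic, bool baco_supported)
{
    if (!baco_supported)
        return false;

    switch (asic) {
    case ASIC_EXAMPLE_A:
    case ASIC_EXAMPLE_B:
        return false;   /* hypothetical exclusions, one label per ASIC */
    default:
        return true;    /* everything else with BACO gets runtime PM */
    }
}
```

      Compared with a chain of `!=` comparisons, adding or removing an ASIC
      becomes a one-line change to the case list.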
      drm/amdgpu: call release_firmware() without a NULL check · 75e1658e
      Nirmoy Das authored
      The release_firmware() function is NULL-tolerant, so we do not need
      to check for a NULL parameter before calling it.
      Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
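
      The pattern can be illustrated in userspace C. `fw_release()` below is a
      hypothetical helper written to be NULL-tolerant in the same spirit as
      release_firmware() or free(); it is not the kernel function, and the
      `release_calls` counter exists only to make the behaviour observable.

```c
#include <stdlib.h>

/* Hypothetical firmware blob; not the kernel's struct firmware. */
struct fw_blob { unsigned char *data; };

static int release_calls; /* instrumentation for the example only */

/* NULL-tolerant release: a NULL argument is simply a no-op. */
static void fw_release(struct fw_blob *fw)
{
    release_calls++;
    if (fw == NULL)
        return;
    free(fw->data);
    free(fw);
}

struct engine { struct fw_blob *fw; };

/* The caller needs no "if (e->fw)" guard around the release. */
static void engine_fini(struct engine *e)
{
    fw_release(e->fw);
    e->fw = NULL;
}
```

      Because the callee already handles NULL, the check at each call site is
      redundant and can be dropped without changing behaviour.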
      drm/amdkfd: Fix circular locking dependency warning · d69fd951
      Mukul Joshi authored
      [  150.887733] ======================================================
      [  150.893903] WARNING: possible circular locking dependency detected
      [  150.905917] ------------------------------------------------------
      [  150.912129] kfdtest/4081 is trying to acquire lock:
      [  150.917002] ffff8f7f3762e118 (&mm->mmap_sem#2){++++}, at:
                                       __might_fault+0x3e/0x90
      [  150.924490]
                     but task is already holding lock:
      [  150.930320] ffff8f7f49d229e8 (&dqm->lock_hidden){+.+.}, at:
                                      destroy_queue_cpsch+0x29/0x210 [amdgpu]
      [  150.939432]
                     which lock already depends on the new lock.
      
      [  150.947603]
                     the existing dependency chain (in reverse order) is:
      [  150.955074]
                     -> #3 (&dqm->lock_hidden){+.+.}:
      [  150.960822]        __mutex_lock+0xa1/0x9f0
      [  150.964996]        evict_process_queues_cpsch+0x22/0x120 [amdgpu]
      [  150.971155]        kfd_process_evict_queues+0x3b/0xc0 [amdgpu]
      [  150.977054]        kgd2kfd_quiesce_mm+0x25/0x60 [amdgpu]
      [  150.982442]        amdgpu_amdkfd_evict_userptr+0x35/0x70 [amdgpu]
      [  150.988615]        amdgpu_mn_invalidate_hsa+0x41/0x60 [amdgpu]
      [  150.994448]        __mmu_notifier_invalidate_range_start+0xa4/0x240
      [  151.000714]        copy_page_range+0xd70/0xd80
      [  151.005159]        dup_mm+0x3ca/0x550
      [  151.008816]        copy_process+0x1bdc/0x1c70
      [  151.013183]        _do_fork+0x76/0x6c0
      [  151.016929]        __x64_sys_clone+0x8c/0xb0
      [  151.021201]        do_syscall_64+0x4a/0x1d0
      [  151.025404]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.030977]
                     -> #2 (&adev->notifier_lock){+.+.}:
      [  151.036993]        __mutex_lock+0xa1/0x9f0
      [  151.041168]        amdgpu_mn_invalidate_hsa+0x30/0x60 [amdgpu]
      [  151.047019]        __mmu_notifier_invalidate_range_start+0xa4/0x240
      [  151.053277]        copy_page_range+0xd70/0xd80
      [  151.057722]        dup_mm+0x3ca/0x550
      [  151.061388]        copy_process+0x1bdc/0x1c70
      [  151.065748]        _do_fork+0x76/0x6c0
      [  151.069499]        __x64_sys_clone+0x8c/0xb0
      [  151.073765]        do_syscall_64+0x4a/0x1d0
      [  151.077952]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.083523]
                     -> #1 (mmu_notifier_invalidate_range_start){+.+.}:
      [  151.090833]        change_protection+0x802/0xab0
      [  151.095448]        mprotect_fixup+0x187/0x2d0
      [  151.099801]        setup_arg_pages+0x124/0x250
      [  151.104251]        load_elf_binary+0x3a4/0x1464
      [  151.108781]        search_binary_handler+0x6c/0x210
      [  151.113656]        __do_execve_file.isra.40+0x7f7/0xa50
      [  151.118875]        do_execve+0x21/0x30
      [  151.122632]        call_usermodehelper_exec_async+0x17e/0x190
      [  151.128393]        ret_from_fork+0x24/0x30
      [  151.132489]
                     -> #0 (&mm->mmap_sem#2){++++}:
      [  151.138064]        __lock_acquire+0x11a1/0x1490
      [  151.142597]        lock_acquire+0x90/0x180
      [  151.146694]        __might_fault+0x68/0x90
      [  151.150879]        read_sdma_queue_counter+0x5f/0xb0 [amdgpu]
      [  151.156693]        update_sdma_queue_past_activity_stats+0x3b/0x90 [amdgpu]
      [  151.163725]        destroy_queue_cpsch+0x1ae/0x210 [amdgpu]
      [  151.169373]        pqm_destroy_queue+0xf0/0x250 [amdgpu]
      [  151.174762]        kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
      [  151.180577]        kfd_ioctl+0x223/0x400 [amdgpu]
      [  151.185284]        ksys_ioctl+0x8f/0xb0
      [  151.189118]        __x64_sys_ioctl+0x16/0x20
      [  151.193389]        do_syscall_64+0x4a/0x1d0
      [  151.197569]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.203141]
                     other info that might help us debug this:
      
      [  151.211140] Chain exists of:
                       &mm->mmap_sem#2 --> &adev->notifier_lock --> &dqm->lock_hidden
      
      [  151.222535]  Possible unsafe locking scenario:
      
      [  151.228447]        CPU0                    CPU1
      [  151.232971]        ----                    ----
      [  151.237502]   lock(&dqm->lock_hidden);
      [  151.241254]                                lock(&adev->notifier_lock);
      [  151.247774]                                lock(&dqm->lock_hidden);
      [  151.254038]   lock(&mm->mmap_sem#2);
      
      This commit fixes the warning by ensuring get_user() is not called
      while reading SDMA stats with dqm_lock held, as get_user() could cause a
      page fault, which leads to the circular locking scenario.
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
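
      One way to honour that ordering is to touch user memory before taking
      the lock. The sketch below is a userspace model, not the amdkfd code:
      a boolean stands in for the dqm lock, `user_read()` stands in for
      get_user(), and every name in it is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct dqm_sketch {
    bool locked;          /* stands in for the dqm lock */
    uint64_t hw_counter;  /* state protected by that lock */
};

/* Set if the "may fault" read ever runs while the lock is held. */
static bool faulting_copy_saw_lock;

/* Stand-in for get_user(): may page-fault, so must not run under the lock. */
static uint64_t user_read(const uint64_t *uptr, const struct dqm_sketch *dqm)
{
    if (dqm->locked)
        faulting_copy_saw_lock = true;
    return *uptr;
}

static uint64_t destroy_queue_sketch(struct dqm_sketch *dqm,
                                     const uint64_t *user_counter)
{
    uint64_t val;

    /* 1. Read user memory first, with no lock held. */
    val = user_read(user_counter, dqm);

    /* 2. Only then take the lock and update the protected state. */
    dqm->locked = true;
    dqm->hw_counter += val;
    dqm->locked = false;

    return dqm->hw_counter;
}
```

      With the read hoisted out of the critical section, a page fault can no
      longer try to take mmap_sem while the queue lock is held, which removes
      the dependency edge that lockdep reported.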
      drm/radeon: fix array out-of-bounds read and write issues · 7ee78aff
      Colin Ian King authored
      There is an off-by-one bounds check on the index into arrays
      table->mc_reg_address and table->mc_reg_table_entry[k].mc_data[j] that
      can lead to reads and writes outside of arrays. Fix the bound checking
      off-by-one error.
      
      Addresses-Coverity: ("Out-of-bounds read/write")
      Fixes: cc8dbbb4 ("drm/radeon: add dpm support for CI dGPUs (v2)")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
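
      The class of bug is easy to show in isolation. In the sketch below the
      array size, the struct and `reg_table_append()` are hypothetical and
      only model the shape of the check; they are not taken from the radeon
      source.

```c
#include <stddef.h>

#define REG_ARRAY_SIZE 16 /* hypothetical capacity */

struct reg_table {
    unsigned int addr[REG_ARRAY_SIZE];
    size_t last; /* number of entries in use */
};

/* Append one register address, refusing to write past the array. */
static int reg_table_append(struct reg_table *t, unsigned int addr)
{
    /*
     * Valid indices are 0 .. REG_ARRAY_SIZE - 1. A check written as
     * "t->last > REG_ARRAY_SIZE" would let index REG_ARRAY_SIZE
     * through, one element past the end.
     */
    if (t->last >= REG_ARRAY_SIZE)
        return -1;

    t->addr[t->last++] = addr;
    return 0;
}
```

      Using `>=` rejects the index equal to the array size, which is the
      first out-of-bounds slot.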
      drm/amdgpu: ensure 0 is returned for success in jpeg_v2_5_wait_for_idle · 57f01856
      Colin Ian King authored
      In the cases where adev->jpeg.num_jpeg_inst is zero or the condition
      adev->jpeg.harvest_config & (1 << i) is always non-zero, the variable
      ret is never set to an error condition and the function returns
      an uninitialized value in ret.  Since the only exit condition at
      the end of the function is a success, explicitly return
      0 rather than a potentially uninitialized value in ret.
      
      Addresses-Coverity: ("Uninitialized scalar variable")
      Fixes: 14f43e8f ("drm/amdgpu: move JPEG2.5 out from VCN2.5")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
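
      The control flow can be modelled as follows. This is a simplified
      sketch with hypothetical names (`struct jpeg_sketch`,
      `wait_for_idle_sketch()`), not the jpeg_v2_5 code: the loop may run
      zero times or skip every instance, so the final return must not rely
      on `ret`.

```c
#define MAX_INST 2 /* hypothetical instance count */

struct jpeg_sketch {
    int num_inst;                /* may be zero */
    unsigned int harvest_config; /* bit i set => instance i is skipped */
    int busy[MAX_INST];          /* non-zero => instance i never idles */
};

static int wait_for_idle_sketch(const struct jpeg_sketch *j)
{
    int i, ret;

    for (i = 0; i < j->num_inst; i++) {
        if (j->harvest_config & (1u << i))
            continue; /* harvested instance: ret is not touched */

        ret = j->busy[i] ? -1 : 0;
        if (ret)
            return ret; /* the only path that reports an error */
    }

    /* Reaching this point always means success, so say so explicitly. */
    return 0;
}
```

      Returning the literal 0 makes the result independent of whether the
      loop body ever assigned `ret`.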
      drm/amdgpu: make sure to reserve tmr region on all asics which support it · 6a8987a8
      Alex Deucher authored
      This includes older APUs like renoir.
      Acked-by: Nirmoy Das <nirmoy.das@amd.com>
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>