1. 23 Jul, 2024 33 commits
  2. 22 Jul, 2024 4 commits
  3. 18 Jul, 2024 3 commits
    • drm/xe: Don't suspend device upon wedge · 90936a0a
      Matthew Brost authored
      When wedging a device, we shouldn't suspend it, as the state needed
      for debugging will be lost.
      
      Suspending also appears not to work: the stack trace below shows up
      when trying to resume a wedged device:
      
      [  304.245044] INFO: task cat:12115 blocked for more than 151 seconds.
      [  304.251333]       Tainted: G        W          6.10.0-rc7-xe+ #3518
      [  304.257617] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  304.265459] task:cat             state:D stack:13384 pid:12115 tgid:12115 ppid:3986   flags:0x00000006
      [  304.265465] Call Trace:
      [  304.265467]  <TASK>
      [  304.265469]  __schedule+0x3c4/0xdf0
      [  304.265478]  schedule+0x3c/0x140
      [  304.265481]  rpm_resume+0x1cc/0x740
      [  304.265484]  ? __pfx_autoremove_wake_function+0x10/0x10
      [  304.265489]  __pm_runtime_resume+0x49/0x80
      [  304.265494]  guc_info+0x6b/0xb0 [xe]
      [  304.265538]  ? __pfx___drm_printfn_seq_file+0x10/0x10
      [  304.265541]  ? __pfx___drm_puts_seq_file+0x10/0x10
      [  304.265545]  seq_read_iter+0x111/0x4c0
      [  304.265551]  seq_read+0xfc/0x140
      [  304.265556]  full_proxy_read+0x58/0x80
      [  304.265560]  vfs_read+0xa7/0x360
      [  304.265563]  ? find_held_lock+0x2b/0x80
      [  304.265568]  ksys_read+0x64/0xe0
      [  304.265571]  do_syscall_64+0x68/0x140
      [  304.265575]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
      [  304.265578] RIP: 0033:0x7f4254d14992
      [  304.265580] RSP: 002b:00007ffc558666f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      [  304.265583] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f4254d14992
      [  304.265584] RDX: 0000000000020000 RSI: 00007f4254ebb000 RDI: 0000000000000003
      [  304.265586] RBP: 00007f4254ebb000 R08: 00007f4254eba010 R09: 00007f4254eba010
      [  304.265587] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000022000
      [  304.265588] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
      [  304.265593]  </TASK>
      [  304.265594]
                     Showing all locks held in the system:
      [  304.265598] 1 lock held by khungtaskd/57:
      [  304.265599]  #0: ffffffff8273b860 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x36/0x1c0
      [  304.265607] 3 locks held by kworker/6:1/90:
      [  304.265610] 1 lock held by in:imklog/547:
      [  304.265611]  #0: ffff88810498cd88 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x76/0xc0
      [  304.265620] 1 lock held by dmesg/1310:
      
      v2: Drop local 'err' variable (Jonathan)
      
      Fixes: 8ed9aaae ("drm/xe: Force wedged state and block GT reset upon any GPU hang")
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Signed-off-by: Matthew Brost <matthew.brost@intel.com>
      Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240716063902.1390130-2-matthew.brost@intel.com
      (cherry picked from commit 452bca0e)
      Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    • drm/xe: Wedge the entire device · c9474b72
      Matthew Brost authored
      Wedge the entire device, not just the GT which may have triggered the
      wedge. To implement this, clean up the layering so that
      xe_device_declare_wedged() calls into the lower layers (GT) to ensure
      the entire device is wedged.
      
      While we are here, also signal any pending GT TLB invalidations upon
      wedging the device.
      
      Lastly, short-circuit the reset wait if the device is wedged.
      
      v2:
       - Short-circuit the reset wait if the device is wedged (Local testing)
      
      Fixes: 8ed9aaae ("drm/xe: Force wedged state and block GT reset upon any GPU hang")
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Signed-off-by: Matthew Brost <matthew.brost@intel.com>
      Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240716063902.1390130-1-matthew.brost@intel.com
      (cherry picked from commit 7dbe8af1)
      Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
      c9474b72
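The layering described in this commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the xe driver code: every `toy_*` type and function is hypothetical, a "fence" is reduced to a boolean, and "signalling" a pending TLB invalidation is modelled as clearing that boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_GT     2   /* hypothetical number of GTs per device */
#define TOY_MAX_FENCES 4   /* hypothetical number of tracked invalidations */

/* Hypothetical stand-in for one pending GT TLB invalidation. */
struct toy_tlb_fence {
	bool pending;
};

struct toy_gt {
	bool wedged;
	bool reset_in_progress;
	struct toy_tlb_fence fences[TOY_MAX_FENCES];
};

struct toy_device {
	bool wedged;
	struct toy_gt gt[TOY_MAX_GT];
};

/* Lower layer: wedge one GT and signal its pending invalidations. */
static void toy_gt_declare_wedged(struct toy_gt *gt)
{
	gt->wedged = true;
	for (size_t i = 0; i < TOY_MAX_FENCES; i++)
		gt->fences[i].pending = false;  /* waiters stop waiting */
}

/* Top layer: wedge the device, then call down into every GT. */
static void toy_device_declare_wedged(struct toy_device *dev)
{
	assert(dev != NULL);
	dev->wedged = true;
	for (size_t i = 0; i < TOY_MAX_GT; i++)
		toy_gt_declare_wedged(&dev->gt[i]);
}

/* Count invalidations that are still pending on a GT. */
static size_t toy_pending_fences(const struct toy_gt *gt)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_MAX_FENCES; i++)
		n += gt->fences[i].pending ? 1 : 0;
	return n;
}

/*
 * Returns true if a caller would have to wait for a GT reset.
 * The wait is short-circuited once the device is wedged.
 */
static bool toy_reset_wait_needed(const struct toy_device *dev,
				  const struct toy_gt *gt)
{
	if (dev->wedged)
		return false;
	return gt->reset_in_progress;
}
```

In this model, wedging from the device level is what guarantees that a GT which did not trigger the hang is also marked wedged, and that nothing keeps waiting on a reset or an invalidation that will never complete.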
    • drm/xe/pf: Limit fair VF LMEM provisioning · bf07ca96
      Michal Wajdeczko authored
      Due to the current design of the BO and VRAM manager, any object
      with the XE_BO_FLAG_PINNED flag, which the PF driver uses during VF
      LMEM provisioning, is created with the TTM_PL_FLAG_CONTIGUOUS
      flag. This may cause VRAM fragmentation that prevents subsequent
      allocations of larger objects, such as fair VF LMEM provisioning.
      
      To avoid such failures, round the fair VF LMEM provisioning size
      down to the nearest power of two, to compensate for what
      xe_ttm_vram_mgr does to achieve contiguous allocations.
      
      Fixes: ac6598ae ("drm/xe/pf: Add support to configure SR-IOV VFs")
      Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
      Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
      Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240711192320.1198-2-michal.wajdeczko@intel.com
      Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
      (cherry picked from commit 4c3fe5ea)
      Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
      bf07ca96
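The rounding step described in this commit message can be sketched in plain user-space C. The helper names `toy_round_down_pow2` and `toy_fair_lmem_share` are hypothetical, as is the idea that the fair share is simply the available size divided by the number of VFs; the real driver's accounting may differ. Inside the kernel, `rounddown_pow_of_two()` from `<linux/log2.h>` provides the same rounding.

```c
#include <assert.h>
#include <stdint.h>

/* Largest power of two that is <= n; returns 0 for n == 0. */
static uint64_t toy_round_down_pow2(uint64_t n)
{
	uint64_t p = 1;

	if (n == 0)
		return 0;
	/* Comparing against n / 2 avoids overflowing p on the last shift. */
	while (p <= n / 2)
		p <<= 1;
	return p;
}

/*
 * Hypothetical fair share: split the available LMEM evenly between
 * the VFs, then round the per-VF size down to a power of two.
 */
static uint64_t toy_fair_lmem_share(uint64_t available, unsigned int num_vfs)
{
	assert(num_vfs > 0);
	return toy_round_down_pow2(available / num_vfs);
}
```

For example, with 15 GiB available and 4 VFs, the even split is 3.75 GiB per VF, which this sketch rounds down to 2 GiB. The trade-off is that some memory is left unprovisioned in exchange for sizes that a contiguous allocator is more likely to satisfy.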