1. 24 Sep, 2021 9 commits
    • drm/i915/request: fix early tracepoints · be988eae
      Matthew Auld authored
      Currently we blow up in trace_dma_fence_init when calling into
      get_driver_name or get_timeline_name, since both the engine and context
      might be NULL (or contain some garbage address) in the case of newly
      allocated slab objects via the request ctor. Note that we also use
      SLAB_TYPESAFE_BY_RCU here, which allows requests to be immediately
      freed, but delays freeing the underlying page by an RCU grace period.
      With this scheme requests can be re-allocated at the same time as they
      are being read by some lockless RCU lookup mechanism.
      
      In the ctor case, which is only called for new slab objects (i.e.
      allocate a new page and call the ctor for each object), it's safe to
      reset the context/engine prior to calling into dma_fence_init, since we
      can be certain that no one is doing an RCU lookup which might depend on
      peeking at the engine/context, like in active_engine(), since the
      object can't yet be externally visible.
      
      In the recycled case (which might also be externally visible) the
      request refcount always transitions from 0->1 after we set the
      context/engine etc., which should ensure it's valid to dereference the
      engine, for example when doing an RCU list-walk, so long as we can also
      increment the refcount first. If the refcount is already zero, then the
      request is considered complete/released. If it's non-zero, then the
      request might be in the process of being re-allocated, or potentially
      still in flight. However, after successfully incrementing the refcount,
      it's possible to carefully inspect the request state to determine if
      the request is still what we were looking for. Note that all externally
      visible requests returned to the cache must have zero refcount.
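
      The lookup rule described above (take a reference only if the count
      is non-zero, then re-check that the object is still the one wanted)
      can be sketched in userspace C11 as follows. This is only an
      illustration under stated assumptions: struct demo_request, the seqno
      identity check and all demo_* helpers are hypothetical names, not the
      i915 code, and plain C11 atomics stand in for the kernel's refcount
      and RCU primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a recycled request; not the i915 struct. */
struct demo_request {
	atomic_uint refcount;	/* 0 means released / sitting in the cache */
	unsigned int seqno;	/* identity the reader re-checks after pinning */
};

/* Take a reference only if the count is currently non-zero. */
static bool demo_get_unless_zero(struct demo_request *rq)
{
	unsigned int old = atomic_load(&rq->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&rq->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void demo_put(struct demo_request *rq)
{
	atomic_fetch_sub(&rq->refcount, 1);
}

/*
 * Lockless lookup: pin first, then confirm the object is still the one
 * we wanted, since it may have been recycled in the meantime.
 */
static struct demo_request *demo_lookup(struct demo_request *rq,
					unsigned int wanted_seqno)
{
	if (!demo_get_unless_zero(rq))
		return NULL;	/* already released */

	if (rq->seqno != wanted_seqno) {
		demo_put(rq);	/* recycled for another user */
		return NULL;
	}
	return rq;
}
```

      A caller that gets a non-NULL result holds a reference and is expected
      to drop it with demo_put() when done.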
      
      One possible fix then is to move dma_fence_init out from the request
      ctor. Originally this was how it was done, but it was moved in:
      
      commit 855e39e6
      Author: Chris Wilson <chris@chris-wilson.co.uk>
      Date:   Mon Feb 3 09:41:48 2020 +0000
      
          drm/i915: Initialise basic fence before acquiring seqno
      
      where it looks like intel_timeline_get_seqno() relied on some of the
      rq->fence state, but that is no longer the case since:
      
      commit 12ca695d
      Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Date:   Tue Mar 23 16:49:50 2021 +0100
      
          drm/i915: Do not share hwsp across contexts any more, v8.
      
      intel_timeline_get_seqno() could also be cleaned up slightly by dropping
      the request argument.
      
      Moving dma_fence_init back out of the ctor should ensure we have enough
      of the request initialised in case of trace_dma_fence_init.
      Functionally this should be the same, and is effectively what we were
      already open coding before, except now we also assign the fence->lock
      and fence->ops, but since these are invariant for recycled
      requests (which might be externally visible), and will therefore already
      hold the same value, it shouldn't matter.
      
      An alternative fix, since we don't yet have a fully initialised request
      when in the ctor, is just setting the context/engine as NULL, but this
      does require adding some extra handling in get_driver_name etc.
      
      v2 (Daniel):
        - Try to make the commit message less confusing
      
      Fixes: 855e39e6 ("drm/i915: Initialise basic fence before acquiring seqno")
      Signed-off-by: Matthew Auld <matthew.auld@intel.com>
      Cc: Michael Mason <michael.w.mason@intel.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210921134202.3803151-1-matthew.auld@intel.com
    • drm/i915: Reduce the number of objects subject to memcpy recover · a259cc14
      Thomas Hellström authored
      We really only need memcpy restore for objects that affect the
      operability of the migrate context. That is, primarily the page-table
      objects of the migrate VM.
      
      Add an object flag, I915_BO_ALLOC_PM_EARLY for objects that need early
      restores using memcpy and a way to assign LMEM page-table object flags
      to be used by the vms.
      
      Restore objects without this flag with the gpu blitter and only objects
      carrying the flag using TTM memcpy.
      
      Initially mark the migrate, gt, gtt and vgpu vms to use this flag, and
      defer to a later audit the question of which vms actually need it. Most
      importantly, user-allocated vms with pinned page-table objects can be
      restored using the blitter.
      
      Performance-wise, memcpy restore is probably as fast as gpu restore, if
      not faster, but using gpu restore will help tackle future restrictions
      in mappable LMEM size.
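
      The flag-based choice of restore path described above can be sketched
      as below. This is a minimal illustration, not the driver code: the
      DEMO_BO_ALLOC_PM_EARLY bit value, the enum and demo_pick_restore() are
      hypothetical names introduced for the example.

```c
#include <assert.h>

/* Hypothetical bit; the real I915_BO_ALLOC_PM_EARLY value is not stated here. */
#define DEMO_BO_ALLOC_PM_EARLY (1u << 0)

enum demo_restore_method {
	DEMO_RESTORE_MEMCPY,	/* early restore, before the migrate context works */
	DEMO_RESTORE_BLIT,	/* restore using the gpu blitter */
};

/* Objects carrying the flag use memcpy; everything else uses the blitter. */
static enum demo_restore_method demo_pick_restore(unsigned int flags)
{
	if (flags & DEMO_BO_ALLOC_PM_EARLY)
		return DEMO_RESTORE_MEMCPY;
	return DEMO_RESTORE_BLIT;
}
```

      With this shape, demo_pick_restore(DEMO_BO_ALLOC_PM_EARLY) selects the
      memcpy path and demo_pick_restore(0) selects the blitter path.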
      
      v4:
      - Don't mark the aliasing ppgtt page table flags for early resume, but
        rather the ggtt page table flags as intended. (Matthew Auld)
      - The check for user buffer objects during early resume is pointless, since
        they are never marked I915_BO_ALLOC_PM_EARLY. (Matthew Auld)
      v5:
      - Mark GuC LMEM objects with I915_BO_ALLOC_PM_EARLY to have them restored
        before we fire up the migrate context.
      
      Cc: Matthew Brost <matthew.brost@intel.com>
      Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Reviewed-by: Matthew Auld <matthew.auld@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210922062527.865433-8-thomas.hellstrom@linux.intel.com
    • drm/i915: Don't back up pinned LMEM context images and rings during suspend · 0d8ee5ba
      Thomas Hellström authored
      Pinned context images are now reset during resume. Don't back them up,
      and, assuming that rings are empty at suspend, don't back them up
      either.
      
      Introduce a new object flag, I915_BO_ALLOC_PM_VOLATILE, meaning that an
      object is allowed to lose its content on suspend.
      
      v3:
      - Slight documentation clarification (Matthew Auld)
      Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Reviewed-by: Matthew Auld <matthew.auld@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210922062527.865433-7-thomas.hellstrom@linux.intel.com
    • drm/i915/gt: Register the migrate contexts with their engines · 3e42cc61
      Thomas Hellström authored
      Pinned contexts, like the migrate contexts, need a reset after resume
      since their context image may have been lost. Also the GuC needs to
      register pinned contexts.
      
      Add a list to struct intel_engine_cs where we add all pinned contexts on
      creation, and traverse that list at resume time to reset the pinned
      contexts.
      
      This fixes the kms_pipe_crc_basic@suspend-read-crc-pipe-a selftest for now,
      but proper LMEM backup / restore is needed for full suspend functionality.
      However, note that even with full LMEM backup / restore it may be
      desirable to keep the reset since backing up the migrate context images
      must happen using memcpy() after the migrate context has become inactive,
      and for performance- and other reasons we want to avoid memcpy() from
      LMEM.
      
      Also traverse the list at guc_init_lrc_mapping(), calling
      guc_kernel_context_pin() for the pinned contexts, as is already done
      for the kernel context.
      
      v2:
      - Don't reset the contexts on each __engine_unpark() but rather at
        resume time (Chris Wilson).
      v3:
      - Reset contexts in the engine sanitize callback. (Chris Wilson)
      
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Brost Matthew <matthew.brost@intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Reviewed-by: Matthew Auld <matthew.auld@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210922062527.865433-6-thomas.hellstrom@linux.intel.com
    • drm/i915 Implement LMEM backup and restore for suspend / resume · c56ce956
      Thomas Hellström authored
      Just evict unpinned objects to system. For pinned LMEM objects,
      make a backup system object and blit the contents to that.
      
      Backup is performed in three steps:
      1: Opportunistically evict evictable objects using the gpu blitter.
      2: After gt idle, evict evictable objects using the gpu blitter. This will
      be modified in an upcoming patch to back up pinned objects that are not used
      by the blitter itself.
      3: Back up remaining pinned objects using memcpy.
      
      Also move uC suspend to after 2) to make sure we have a functional GuC
      during 2) if using GuC submission.
      
      v2:
      - Major refactor to make sure gem_exec_suspend@hang-SX subtests work, and
        suspend / resume works with a slightly modified GuC submission enabling
        patch series.
      
      v3:
      - Fix a potential use-after-free (Matthew Auld)
      - Use i915_gem_object_create_shmem() instead of
        i915_gem_object_create_region (Matthew Auld)
      - Minor simplifications (Matthew Auld)
      - Fix up kerneldoc for i915_ttm_restore_region().
      - Final lmem_suspend() call moved to i915_gem_backup_suspend from
        i915_gem_suspend_late, since the latter gets called at driver unload
        and we don't unnecessarily want to run it at that time.
      
      v4:
      - Interface change of ttm- & lmem suspend / resume functions to use
        flags rather than bools. (Matthew Auld)
      - Completely drop the i915_gem_backup_suspend change (Matthew Auld)
      Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Reviewed-by: Matthew Auld <matthew.auld@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210922062527.865433-5-thomas.hellstrom@linux.intel.com
    • drm/i915/gt: Increase suspend timeout · 81387fc4
      Thomas Hellström authored
      With GuC submission on DG1, the execution of the requests times out
      for the gem_exec_suspend igt test case after executing around 800-900
      of 1000 submitted requests.
      
      Given the time we allow elsewhere for fences to signal (in the order of
      seconds), increase the timeout before we mark the gt wedged and proceed.
      Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Acked-by: Matthew Auld <matthew.auld@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210922062527.865433-4-thomas.hellstrom@linux.intel.com
    • drm/i915/gem: Implement a function to process all gem objects of a region · d80ee88e
      Thomas Hellström authored
      An upcoming common pattern is to traverse the region object list and
      perform certain actions on all objects in a region. It's a little tricky
      to get the list locking right, in particular since a gem object may
      change region unless it's pinned or the object lock is held.
      
      Define a function that does this for us and that takes an argument that
      defines the action to be performed on each object.
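
      The shape of such a helper, a walker that takes the per-object action
      as an argument, can be sketched as below. This is a simplified
      userspace illustration under stated assumptions: struct demo_object,
      struct demo_region_ops and the demo_* functions are hypothetical
      names, an array stands in for the region object list, and the list
      locking that makes the real function tricky is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a gem object; not the i915 structure. */
struct demo_object {
	int pinned;
};

/* The per-object action is supplied by the caller. */
struct demo_region_ops {
	/* Return 0 to continue, non-zero to stop the walk with that error. */
	int (*process_obj)(struct demo_region_ops *ops,
			   struct demo_object *obj);
	void *data;	/* caller-private state */
};

/*
 * Walk every object and apply ops->process_obj. Locking and objects
 * changing region during the walk are deliberately not modelled here.
 */
static int demo_region_process(struct demo_object *objs, size_t count,
			       struct demo_region_ops *ops)
{
	for (size_t i = 0; i < count; i++) {
		int ret = ops->process_obj(ops, &objs[i]);

		if (ret)
			return ret;
	}
	return 0;
}

/* Example action: count the pinned objects. */
static int demo_count_pinned(struct demo_region_ops *ops,
			     struct demo_object *obj)
{
	int *pinned = ops->data;

	if (obj->pinned)
		(*pinned)++;
	return 0;
}
```

      Keeping the action in an ops structure means new per-object
      operations can reuse the same walk without duplicating it.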
      
      v3:
      - Improve structure documentation a bit (Matthew Auld)
      Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Reviewed-by: Matthew Auld <matthew.auld@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210922062527.865433-3-thomas.hellstrom@linux.intel.com
    • drm/i915/ttm: Implement a function to copy the contents of two TTM-based objects · 0d938863
      Thomas Hellström authored
      When backing up or restoring contents of pinned objects at suspend /
      resume time we need to allocate a new object as the backup. Add a function
      to facilitate copies between the two. Some data needs to be copied before
      the migration context is ready for operation, so make sure we can
      disable accelerated copies.
      
      v2:
      - Fix a missing return value check (Matthew Auld)
      Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Reviewed-by: Matthew Auld <matthew.auld@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210922062527.865433-2-thomas.hellstrom@linux.intel.com
    • drm/i915/gem: Fix a lockdep warning the __i915_gem_is_lmem() function · 2dfa597d
      Thomas Hellström authored
      Somehow we managed to invert the test for i915_gem_object_evictable(),
      which causes a warning in DG1 BAT, igt@debugfs_test@read_all_entries.
      
      Fix the lock check to only warn if the object *is* indeed evictable and
      not protected from eviction by fences.
      
      Cc: Matthew Brost <matthew.brost@intel.com>
      Fixes: 91160c83 ("drm/i915: Take pinning into account in __i915_gem_object_is_lmem")
      Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Reviewed-by: Matthew Auld <matthew.auld@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210922083807.888206-2-thomas.hellstrom@linux.intel.com
  2. 23 Sep, 2021 2 commits
  3. 22 Sep, 2021 7 commits
  4. 21 Sep, 2021 2 commits
  5. 20 Sep, 2021 8 commits
  6. 19 Sep, 2021 4 commits
  7. 17 Sep, 2021 1 commit
    • kernel/locking: Add context to ww_mutex_trylock() · 12235da8
      Maarten Lankhorst authored
      i915 will soon gain an eviction path that trylocks a whole lot of locks
      for eviction, getting dmesg failures like below:
      
        BUG: MAX_LOCK_DEPTH too low!
        turning off the locking correctness validator.
        depth: 48  max: 48!
        48 locks held by i915_selftest/5776:
         #0: ffff888101a79240 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x88/0x160
         #1: ffffc900009778c0 (reservation_ww_class_acquire){+.+.}-{0:0}, at: i915_vma_pin.constprop.63+0x39/0x1b0 [i915]
         #2: ffff88800cf74de8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.63+0x5f/0x1b0 [i915]
         #3: ffff88810c7f9e38 (&vm->mutex/1){+.+.}-{3:3}, at: i915_vma_pin_ww+0x1c4/0x9d0 [i915]
         #4: ffff88810bad5768 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_gem_evict_something+0x110/0x860 [i915]
         #5: ffff88810bad60e8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_gem_evict_something+0x110/0x860 [i915]
        ...
         #46: ffff88811964d768 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_gem_evict_something+0x110/0x860 [i915]
         #47: ffff88811964e0e8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_gem_evict_something+0x110/0x860 [i915]
        INFO: lockdep is turned off.
      
      Fixing eviction to nest into ww_class_acquire is a high priority, but
      it requires a rework of the entire driver, which can only be done one
      step at a time.
      
      As an intermediate solution, add an acquire context to
      ww_mutex_trylock, which allows us to do proper nesting annotations on
      the trylocks, making the above lockdep splat disappear.
      
      This is also useful in regulator_lock_nested, which may avoid dropping
      regulator_nesting_mutex in the uncontended path, so use it there.
      
      TTM may be another user for this, where we could lock a buffer in a
      fastpath with list locks held, without dropping all locks we hold.
      
      [peterz: rework actual ww_mutex_trylock() implementations]
      Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/YUBGPdDDjKlxAuXJ@hirez.programming.kicks-ass.net
  8. 16 Sep, 2021 1 commit
  9. 15 Sep, 2021 3 commits
    • drm/i915: Mark GPU wedging on driver unregister unrecoverable · dc34ca92
      Janusz Krzysztofik authored
      The GPU wedged flag, now set on driver unregister to prevent further
      use of the GPU, can then be cleared unintentionally by a call to
      __intel_gt_unset_wedged() before the flag is finally marked
      unrecoverable.  We need to have it marked unrecoverable earlier.
      Implement that by replacing a call to intel_gt_set_wedged() in
      intel_gt_driver_unregister() with intel_gt_set_wedged_on_fini().
      
      With the above in place, intel_gt_set_wedged_on_fini() is now called
      twice on driver remove, the second time from __intel_gt_disable().  This
      seems harmless, whereas dropping intel_gt_set_wedged_on_fini() from
      __intel_gt_disable() proved to break some driver probe error unwind
      paths as well as the mock selftest exit path.
      Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
      Cc: Michał Winiarski <michal.winiarski@intel.com>
      Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
      Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210903142837.216978-1-janusz.krzysztofik@linux.intel.com
    • drm/i915: Add mmap lock around vma_lookup() in the mman selftest. · ce079f6d
      Maarten Lankhorst authored
      Add mmap_read_lock/unlock around vma_lookup(). The core code requires
      this for lookups. Since we only check if the return value is NULL,
      we can immediately unlock.
      
      This fixes the following splat in the selftest:
      
      i915: Running i915_gem_mman_live_selftests/igt_mmap
      ------------[ cut here ]------------
      WARNING: CPU: 3 PID: 5654 at include/linux/mmap_lock.h:164 find_vma+0x4e/0xb0
      Modules linked in: i915(+) vgem fuse snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_intel_dspcfg snd_hda_codec snd_hwdep e1000e snd_hda_core ptp snd_pcm ttm mei_me pps_core i2c_i801 prime_numbers i2c_smbus mei [last unloaded: i915]
      CPU: 3 PID: 5654 Comm: i915_selftest Tainted: G     U            5.15.0-rc1-CI-Trybot_7984+ #1
      Hardware name: Micro-Star International Co., Ltd. MS-7B54/Z370M MORTAR (MS-7B54), BIOS 1.00 10/31/2017
      RIP: 0010:find_vma+0x4e/0xb0
      Code: de 48 89 ef e8 d3 94 fe ff 48 85 c0 74 34 48 83 c4 08 5b 5d c3 48 8d bf 28 01 00 00 be ff ff ff ff e8 d6 46 8b 00 85 c0 75 c8 <0f> 0b 48 8b 85 b8 00 00 00 48 85 c0 75 c6 48 89 ef e8 12 26 87 00
      RSP: 0018:ffffc900013df980 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 00007f9df2b80000 RCX: 0000000000000000
      RDX: 0000000000000001 RSI: ffffffff822e314c RDI: ffffffff8233c83f
      RBP: ffff88811bafc840 R08: ffff888107d0ddb8 R09: 00000000fffffffe
      R10: 0000000000000001 R11: 00000000ffbae7ba R12: 0000000000000000
      R13: 0000000000000000 R14: ffff88812a710000 R15: ffff888114fa42c0
      FS:  00007f9def9d4c00(0000) GS:ffff888266580000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f799627fe50 CR3: 000000011bbc2006 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       __igt_mmap+0xe0/0x490 [i915]
       igt_mmap+0xd2/0x160 [i915]
       ? __trace_bprintk+0x6e/0x80
       __i915_subtests.cold.7+0x42/0x92 [i915]
       ? i915_perf_selftests+0x20/0x20 [i915]
       ? __i915_nop_setup+0x10/0x10 [i915]
       __run_selftests.part.3+0x10d/0x172 [i915]
       i915_live_selftests.cold.5+0x1f/0x47 [i915]
       i915_pci_probe+0x93/0x1d0 [i915]
      Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Closes: https://gitlab.freedesktop.org/drm/intel/issues/4129
      Link: https://patchwork.freedesktop.org/patch/msgid/20210915105946.394412-1-maarten.lankhorst@linux.intel.com
      Reviewed-by: Matthew Auld <matthew.auld@intel.com>
    • Merge drm/drm-next into drm-intel-gt-next · d5dd580d
      Joonas Lahtinen authored
      Close the divergence which has caused patches not to apply and
      have a solid baseline for the PXP patches that Rodrigo will send
      a topic branch PR for.
      Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
  10. 14 Sep, 2021 3 commits