Commit 83321094 authored by Matthew Brost, committed by John Harrison

drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis

Add a delay, configurable via debugfs (default 34ms), to disable
scheduling of a context after the pin count goes to zero. Disabling
scheduling is a costly operation as it requires synchronizing with
the GuC, so the idea is that the delay gives the user a chance to
resubmit something before we perform this operation. The delay is only
applied if the context isn't closed and fewer than a given threshold
(default 3/4) of the guc_ids are in use.
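
For orientation, here is a minimal sketch of the mechanism described
above. All toy_* names and fields are illustrative assumptions, not the
actual i915 code: the last unpin queues a delayed worker instead of
sending the H2G immediately, a resubmission before the delay expires
cancels the worker, and the delay is bypassed when too many guc_ids are
already in use.

/* All toy_* names are illustrative assumptions, not actual i915 symbols. */
#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

struct toy_guc {
	unsigned int guc_ids_in_use;
	unsigned int sched_disable_gucid_threshold;
	unsigned int sched_disable_delay_ms;
};

struct toy_context {
	struct toy_guc *guc;
	struct delayed_work sched_disable_delay_work;
};

/* Stand-in for the costly H2G SCHED_CONTEXT_MODE_SET(disable) handshake. */
static void toy_send_sched_disable_h2g(struct toy_context *ce)
{
}

static void toy_sched_disable_worker(struct work_struct *wrk)
{
	struct toy_context *ce =
		container_of(wrk, struct toy_context, sched_disable_delay_work.work);

	/* The delay expired with no new submission: do the real disable now. */
	toy_send_sched_disable_h2g(ce);
}

static void toy_context_init(struct toy_context *ce, struct toy_guc *guc)
{
	ce->guc = guc;
	INIT_DELAYED_WORK(&ce->sched_disable_delay_work, toy_sched_disable_worker);
}

/* Pin count hit zero: defer the disable unless guc_ids are running low. */
static void toy_context_unpinned(struct toy_context *ce)
{
	struct toy_guc *guc = ce->guc;

	if (guc->guc_ids_in_use > guc->sched_disable_gucid_threshold) {
		toy_send_sched_disable_h2g(ce);	/* no headroom, disable now */
		return;
	}

	mod_delayed_work(system_unbound_wq, &ce->sched_disable_delay_work,
			 msecs_to_jiffies(guc->sched_disable_delay_ms));
}

/* A resubmission before the delay expires simply cancels the pending disable. */
static void toy_context_resubmitted(struct toy_context *ce)
{
	cancel_delayed_work(&ce->sched_disable_delay_work);
}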

Alan Previn: Matt Brost first introduced this patch back in Oct 2021.
However, no real-world workload with a measured performance impact was
available at the time to prove the intended results. Today, this series
is being republished in response to a real-world workload that benefited
greatly from it, along with a measured performance improvement.

Workload description: 36 containers were created on a DG2 device where
each container was performing a combination of 720p 3d game rendering
and 30fps video encoding. The workload density was configured in a way
that guaranteed each container to ALWAYS be able to render and
encode no less than 30fps with a predefined maximum render + encode
latency time. That means the totality of all 36 containers and their
workloads was not saturating the engines to their max (in order to
maintain just enough headroom to meet the min fps and max latencies
of incoming container submissions).

Problem statement: It was observed that the CPU core processing the i915
soft IRQ work was experiencing severe load. Using tracelogs and an
instrumentation patch to count specific i915 IRQ events, it was confirmed
that the majority of the CPU cycles were caused by the
gen11_other_irq_handler() -> guc_irq_handler() code path. The vast
majority of these cycles were spent processing a specific G2H
IRQ, i.e. INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_DONE. These IRQs are sent
by GuC in response to i915 KMD sending H2G requests:
INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET. Those H2G requests are sent
whenever a context goes idle so that we can unpin the context from GuC.
The high CPU utilization % symptom was limiting density scaling.

Root Cause Analysis: Because the incoming execution buffers were spread
across 36 different containers (each with multiple contexts) but the
system in totality was NOT saturated to the max, it was assumed that each
context was constantly idling between submissions. This caused a
thrash of contexts being unpinned from the GuC one moment, only to be
repinned the very next moment due to the incoming workload. These
event-pairs were being triggered across multiple contexts per container,
across all containers at the rate of > 30 times per sec per context.

Metrics: When running this workload without this patch, we measured an
average of ~69K INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_DONE events every 10
seconds or ~10 million times over ~25+ mins. With this patch, the count
reduced to ~480 every 10 seconds or about ~28K over ~10 mins. The
improvement observed is ~99% for the average counts per 10 seconds.

Design awareness: Selftest impact.
As a temporary workaround, disable this feature for the selftests.
Selftests are very timing sensitive, and any change in timing can cause
failure. A follow-up patch will fix up the selftests to account for this
delay.

Design awareness: Race between guc_request_alloc and guc_context_close.
If a context close is issued while there is a request submission in
flight and a delayed schedule disable is pending, guc_context_close
and guc_request_alloc will race to cancel the delayed disable.
To close the race, make sure that guc_request_alloc waits for
guc_context_close to finish running before checking any state.
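
A hedged sketch of that serialization idea follows; the toy_* names,
fields, and locking details are assumptions for illustration, not the
actual i915 implementation. Taking and dropping the same lock used by
the close path acts as a barrier before the allocation path inspects
the context state.

/* Hedged sketch of the serialization idea; toy_* names are assumptions. */
#include <linux/spinlock.h>
#include <linux/workqueue.h>
#include <linux/bitops.h>
#include <linux/errno.h>

#define TOY_CONTEXT_CLOSED_BIT	0

struct toy_ce {
	spinlock_t guc_state_lock;
	unsigned long flags;
	struct delayed_work sched_disable_delay_work;
};

/* Stand-in for the immediate schedule-disable H2G. */
static void toy_send_sched_disable_h2g(struct toy_ce *ce)
{
}

static void toy_context_close(struct toy_ce *ce)
{
	unsigned long flags;

	spin_lock_irqsave(&ce->guc_state_lock, flags);
	set_bit(TOY_CONTEXT_CLOSED_BIT, &ce->flags);

	/* If a delayed disable was pending, perform the disable right away. */
	if (cancel_delayed_work(&ce->sched_disable_delay_work))
		toy_send_sched_disable_h2g(ce);

	spin_unlock_irqrestore(&ce->guc_state_lock, flags);
}

static int toy_request_alloc(struct toy_ce *ce)
{
	unsigned long flags;

	/*
	 * Take and drop the same lock so that a concurrent toy_context_close()
	 * has finished its critical section before any state is inspected.
	 */
	spin_lock_irqsave(&ce->guc_state_lock, flags);
	spin_unlock_irqrestore(&ce->guc_state_lock, flags);

	if (test_bit(TOY_CONTEXT_CLOSED_BIT, &ce->flags))
		return -ENOENT;

	/* Context still open: a fresh submission cancels any pending disable. */
	cancel_delayed_work(&ce->sched_disable_delay_work);
	return 0;
}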

Design awareness: GT Reset event.
If a GT reset is triggered, add a preparation step that ensures every
context with a pending delayed disable-scheduling task has that task
flushed. Move such contexts directly into the closed state after
cancelling the worker. This is okay because the existing flow flushes
all yet-to-arrive G2Hs, dropping them anyway.
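
An illustrative sketch of that reset preparation step is below; the
toy_* types, the context_list name, and the helper are assumptions, not
the actual i915 code. It simply walks the contexts, cancels any pending
delayed disable, and marks those contexts closed.

/* Illustrative sketch (assumed toy_* names) of the GT-reset preparation step. */
#include <linux/list.h>
#include <linux/workqueue.h>
#include <linux/bitops.h>

#define TOY_CONTEXT_CLOSED_BIT	0

struct toy_ce {
	struct list_head link;
	unsigned long flags;
	struct delayed_work sched_disable_delay_work;
};

struct toy_guc {
	struct list_head context_list;
};

static void toy_flush_delayed_disables_for_reset(struct toy_guc *guc)
{
	struct toy_ce *ce, *tmp;

	list_for_each_entry_safe(ce, tmp, &guc->context_list, link) {
		/* Cancel a pending delayed schedule-disable, if any... */
		if (cancel_delayed_work(&ce->sched_disable_delay_work))
			/*
			 * ...and move the context straight to the closed state.
			 * This is safe because the reset flow drops all
			 * yet-to-arrive G2Hs anyway.
			 */
			set_bit(TOY_CONTEXT_CLOSED_BIT, &ce->flags);
	}
}
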
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221006225121.826257-2-alan.previn.teres.alexis@intel.com
parent befb231d
@@ -1452,7 +1452,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
 		int err;
 
 		/* serialises with execbuf */
-		set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
+		intel_context_close(ce);
 		if (!intel_context_pin_if_active(ce))
 			continue;
@@ -276,6 +276,14 @@ static inline bool intel_context_is_barrier(const struct intel_context *ce)
 	return test_bit(CONTEXT_BARRIER_BIT, &ce->flags);
 }
 
+static inline void intel_context_close(struct intel_context *ce)
+{
+	set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
+
+	if (ce->ops->close)
+		ce->ops->close(ce);
+}
+
 static inline bool intel_context_is_closed(const struct intel_context *ce)
 {
 	return test_bit(CONTEXT_CLOSED_BIT, &ce->flags);
@@ -43,6 +43,8 @@ struct intel_context_ops {
 	void (*revoke)(struct intel_context *ce, struct i915_request *rq,
 		       unsigned int preempt_timeout_ms);
 
+	void (*close)(struct intel_context *ce);
+
 	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww, void **vaddr);
 	int (*pin)(struct intel_context *ce, void *vaddr);
 	void (*unpin)(struct intel_context *ce);
@@ -208,6 +210,11 @@ struct intel_context {
 		 * each priority bucket
 		 */
 		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
+		/**
+		 * @sched_disable_delay_work: worker to disable scheduling on this
+		 * context
+		 */
+		struct delayed_work sched_disable_delay_work;
 	} guc_state;
 
 	struct {
@@ -112,6 +112,10 @@ struct intel_guc {
 		 * refs
 		 */
 		struct list_head guc_id_list;
+		/**
+		 * @guc_ids_in_use: Number single-lrc guc_ids in use
+		 */
+		unsigned int guc_ids_in_use;
 		/**
 		 * @destroyed_contexts: list of contexts waiting to be destroyed
 		 * (deregistered with the GuC)
@@ -132,6 +136,16 @@ struct intel_guc {
 		 * @reset_fail_mask: mask of engines that failed to reset
 		 */
 		intel_engine_mask_t reset_fail_mask;
+		/**
+		 * @sched_disable_delay_ms: schedule disable delay, in ms, for
+		 * contexts
+		 */
+		unsigned int sched_disable_delay_ms;
+		/**
+		 * @sched_disable_gucid_threshold: threshold of min remaining available
+		 * guc_ids before we start bypassing the schedule disable delay
+		 */
+		unsigned int sched_disable_gucid_threshold;
 	} submission_state;
 
 	/**
@@ -466,4 +480,6 @@ void intel_guc_write_barrier(struct intel_guc *guc);
 void intel_guc_dump_time_info(struct intel_guc *guc, struct drm_printer *p);
 
+int intel_guc_sched_disable_gucid_threshold_max(struct intel_guc *guc);
+
 #endif
@@ -71,12 +71,73 @@ static bool intel_eval_slpc_support(void *data)
 	return intel_guc_slpc_is_used(guc);
 }
 
+static int guc_sched_disable_delay_ms_get(void *data, u64 *val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	*val = (u64)guc->submission_state.sched_disable_delay_ms;
+
+	return 0;
+}
+
+static int guc_sched_disable_delay_ms_set(void *data, u64 val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	/* clamp to a practical limit, 1 minute is reasonable for a longest delay */
+	guc->submission_state.sched_disable_delay_ms = min_t(u64, val, 60000);
+
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_delay_ms_fops,
+			guc_sched_disable_delay_ms_get,
+			guc_sched_disable_delay_ms_set, "%lld\n");
+
+static int guc_sched_disable_gucid_threshold_get(void *data, u64 *val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	*val = guc->submission_state.sched_disable_gucid_threshold;
+	return 0;
+}
+
+static int guc_sched_disable_gucid_threshold_set(void *data, u64 val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	if (val > intel_guc_sched_disable_gucid_threshold_max(guc))
+		guc->submission_state.sched_disable_gucid_threshold =
+			intel_guc_sched_disable_gucid_threshold_max(guc);
+	else
+		guc->submission_state.sched_disable_gucid_threshold = val;
+
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_gucid_threshold_fops,
+			guc_sched_disable_gucid_threshold_get,
+			guc_sched_disable_gucid_threshold_set, "%lld\n");
+
 void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
 {
 	static const struct intel_gt_debugfs_file files[] = {
 		{ "guc_info", &guc_info_fops, NULL },
 		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
 		{ "guc_slpc_info", &guc_slpc_info_fops, &intel_eval_slpc_support},
+		{ "guc_sched_disable_delay_ms", &guc_sched_disable_delay_ms_fops, NULL },
+		{ "guc_sched_disable_gucid_threshold", &guc_sched_disable_gucid_threshold_fops,
+		  NULL },
 	};
 
 	if (!intel_guc_is_supported(guc))
@@ -92,12 +92,14 @@ int __i915_subtests(const char *caller,
 					    T, ARRAY_SIZE(T), data)
 
 #define i915_live_subtests(T, data) ({ \
 	typecheck(struct drm_i915_private *, data); \
+	(data)->gt[0]->uc.guc.submission_state.sched_disable_delay_ms = 0; \
 	__i915_subtests(__func__, \
 			__i915_live_setup, __i915_live_teardown, \
 			T, ARRAY_SIZE(T), data); \
 })
 
 #define intel_gt_live_subtests(T, data) ({ \
 	typecheck(struct intel_gt *, data); \
+	(data)->uc.guc.submission_state.sched_disable_delay_ms = 0; \
 	__i915_subtests(__func__, \
 			__intel_gt_live_setup, __intel_gt_live_teardown, \
 			T, ARRAY_SIZE(T), data); \
 })