---
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Aggregated Value Stream Analytics
> - [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/335391) in GitLab 14.7.
DISCLAIMER:
This page contains information related to upcoming products, features, and functionality.
It is important to note that the information presented is for informational purposes only.
Please do not rely on this information for purchasing or planning purposes.
As with all projects, the items mentioned on this page are subject to change or delay.
The development, release, and timing of any products, features, or functionality remain at the
sole discretion of GitLab Inc.
This page provides a high-level overview of the aggregated backend for
Value Stream Analytics (VSA).
## Current status
As of GitLab 14.8, the aggregated VSA backend is used only in the `gitlab-org` group for testing
purposes. We plan to gradually roll it out to the rest of the groups in the next major release (15.0).
## Motivation
The aggregated backend aims to solve the performance limitations of the VSA feature and set it up
for long-term growth.
Our main database is not prepared for analytical workloads. Executing long-running queries can
affect the reliability of the application. For large groups, the current
implementation (old backend) is slow and, in some cases, doesn't even load due to the configured
statement timeout (15s).
The database queries in the old backend use the core domain models directly through the
`IssuableFinder` classes ([`MergeRequestsFinder`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/finders/merge_requests_finder.rb) and [`IssuesFinder`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/finders/issues_finder.rb)).
With the requested change to the [date range filters](https://gitlab.com/groups/gitlab-org/-/epics/6046),
this approach was no longer viable from a performance point of view.
Benefits of the aggregated VSA backend:
- Simpler database queries (fewer JOINs).
- Faster aggregations, because only a single table is accessed.
- Possibility to introduce further aggregations for improving the first page load time.
- Better performance for large groups (with many subgroups, projects, issues, and merge requests).
- Ready for database decomposition. The VSA-related database tables could live in a separate
  database with minimal development effort.
- Ready for keyset pagination, which can be useful for exporting the data.
- Possibility to implement more complex event definitions. For example, the start event could
  combine two timestamp columns, where the earliest value would be used.
### Example configuration

In this example, there are two independent value streams set up for two teams that use
different development workflows within the `Test Group` (top-level namespace).
The first value stream uses standard timestamp-based events for defining the stages. The second
value stream uses label events.
Each value stream and stage item from the example is persisted in the database. Notice that
the `Deployment` stage is identical in both value streams, which means the underlying
`stage_event_hash_id` is the same for both stages. The `stage_event_hash_id` reduces
the amount of data the backend collects and plays a vital role in database partitioning.
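For illustration, the aggregated event data could be stored in a structure similar to the
following sketch. This is a simplified, assumed column list, not the authoritative schema; the
actual table definitions live in the GitLab codebase.

```sql
-- Simplified sketch of an aggregated stage event table (columns are illustrative).
-- Because stage_event_hash_id identifies the stage configuration, identical stages
-- (such as the shared Deployment stage above) map to the same hash partition.
CREATE TABLE analytics_cycle_analytics_issue_stage_events (
  stage_event_hash_id bigint NOT NULL,
  issue_id bigint NOT NULL,
  group_id bigint NOT NULL,
  project_id bigint NOT NULL,
  start_event_timestamp timestamptz NOT NULL,
  end_event_timestamp timestamptz, -- NULL while the issue is still in the stage
  PRIMARY KEY (stage_event_hash_id, issue_id)
) PARTITION BY HASH (stage_event_hash_id);
```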
We expect value streams and stages to change rarely. When stages (their start and end events) are
changed, the aggregated data becomes stale. This is fixed by the periodic aggregation, which runs
every day.
### Feature availability
The aggregated VSA feature is available at the group and project level; however, the aggregated
backend is only available for Premium and Ultimate customers due to data storage and data
computation costs. Storing denormalized, aggregated data requires significant disk space.
## Aggregated value stream analytics architecture
The main idea behind the aggregated VSA backend is separation: VSA database tables and queries do
not use the core domain models directly (Issue, MergeRequest). This allows us to scale and
optimize VSA independently from the other parts of the application.
The architecture consists of two main mechanisms:
- Periodical data collection and loading (happens in the background).
- Querying the collected data (invoked by the user).
### Data loading
The aggregated nature of VSA comes from the periodic data loading. The system queries the core
domain models to collect the stage and timestamp data. This data is periodically inserted into the
VSA database tables.
High-level overview of the process for each top-level namespace with a Premium or Ultimate license:
1. Load all stages in the group.
1. Iterate over the issue and merge request records.
1. Based on the stage configurations (start and end event identifiers), collect the timestamp data.
1. `INSERT` or `UPDATE` the data in the VSA database tables (a simplified sketch of this upsert
   follows the list).
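The upsert step could look similar to the following sketch. The column list and values are
illustrative; the actual logic is implemented in the `DataLoaderService` described below.

```sql
-- Illustrative batched upsert into the aggregated issue stage event table.
-- When a (stage_event_hash_id, issue_id) pair already exists, the timestamps
-- are refreshed instead of inserting a duplicate row.
INSERT INTO analytics_cycle_analytics_issue_stage_events
  (stage_event_hash_id, issue_id, group_id, project_id,
   start_event_timestamp, end_event_timestamp)
VALUES
  (16, 1, 20, 40, '2022-01-01 10:00:00', '2022-01-05 12:00:00')
ON CONFLICT (stage_event_hash_id, issue_id)
DO UPDATE SET
  start_event_timestamp = EXCLUDED.start_event_timestamp,
  end_event_timestamp = EXCLUDED.end_event_timestamp;
```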
The data loading is implemented within the [`Analytics::CycleAnalytics::DataLoaderService`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/analytics/cycle_analytics/data_loader_service.rb)
class. There are groups containing a lot of data, so to avoid overloading the primary database,
the service performs operations in batches and enforces strict application limits:
- Load records in batches.
- Insert records in batches.
- Stop processing when a limit is reached, and schedule a background job to continue the
  processing later.
- Continue processing data from a specific point (see the sketch after this list).
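One way to picture the continuation point is a per-namespace cursor that records how far the
last run got. The table and column names below are hypothetical, purely for illustration:

```sql
-- Hypothetical sketch: persist the last processed position so a rescheduled
-- background job can resume the batched iteration instead of starting over.
UPDATE vsa_aggregation_cursors -- hypothetical table name
SET last_updated_at = '2022-01-06 10:00:00', -- updated_at of the last processed record
    last_record_id = 12345                   -- id of the last processed record
WHERE top_level_namespace_id = 9970;         -- illustrative namespace id
```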
As of GitLab 14.7, the data loading is done manually. Once the feature is ready, the service will
be invoked periodically by the system via a cron job (this part is not implemented yet).
#### Record iteration
The batched iteration is implemented with the
[efficient IN operator](../database/efficient_in_operator_queries.md). The background job scans
all issue and merge request records in the group hierarchy, ordered by the `updated_at` and
`id` columns. For already aggregated groups, the `DataLoaderService` continues the aggregation
from a specific point, which saves time.
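Conceptually, one batch of this scan behaves like the following keyset query. This is a
simplified sketch with illustrative cursor values; the real implementation uses the efficient
IN operator technique linked above rather than a plain `IN` list.

```sql
-- Conceptual sketch of one iteration batch: keyset pagination ordered by
-- (updated_at, id), resuming from the cursor left by the previous batch.
SELECT id, updated_at
FROM issues
WHERE project_id IN (1, 2, 3) -- projects in the group hierarchy (illustrative)
  AND (updated_at, id) > ('2022-01-01 00:00:00', 500) -- previous batch's cursor
ORDER BY updated_at ASC, id ASC
LIMIT 500;
```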
Collecting the timestamp data happens on every iteration. The `DataLoaderService` determines which
stage events are configured within the group hierarchy and builds a query that selects the
required timestamps. The stage record knows which events are configured and the events know how to
select the timestamp columns.
Example of collected stage events: merge request merged, merge request created, and merge
request closed.
Generated SQL query for loading the timestamps:
```sql
SELECT
-- the list of columns depends on the configured stages