---
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Aggregated Value Stream Analytics
> - [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/335391) in GitLab 14.7.
DISCLAIMER:
This page contains information related to upcoming products, features, and functionality.
It is important to note that the information presented is for informational purposes only.
Please do not rely on this information for purchasing or planning purposes.
As with all projects, the items mentioned on this page are subject to change or delay.
The development, release, and timing of any products, features, or functionality remain at the
sole discretion of GitLab Inc.
This page provides a high-level overview of the aggregated backend for
Value Stream Analytics (VSA).
## Current status
As of GitLab 14.8, the aggregated VSA backend is used only in the `gitlab-org` group for testing
purposes. We plan to gradually roll it out to the rest of the groups in the next major release (15.0).
## Motivation
The aggregated backend aims to solve the performance limitations of the VSA feature and set it up
for long-term growth.
Our main database is not prepared for analytical workloads. Executing long-running queries can
affect the reliability of the application. For large groups, the current
implementation (old backend) is slow and, in some cases, doesn't even load due to the configured
statement timeout (15s).
The database queries in the old backend use the core domain models directly through the
`IssuableFinder` classes ([`MergeRequestsFinder`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/finders/merge_requests_finder.rb) and [`IssuesFinder`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/finders/issues_finder.rb)).
With the requested change to the [date range filters](https://gitlab.com/groups/gitlab-org/-/epics/6046),
this approach was no longer viable from a performance point of view.
Benefits of the aggregated VSA backend:
- Simpler database queries (fewer JOINs).
- Faster aggregations, because only a single table is accessed.
- Possibility to introduce further aggregations for improving the first page load time.
- Better performance for large groups (with many subgroups, projects, issues, and merge requests).
- Ready for database decomposition. The VSA-related database tables could live in a separate
  database with minimal development effort.
- Ready for keyset pagination, which can be useful for exporting the data.
- Possibility to implement more complex event definitions. For example, the start event could
  combine two timestamp columns, where the earliest value would be used.
### Example configuration

In this example, there are two independent value streams set up for two teams that use
different development workflows within the `Test Group` (top-level namespace).
The first value stream uses standard timestamp-based events for defining the stages. The second
value stream uses label events.
Each value stream and stage item from the example is persisted in the database. Notice that
the `Deployment` stage is identical in both value streams, which means the underlying
`stage_event_hash_id` is the same for both stages. The `stage_event_hash_id` reduces
the amount of data the backend collects and plays a vital role in database partitioning.
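For illustration, the aggregated event data could be stored in a structure similar to the
following sketch. This is a simplified, assumed column list, not the authoritative schema; the
actual table definitions live in the GitLab codebase.

```sql
-- Simplified sketch of an aggregated stage event table (columns are illustrative).
-- Because stage_event_hash_id identifies the stage configuration, identical stages
-- (such as the shared Deployment stage above) map to the same hash partition.
CREATE TABLE analytics_cycle_analytics_issue_stage_events (
  stage_event_hash_id bigint NOT NULL,
  issue_id bigint NOT NULL,
  group_id bigint NOT NULL,
  project_id bigint NOT NULL,
  start_event_timestamp timestamptz NOT NULL,
  end_event_timestamp timestamptz, -- NULL while the issue is still in the stage
  PRIMARY KEY (stage_event_hash_id, issue_id)
) PARTITION BY HASH (stage_event_hash_id);
```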
We expect value streams and stages to change rarely. When stages (their start and end events) are
changed, the aggregated data becomes stale. This is fixed by the periodic aggregation, which runs
every day.
### Feature availability
The aggregated VSA feature is available at the group and project level; however, the aggregated
backend is only available for Premium and Ultimate customers due to data storage and data
computation costs. Storing denormalized, aggregated data requires significant disk space.
## Aggregated value stream analytics architecture
The main idea behind the aggregated VSA backend is separation: VSA database tables and queries do
not use the core domain models directly (Issue, MergeRequest). This allows us to scale and
optimize VSA independently from the other parts of the application.
The architecture consists of two main mechanisms:
- Periodical data collection and loading (happens in the background).
- Querying the collected data (invoked by the user).
### Data loading
The aggregated nature of VSA comes from the periodic data loading. The system queries the core
domain models to collect the stage and timestamp data. This data is periodically inserted into the
VSA database tables.
High-level overview of the process for each top-level namespace with a Premium or Ultimate license:
1. Load all stages in the group.
1. Iterate over the issue and merge request records.
1. Based on the stage configurations (start and end event identifiers), collect the timestamp data.
1. `INSERT` or `UPDATE` the data in the VSA database tables (a simplified sketch of this upsert
   follows the list).
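The upsert step could look similar to the following sketch. The column list and values are
illustrative; the actual logic is implemented in the `DataLoaderService` described below.

```sql
-- Illustrative batched upsert into the aggregated issue stage event table.
-- When a (stage_event_hash_id, issue_id) pair already exists, the timestamps
-- are refreshed instead of inserting a duplicate row.
INSERT INTO analytics_cycle_analytics_issue_stage_events
  (stage_event_hash_id, issue_id, group_id, project_id,
   start_event_timestamp, end_event_timestamp)
VALUES
  (16, 1, 20, 40, '2022-01-01 10:00:00', '2022-01-05 12:00:00')
ON CONFLICT (stage_event_hash_id, issue_id)
DO UPDATE SET
  start_event_timestamp = EXCLUDED.start_event_timestamp,
  end_event_timestamp = EXCLUDED.end_event_timestamp;
```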
The data loading is implemented within the [`Analytics::CycleAnalytics::DataLoaderService`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/analytics/cycle_analytics/data_loader_service.rb)
class. There are groups containing a lot of data, so to avoid overloading the primary database,
the service performs operations in batches and enforces strict application limits:
- Load records in batches.
- Insert records in batches.
- Stop processing when a limit is reached, and schedule a background job to continue the
  processing later.
- Continue processing data from a specific point (see the sketch after this list).
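One way to picture the continuation point is a per-namespace cursor that records how far the
last run got. The table and column names below are hypothetical, purely for illustration:

```sql
-- Hypothetical sketch: persist the last processed position so a rescheduled
-- background job can resume the batched iteration instead of starting over.
UPDATE vsa_aggregation_cursors -- hypothetical table name
SET last_updated_at = '2022-01-06 10:00:00', -- updated_at of the last processed record
    last_record_id = 12345                   -- id of the last processed record
WHERE top_level_namespace_id = 9970;         -- illustrative namespace id
```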
As of GitLab 14.7, the data loading is done manually. Once the feature is ready, the service will
be invoked periodically by the system via a cron job (this part is not implemented yet).
#### Record iteration
The batched iteration is implemented with the
[efficient IN operator](../database/efficient_in_operator_queries.md). The background job scans
all issue and merge request records in the group hierarchy, ordered by the `updated_at` and
`id` columns. For already aggregated groups, the `DataLoaderService` continues the aggregation
from a specific point, which saves time.
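Conceptually, one batch of this scan behaves like the following keyset query. This is a
simplified sketch with illustrative cursor values; the real implementation uses the efficient
IN operator technique linked above rather than a plain `IN` list.

```sql
-- Conceptual sketch of one iteration batch: keyset pagination ordered by
-- (updated_at, id), resuming from the cursor left by the previous batch.
SELECT id, updated_at
FROM issues
WHERE project_id IN (1, 2, 3) -- projects in the group hierarchy (illustrative)
  AND (updated_at, id) > ('2022-01-01 00:00:00', 500) -- previous batch's cursor
ORDER BY updated_at ASC, id ASC
LIMIT 500;
```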
Collecting the timestamp data happens on every iteration. The `DataLoaderService` determines which
stage events are configured within the group hierarchy and builds a query that selects the
required timestamps. The stage record knows which events are configured and the events know how to
select the timestamp columns.
Example of collected stage events: merge request merged, merge request created, and merge
request closed.
Generated SQL query for loading the timestamps:
```sql
SELECT
-- the list of columns depends on the configured stages