Yorick Peterse authored
The table merge_request_diff_commits stores the names and emails of commit authors and committers. This data is stored as-is. This means that if I push 10 commits, GitLab stores my name and email address 20 times: 10 times for the author details, and 10 times for the committer details.

This commit adds a set of migrations and code changes to resolve this problem. Instead of storing names and emails for every occurrence, we'll store a unique set of names and emails in a separate table. The table merge_request_diff_commits in turn has two new columns: commit_author_id and committer_id. When creating rows for merge_request_diff_commits, we take the author and committer details and try to find an existing row in the newly added table. If no row exists, one is created.

The resulting setup is such that given a name and email pair of (X, Y), we only store a single occurrence of that pair. This reduces the amount of space necessary to store all this information. Based on our findings in https://gitlab.com/gitlab-org/gitlab/-/issues/331823, we estimate that over 95% of the data currently stored is duplicate data. For GitLab.com, we estimate this will translate to roughly 500 GB of data we no longer need to store.

The migration process for GitLab.com will likely take around two weeks, based on our estimates discussed in https://gitlab.com/gitlab-org/gitlab/-/issues/331823#note_595865070. The exact time may differ based on how lucky (or not) we get, and the exact number of rows that have to be migrated.

See the following issues for more information:

- https://gitlab.com/gitlab-org/gitlab/-/issues/331523#note_583654940
- https://gitlab.com/gitlab-org/gitlab/-/issues/331823

Changelog: performance
43a3ce63