    Reduce space needed for merge request diff commits · 43a3ce63
    Yorick Peterse authored
    The table merge_request_diff_commits stores the names and Emails of
    commit authors and committers. This data is stored as-is. This means
    that if I push 10 commits, GitLab stores my name and Email address 20
    times: 10 times for the author details, and 10 times for the committer
    details.
    
    This commit adds a set of migrations and code changes to resolve this
    problem. Instead of storing names and Emails for every occurrence, we'll
    store a unique set of names and Emails in a separate table. The table
    merge_request_diff_commits in turn has two new columns:
    commit_author_id and committer_id.
    
    When creating rows for merge_request_diff_commits, we take the author
    and committer details and try to find an existing row in the newly
    added table. If no row exists, one is created.
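
The find-or-create step described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code from this commit: the class CommitUserStore, the helper build_diff_commit_rows, and the hash keys on each commit are hypothetical names chosen for the example. Only the column names commit_author_id and committer_id come from the description above.

```ruby
# Hypothetical in-memory stand-in for the newly added table of
# unique (name, email) pairs. Not GitLab's actual model.
class CommitUserStore
  CommitUser = Struct.new(:id, :name, :email)

  def initialize
    @rows = {} # maps [name, email] => CommitUser
    @next_id = 1
  end

  # Returns the existing row for the (name, email) pair,
  # or creates and stores a new one if none exists yet.
  def find_or_create(name, email)
    @rows[[name, email]] ||= begin
      user = CommitUser.new(@next_id, name, email)
      @next_id += 1
      user
    end
  end

  # Number of unique (name, email) pairs stored so far.
  def size
    @rows.size
  end
end

# Builds diff-commit rows that reference people by ID instead of
# embedding the name and email in every row.
def build_diff_commit_rows(commits, store)
  commits.map do |commit|
    author = store.find_or_create(commit[:author_name], commit[:author_email])
    committer = store.find_or_create(commit[:committer_name], commit[:committer_email])

    { sha: commit[:sha], commit_author_id: author.id, committer_id: committer.id }
  end
end

# Ten commits authored and committed by the same person.
store = CommitUserStore.new
commits = Array.new(10) do |i|
  {
    sha: format('%040x', i),
    author_name: 'Alice', author_email: 'alice@example.com',
    committer_name: 'Alice', committer_email: 'alice@example.com'
  }
end

rows = build_diff_commit_rows(commits, store)
puts "rows: #{rows.size}, unique name/email pairs: #{store.size}"
```

In this sketch the ten commits produce ten rows but only a single stored name and email pair, where the old layout would have stored the pair 20 times. A real implementation would also need to handle concurrent inserts of the same pair, for example with a unique index on the pair.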
    
    The resulting setup is such that given a name and Email pair of (X, Y),
    we only store a single occurrence of that pair. This reduces the amount
    of space necessary to store all this information. Based on our findings
    in https://gitlab.com/gitlab-org/gitlab/-/issues/331823, we estimate
    that over 95% of the data currently stored is duplicate data. For
    GitLab.com, we estimate this will translate to roughly 500 GB of data we
    no longer need to store.
    
    The migration process for GitLab.com will likely take around two weeks,
    based on our estimates discussed in
    https://gitlab.com/gitlab-org/gitlab/-/issues/331823#note_595865070. The
    exact time may differ based on how lucky (or not) we get, and the exact
    number of rows that have to be migrated.
    
    See the following issues for more information:
    
    - https://gitlab.com/gitlab-org/gitlab/-/issues/331523#note_583654940
    - https://gitlab.com/gitlab-org/gitlab/-/issues/331823
    
    Changelog: performance
schema_spec.rb 12 KB