Merge branch '337283-document-patterns-for-removing-joins' into 'master'

Add docs for how to fix cross-joins to ci*_ tables See merge request gitlab-org/gitlab!67203

Merge branch '337283-document-patterns-for-removing-joins' into 'master'
Add docs for how to fix cross-joins to ci*_ tables See merge request gitlab-org/gitlab!67203
a3357f19 · Dylan Griffith · 0620c513 · a6d7ebf5 · a3357f19
Commit a3357f19 authored Aug 03, 2021 by Dylan Griffith
Hide whitespace changes
Inline Side-by-side

Showing with 146 additions and 0 deletions

doc/development/database/multiple_databases.md doc/development/database/multiple_databases.md +146 -0

No files found.
--- a/doc/development/database/multiple_databases.md
+++ b/doc/development/database/multiple_databases.md
@@ -99,3 +99,149 @@ and their tables must be placed in two directories for now:
 - `db/ci_migrate`

 We aim to keep the schema for both tables the same across both databases.
+
+### Removing joins between `ci_*` and non `ci_*` tables
+
+We are planning on moving all the `ci_*` tables to a separate database so
+referencing `ci_*` tables with other tables will not be possible. This means,
+that using any kind of `JOIN` in SQL queries will not work. We have identified
+already many such examples that need to be fixed in
+<https://gitlab.com/groups/gitlab-org/-/epics/6289> .
+
+The following are some real examples that have resulted from this and these
+patterns may apply to future cases.
+
+#### Remove the code
+
+The simplest solution we've seen several times now has been an existing scope
+that is unused. This is the easiest example to fix. So the first step is to
+investigate if the code is unused and then simply remove it. These are some
+real examples:
+
+- <https://gitlab.com/gitlab-org/gitlab/-/merge_requests/67162>
+- <https://gitlab.com/gitlab-org/gitlab/-/merge_requests/66714>
+- <https://gitlab.com/gitlab-org/gitlab/-/merge_requests/66503>
+
+There may be more examples where the code is used, but we can evaluate
+if we need it or if the feature should behave this way.
+Before complicating things by adding new columns and tables,
+consider if you can simplify the solution and still meet the requirements.
+One case being evaluated involves changing how certain `UsageData` is
+calculated to remove a join query in
+<https://gitlab.com/gitlab-org/gitlab/-/issues/336170>. This is a good candidate
+to evaluate, because `UsageData` is not critical to users and it may be possible
+to get a similarly useful metric with a simpler approach. Alternatively we may
+find that nobody is using these metrics, so we can remove them.
+
+#### De-normalize some foreign key to the table
+
+De-normalization refers to adding redundant precomputed (duplicated) data to
+a table to simplify certain queries or to improve performance. In this
+case, it can be useful when you are doing a join that involves three tables, where
+you are joining through some intermediate table.
+
+Generally when modeling a database schema, a "normalized" structure is
+preferred because of the following reasons:
+
+- Duplicate data uses extra storage.
+- Duplicate data needs to be kept in sync.
+
+Sometimes normalized data is less performant so de-normalization has been a
+common technique GitLab has used to improve the performance of database queries
+for a while. The above problems are mitigated when the following conditions are
+met:
+
+1. There isn't much data (for example, it's just an integer column).
+1. The data does not update often (for example, the `project_id` column is almost
+   never updated for most tables).
+
+One example we found was the `security_scans` table. This table has a foreign
+key `security_scans.build_id` which allows you to join to the build. Therefore
+you could join to the project like so:
+
+```sql
+select projects.* from security_scans
+inner join ci_builds on security_scans.build_id = ci_builds.id
+inner join projects on ci_builds.project_id = projects.id
+```
+
+The problem with this query is that `ci_builds` is in a different database
+from the other two tables.
+
+The solution in this case is to add the `project_id` column to
+`security_scans`. This doesn't use much extra storage, and due to the way
+these features work, it's never updated (a build never moves projects).
+
+This simplified the query to:
+
+```sql
+select projects.* from security_scans
+inner join projects on security_scans.project_id = projects.id
+```
+
+This also improves performance because you don't need to join through an extra
+table.
+
+You can see this approach implemented in
+<https://gitlab.com/gitlab-org/gitlab/-/merge_requests/66963> . This MR also
+de-normalizes `pipeline_id` to fix a similar query.
+
+#### De-normalize into an extra table
+
+Sometimes the previous de-normalization (adding an extra column) doesn't work for
+your specific case. This may be due to the fact that your data is not 1:1, or
+because the table you're adding to is already too wide (for example, the `projects`
+table shouldn't have more columns added).
+
+In this case you may decide to just store the extra data in a separate table.
+
+One example where this approach is being used was to implement the
+`Project.with_code_coverage` scope. This scope was essentially used to narrow
+down a list of projects to only those that have at one point in time used code
+coverage features. This query (simplified) was:
+
+```sql
+select projects.* from projects
+inner join ci_daily_build_group_report_results on ci_daily_build_group_report_results.project_id = projects.id
+where ((data->'coverage') is not null)
+and ci_daily_build_group_report_results.default_branch = true
+group by projects.id
+```
+
+This work is still in progress but the current plan is to introduce a new table
+called `projects_with_ci_feature_usage` which has 2 columns `project_id` and
+`ci_feature`. This table would be written to the first time a project creates a
+`ci_daily_build_group_report_results` for code coverage. Therefore the new
+query would be:
+
+```sql
+select projects.* from projects
+inner join projects_with_ci_feature_usage on projects_with_ci_feature_usage.project_id = projects.id
+where projects_with_ci_feature_usage.ci_feature = 'code_coverage'
+```
+
+The above example uses as a text column for simplicity but we should probably
+use an [enum](../creating_enums.md) to save space.
+
+The downside of this new design is that this may need to be
+updated (removed if the `ci_daily_build_group_report_results` is deleted).
+Depending on your domain, however, this may not be necessary because deletes are
+edge cases or impossible, or because the user impact of seeing the project on the
+list page may not be problematic. It's also possible to implement the
+logic to delete these rows if or whenever necessary in your domain.
+
+Finally, this de-normalization and new query also improves performance because
+it does less joins and needs less filtering.
+
+#### Summary of cross-join removal patterns
+
+A quick checklist for fixing a specific join query would be:
+
+1. Is the code even used? If not just remove it
+1. If the code is used, then is this feature even used or can we implement the
+   feature in a simpler way and still meet the requirements. Always prefer the
+   simplest option.
+1. Can we remove the join if we de-normalize the data you are joining to by
+   adding a new column
+1. Can we remove the join by adding a new table in the correct database that
+   replicates the minimum data needed to do the join