Add documentation on adding new Redis instances

Commit d59d4deb authored Sep 23, 2021 by Sean McGivern Committed by Marcel Amirault Sep 23, 2021
Showing with 137 additions and 0 deletions

doc/development/index.md +1 -0

doc/development/redis.md +5 -0

doc/development/redis/new_redis_instance.md +131 -0

--- a/doc/development/index.md
+++ b/doc/development/index.md
@@ -215,6 +215,7 @@ the [reviewer values](https://about.gitlab.com/handbook/engineering/workflow/rev
 - [How to dump production data to staging](db_dump.md)
 - [Geo development](geo.md)
 - [Redis guidelines](redis.md)
+  - [Adding a new Redis instance](redis/new_redis_instance.md)
 - [Sidekiq guidelines](sidekiq_style_guide.md) for working with Sidekiq workers
 - [Working with Gitaly](gitaly.md)
 - [Elasticsearch integration docs](elasticsearch.md)

--- a/doc/development/redis.md
+++ b/doc/development/redis.md
@@ -6,11 +6,14 @@ info: To determine the technical writer assigned to the Stage/Group associated w

 # Redis guidelines

+## Redis instances
+
 GitLab uses [Redis](https://redis.io) for the following distinct purposes:

 - Caching (mostly via `Rails.cache`).
 - As a job processing queue with [Sidekiq](sidekiq_style_guide.md).
 - To manage the shared application state.
+- To store CI trace chunks.
 - As a Pub/Sub queue backend for ActionCable.

 In most environments (including the GDK), all of these point to the same
@@ -29,6 +32,8 @@ more often than it is read.
 If [Geo](geo.md) is enabled, each Geo node gets its own, independent Redis
 database.

+We have [development documentation on adding a new Redis instance](redis/new_redis_instance.md).
+
 ## Key naming

 Redis is a flat namespace with no hierarchy, which means we must pay attention

--- a/doc/development/redis/new_redis_instance.md
+++ b/doc/development/redis/new_redis_instance.md
+---
+stage: none
+group: unassigned
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Add a new Redis instance
+
+GitLab can make use of multiple [Redis instances](../redis.md#redis-instances).
+These instances are functionally partitioned so that, for example, we
+can store [CI trace chunks](../../administration/job_logs.md#incremental-logging-architecture)
+in one Redis instance while storing sessions in another.
+
+From time to time we might want to add a new Redis instance. Typically this will
+be a functional partition split from one of the existing instances such as the
+cache or shared state. This document describes an approach
+for adding a new Redis instance that handles existing data, based on
+prior examples:
+
+- [Dedicated Redis instance for Trace Chunk storage](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/462).
+
+This document does not cover the operational side of preparing and configuring
+the new Redis instance in detail, but the example epics do contain information
+on previous approaches to this.
+
+## Step 1: Support configuring the new instance
+
+Before we can switch any features to using the new instance, we have to support
+configuring it and referring to it in the codebase. We must support the
+main installation types:
+
+- Source installs (including development environments) - [example MR](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/62767)
+- Omnibus - [example MR](https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/5316)
+- Helm charts - [example MR](https://gitlab.com/gitlab-org/charts/gitlab/-/merge_requests/2031)
+
+### Fallback instance
+
+In the application code, we need to define a fallback instance in case the new
+instance is not configured. For example, if a GitLab instance has already
+configured a separate shared state Redis, and we are partitioning data from the
+shared state Redis, the new instance should use the shared state Redis
+configuration when it has no configuration of its own. Otherwise, we could break
+GitLab instances that don't configure the new Redis instance as soon as it's
+available.
+
+You can [define a `.config_fallback` method](https://gitlab.com/gitlab-org/gitlab/-/blob/a75471dd744678f1a59eeb99f71fca577b155acd/lib/gitlab/redis/wrapper.rb#L69-87)
+on the new instance's class, which inherits from `Gitlab::Redis::Wrapper` (the
+base class for all Redis instances). The method returns the instance to use if
+this one is not configured. If we were adding a `Foo` instance that should fall
+back to `SharedState`, we can do that like this:
+
+```ruby
+module Gitlab
+  module Redis
+    class Foo < ::Gitlab::Redis::Wrapper
+      # The data we store on Foo used to be stored on SharedState.
+      def self.config_fallback
+        SharedState
+      end
+    end
+  end
+end
+```
+
+We should also add specs like those in
+[`trace_chunks_spec.rb`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/spec/lib/gitlab/redis/trace_chunks_spec.rb)
+to ensure that this fallback works correctly.
+
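As a rough illustration of the fallback behaviour described above, the following self-contained sketch resolves an instance's connection URL from its own configuration first and from its fallback otherwise. `WrapperSketch`, `CONFIGURED`, and the example URLs are hypothetical stand-ins invented for this sketch. They are not the real `Gitlab::Redis::Wrapper` implementation, which reads its settings from configuration files and environment variables.

```ruby
# Hypothetical stand-in for the base class; NOT the real Gitlab::Redis::Wrapper.
class WrapperSketch
  # Illustrative registry of configured instances (store name => URL).
  # Only the shared state instance is configured here.
  CONFIGURED = { 'shared_state' => 'redis://shared-state.example:6379' }

  class << self
    # Subclasses override this to name the instance they were split from.
    def config_fallback
      nil
    end

    # 'SharedState' => 'shared_state', 'Foo' => 'foo'
    def store_name
      name.split('::').last.gsub(/([a-z])([A-Z])/, '\1_\2').downcase
    end

    # Use this instance's own configuration if present, otherwise the
    # fallback's. Returns nil when neither is configured.
    def url
      CONFIGURED[store_name] || config_fallback&.url
    end
  end
end

class SharedState < WrapperSketch; end

class Foo < WrapperSketch
  # The data we store on Foo used to be stored on SharedState.
  def self.config_fallback
    SharedState
  end
end
```

With only `shared_state` configured, `Foo.url` resolves to the shared state URL. Once a `foo` entry is added to the registry, `Foo.url` uses it and `SharedState.url` is unaffected. A spec for a real instance would assert the same two cases.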
+## Step 2: Support writing to and reading from the new instance
+
+When migrating to the new instance, we must account for cases where data is
+either on:
+
+- The 'old' (original) instance.
+- The new one that we have just added support for.
+
+As a result we may need to support reading from and writing to both
+instances, depending on some condition.
+
+The exact condition to use varies depending on the data to be migrated. For
+the trace chunks case above, there was already a database column indicating where the
+data was stored (as there are other storage options than Redis).
+
+This step may not apply if the data has a very short lifetime (a few minutes at most)
+and is not critical. In that case, we
+may decide that it is OK to incur a small amount of data loss and switch
+over through configuration only.
+
+If there is not a more natural way to mark where the data is stored, using a
+[feature flag](../feature_flags/index.md) may be convenient:
+
+- It does not require an application restart to take effect.
+- It applies to all application instances (Sidekiq, API, web, etc.) at
+  the same time.
+- It supports incremental rollout - ideally by actor (project, group,
+  user, etc.) - so that we can monitor for errors and roll back easily.
+
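The flag-based approach above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the GitLab codebase: `DualStoreSketch` is a hypothetical class, plain hashes stand in for the two Redis instances, and the `use_new` callable stands in for a feature flag check (in GitLab this would typically be a `Feature.enabled?` call with a flag name of your choosing). The read policy shown, which tries the active instance first and then the other one, is only one possible choice. It keeps data readable both after switching over and after rolling back.

```ruby
# Hypothetical sketch: hashes stand in for the old and new Redis instances.
class DualStoreSketch
  def initialize(old_store:, new_store:, use_new:)
    @old_store = old_store # for example, the shared state instance
    @new_store = new_store # the newly added instance
    @use_new = use_new     # callable(actor) => true/false, stands in for a feature flag
  end

  # Write only to the instance that is currently active for this actor.
  def write(actor, key, value)
    ordered_stores(actor).first[key] = value
  end

  # Read from the active instance first. If the key is missing there, try the
  # other instance, so data written before a switch (or a rollback) is found.
  def read(actor, key)
    primary, secondary = ordered_stores(actor)
    primary.key?(key) ? primary[key] : secondary[key]
  end

  private

  # Returns [active_store, inactive_store] for the given actor.
  def ordered_stores(actor)
    @use_new.call(actor) ? [@new_store, @old_store] : [@old_store, @new_store]
  end
end
```

Because the condition is evaluated per actor on every call, enabling the flag for a single project moves only that project's writes to the new instance, which matches the incremental rollout described above.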
+## Step 3: Migrate the data
+
+We then need to configure the new instance for GitLab.com's production and
+staging environments. Where possible, test this change on staging first, to at
+least make sure that basic usage continues to work.
+
+After that is done, we can roll out the change to production. Ideally this would
+be in an incremental fashion, following the
+[standard incremental rollout](../feature_flags/controls.md#rolling-out-changes)
+documentation for feature flags.
+
+When we have been using the new instance 100% of the time in production for a
+while and there are no issues, we can proceed to the cleanup step.
+
+## Step 4: Clean up after the migration
+
+<!-- markdownlint-disable MD044 -->
+We may choose to keep the migration paths or remove them, depending on whether
+or not we expect self-managed instances to perform this migration.
+[gitlab-com/gl-infra/scalability#1131](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1131#note_603354746)
+contains a discussion on this topic for the trace chunks feature flag. As in
+that case, we may decide that the maintenance cost of supporting the migration
+code outweighs the benefit of allowing self-managed instances to perform this
+migration seamlessly, particularly if we expect self-managed instances to cope
+without this functional partition.
+<!-- markdownlint-enable MD044 -->
+
+If we decide to keep the migration code:
+
+- We should document the migration steps.
+- If we used a feature flag, we should ensure it's an [ops type feature
+  flag](../feature_flags/index.md#ops-type), as these are long-lived flags.
+
+Otherwise, we can remove the migration code and any feature flags, and conclude the project.