Merge branch 'docs/mk-add-planned-failover-doc' into 'master'

Add GitLab Geo Planned Failover doc

Closes #4397

See merge request gitlab-org/gitlab-ee!4392

Commit b9edd4e8 authored Feb 06, 2018 by Stan Hu

Showing with 63 additions and 14 deletions

doc/gitlab-geo/disaster-recovery.md +29 -14

doc/gitlab-geo/planned-failover.md +34 -0

--- a/doc/gitlab-geo/disaster-recovery.md
+++ b/doc/gitlab-geo/disaster-recovery.md
@@ -18,9 +18,20 @@ This process promotes a secondary Geo replica to a primary. To regain
 geographical redundancy as quickly as possible, you should add a new secondary
 immediately after following these instructions.

-### Step 1. Promoting a secondary Geo replica
+### Step 1. Allow replication to finish if possible

-1. SSH into your **primary** to stop and disable GitLab.
+If the secondary is still replicating data from the primary, follow
+[the Planned Failover doc](planned-failover.md) as closely as possible in
+order to avoid unnecessary data loss.
+
+### Step 2. Permanently disable the primary
+
+If an outage on your primary happens, you should do everything possible to
+avoid a split-brain situation where writes can occur to two different GitLab
+instances, complicating recovery efforts. So to prepare for the failover, we
+must disable the primary.
+
+1. SSH into your **primary** to stop and disable GitLab, if possible.

    ```bash
    sudo gitlab-ctl stop
@@ -41,16 +52,20 @@ immediately after following these instructions.
    yum remove gitlab-ee
    ```

-    Preventing the original primary from coming back online during this process
-    is necessary prevent data from being mistakenly added to it. Any data added
-    after the failover process has begun will **not** be be replicated to the
-    newly promoted primary.
+1. If you do not have SSH access to your primary, take the machine offline and
+    prevent it from rebooting by any means at your disposal.
+
+    Since there are many ways you may prefer to accomplish this, we will avoid a
+    single recommendation. You may need to:
+
+    * Reconfigure load balancers
+    * Change DNS records (e.g. point the primary DNS record to the secondary node in order to stop usage of the primary)
+    * Stop virtual servers
+    * Block traffic through a firewall
+    * Revoke object storage permissions from the primary
+    * Physically disconnect a machine

-    If you do not have SSH access to your primary, take the machine offline and
-    prevent it from rebooting by any means at your disposal. Depending on the
-    nature of your primary, this may mean physically disconnecting the machine,
-    stopping a virtual server, reconfiguring load balancers, or changing DNS
-    records (see next step).
+### Step 3. Promoting a secondary Geo replica

 1. SSH in to your **secondary** and login as root:

@@ -69,7 +84,7 @@ immediately after following these instructions.

    A new secondary should not be added at this time. If you want to add a new
    secondary, do this after you have completed the entire process of promoting
-    the secondary to the primary .
+    the secondary to the primary.

 1. Promote the secondary to primary. Execute:

@@ -81,7 +96,7 @@ immediately after following these instructions.
   previously for the secondary.
 1. Success! The secondary has now been promoted to primary.

-### Step 2. (Optional) Updating the primary domain's DNS record
+### Step 4. (Optional) Updating the primary domain's DNS record

 Updating the DNS records for the primary domain to point to the secondary
 will prevent the need to update all references to the primary domain to the
@@ -123,7 +138,7 @@ secondary domain, like changing Git remotes and API URLs.
    If you updated the DNS records for the primary domain, these changes may
    not have yet propagated depending on the previous DNS records TTL.

-### Step 3. (Optional) Add secondary Geo replicas to a promoted primary
+### Step 5. (Optional) Add secondary Geo replicas to a promoted primary

 Promoting a secondary to primary using the process above does not enable
 GitLab Geo on the new primary.

--- a/doc/gitlab-geo/planned-failover.md
+++ b/doc/gitlab-geo/planned-failover.md
+# GitLab Geo Planned Failover
+
+A planned failover is similar to a disaster recovery scenario, except you are able
+to notify users of the maintenance window, and allow data to finish replicating to
+secondaries.
+
+Please read this entire document as well as
+[GitLab Geo Disaster Recovery](disaster-recovery.md) before proceeding.
+
+### Notify users of scheduled maintenance
+
+1. On the primary, in Admin Area > Messages, add a broadcast message.
+
+    Check Admin Area > Geo Nodes to estimate how long it will take to finish syncing.
+
+    ```
+    We are doing scheduled maintenance at XX:XX UTC, expected to take less than 1 hour.
+    ```
+
+1. On the secondary, you may need to clear the cache for the broadcast message to show up.
+
+### Block primary traffic
+
+1. At the scheduled time, using your cloud provider or your node's firewall, block HTTP and SSH traffic to/from the primary except for your IP and the secondary's IP.
+
+### Allow replication to finish as much as possible
+
+1. On the secondary, navigate to Admin Area > Geo Nodes and wait until all replication progress is 100% on the secondary "Current node".
+
+1. Navigate to Admin Area > Monitoring > Background Jobs > Queues and wait until the "geo" queues drop, ideally to 0.
+
+### Promote the secondary
+
+1. Finally, follow [GitLab Geo Disaster Recovery](disaster-recovery.md) to promote the secondary to a primary.
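
The "Block primary traffic" step in the new doc leaves the firewall mechanism open. As a rough illustration only, the sketch below prints `iptables` rules that would restrict inbound SSH and HTTP(S) on the primary to a short allow-list. The function name `geo_block_rules`, the port list `22,80,443`, and the example IP addresses are assumptions made for this sketch, not part of the merge request. It covers inbound traffic only, and nothing is applied until the printed commands are run as root.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GitLab): print iptables rules that limit
# inbound SSH/HTTP(S) on the primary to the given source IPs.
# The rules are only printed; review them before running anything as root.
geo_block_rules() {
  local ports="22,80,443"   # assumption: default SSH, HTTP, and HTTPS ports
  local ip
  # ACCEPT rules come first so allowed sources match before the final REJECT.
  for ip in "$@"; do
    echo "iptables -A INPUT -p tcp -s ${ip} -m multiport --dports ${ports} -j ACCEPT"
  done
  # Everything else on those ports is rejected.
  echo "iptables -A INPUT -p tcp -m multiport --dports ${ports} -j REJECT"
}

# Example: allow an admin workstation and the secondary
# (placeholder addresses from the documentation IP ranges).
geo_block_rules 203.0.113.10 198.51.100.20
```

Because `-A` appends to the `INPUT` chain, any broader ACCEPT rule already present earlier in the chain would still win, so existing rules should be checked first. A cloud provider's security groups may be a better fit than host-level rules, as the doc itself suggests.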