Commit b9edd4e8 authored by Stan Hu

Merge branch 'docs/mk-add-planned-failover-doc' into 'master'

Add GitLab Geo Planned Failover doc

Closes #4397

See merge request gitlab-org/gitlab-ee!4392
parents 8b200096 ddc26b7d
......@@ -18,9 +18,20 @@ This process promotes a secondary Geo replica to a primary. To regain
geographical redundancy as quickly as possible, you should add a new secondary
immediately after following these instructions.
### Step 1. Promoting a secondary Geo replica
### Step 1. Allow replication to finish if possible
1. SSH into your **primary** to stop and disable GitLab.
If the secondary is still replicating data from the primary, follow
[the Planned Failover doc](planned-failover.md) as closely as possible in
order to avoid unnecessary data loss.
### Step 2. Permanently disable the primary
If an outage on your primary happens, you should do everything possible to
avoid a split-brain situation where writes can occur to two different GitLab
instances, complicating recovery efforts. So to prepare for the failover, we
must disable the primary.
1. SSH into your **primary** to stop and disable GitLab, if possible.
```bash
sudo gitlab-ctl stop
......@@ -41,16 +52,20 @@ immediately after following these instructions.
yum remove gitlab-ee
```
Preventing the original primary from coming back online during this process
is necessary to prevent data from being mistakenly added to it. Any data added
after the failover process has begun will **not** be replicated to the
newly promoted primary.
1. If you do not have SSH access to your primary, take the machine offline and
prevent it from rebooting by any means at your disposal.
Since there are many ways you may prefer to accomplish this, we will avoid a
single recommendation. You may need to:
* Reconfigure load balancers
* Change DNS records (e.g. point the primary DNS record to the secondary node in order to stop usage of the primary)
* Stop virtual servers
* Block traffic through a firewall
* Revoke object storage permissions from the primary
* Physically disconnect a machine
If you do not have SSH access to your primary, take the machine offline and
prevent it from rebooting by any means at your disposal. Depending on the
nature of your primary, this may mean physically disconnecting the machine,
stopping a virtual server, reconfiguring load balancers, or changing DNS
records (see next step).
### Step 3. Promoting a secondary Geo replica
1. SSH in to your **secondary** and log in as root:
......@@ -69,7 +84,7 @@ immediately after following these instructions.
A new secondary should not be added at this time. If you want to add a new
secondary, do this after you have completed the entire process of promoting
the secondary to the primary .
the secondary to the primary.
1. Promote the secondary to primary. Execute:
......@@ -81,7 +96,7 @@ immediately after following these instructions.
previously for the secondary.
1. Success! The secondary has now been promoted to primary.
### Step 2. (Optional) Updating the primary domain's DNS record
### Step 4. (Optional) Updating the primary domain's DNS record
Updating the DNS records for the primary domain to point to the secondary
will prevent the need to update all references to the primary domain to the
......@@ -123,7 +138,7 @@ secondary domain, like changing Git remotes and API URLs.
If you updated the DNS records for the primary domain, these changes may
not have propagated yet, depending on the TTL of the previous DNS records.
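One way to check whether the change has propagated is to compare what a resolver returns with the address you expect. The sketch below is only illustrative: `resolves_to` is a hypothetical helper, and the hostname and address in the usage comments are placeholders, not values from this document.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GitLab): reads resolved addresses on stdin,
# one per line, and succeeds only if the expected address is among them.
resolves_to() {
  local expected="$1"
  # -F fixed string, -x match the whole line, -q quiet
  grep -Fxq -- "$expected"
}

# Usage (placeholder hostname and address):
#   dig +short A gitlab.example.com | resolves_to 203.0.113.20 && echo "DNS updated"
#   dig +noall +answer A gitlab.example.com   # second column is the remaining TTL
```

Results can differ between resolvers until the old TTL expires, so a single successful check does not guarantee that every client sees the new record.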
### Step 3. (Optional) Add secondary Geo replicas to a promoted primary
### Step 5. (Optional) Add secondary Geo replicas to a promoted primary
Promoting a secondary to primary using the process above does not enable
GitLab Geo on the new primary.
......
# GitLab Geo Planned Failover
A planned failover is similar to a disaster recovery scenario, except you are able
to notify users of the maintenance window, and allow data to finish replicating to
secondaries.
Please read this entire document as well as
[GitLab Geo Disaster Recovery](disaster-recovery.md) before proceeding.
### Notify users of scheduled maintenance
1. On the primary, in Admin Area > Messages, add a broadcast message.
Check Admin Area > Geo Nodes to estimate how long it will take to finish syncing.
```
We are doing scheduled maintenance at XX:XX UTC, expected to take less than 1 hour.
```
1. On the secondary, you may need to clear the cache for the broadcast message to show up.
### Block primary traffic
1. At the scheduled time, using your cloud provider or your node's firewall, block HTTP and SSH traffic to/from the primary except for your IP and the secondary's IP.
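If you use the node's own firewall, the rules might look roughly like the output of the sketch below. It assumes `iptables`, assumes that HTTP, HTTPS, and SSH are served on ports 80, 443, and 22, and uses placeholder addresses. `block_primary_rules` is a hypothetical helper that only prints the rules so that you can review them before applying anything. It covers inbound traffic only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints (does not apply) iptables rules that accept
# SSH/HTTP/HTTPS only from the given addresses and reject everyone else.
block_primary_rules() {
  local ip port
  for port in 22 80 443; do   # assumed ports; adjust to your setup
    for ip in "$@"; do
      echo "iptables -A INPUT -p tcp -s ${ip} --dport ${port} -j ACCEPT"
    done
    echo "iptables -A INPUT -p tcp --dport ${port} -j REJECT"
  done
}

# Placeholder addresses: your own IP and the secondary's IP.
block_primary_rules 203.0.113.10 198.51.100.20
```

Because the rules are appended in order, the `ACCEPT` entries for the allowed addresses must come before the `REJECT` entry for each port. Rules that already exist in the `INPUT` chain may also change the outcome.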
### Allow replication to finish as much as possible
1. On the secondary, navigate to Admin Area > Geo Nodes and wait until all replication progress is 100% on the secondary "Current node".
1. Navigate to Admin Area > Monitoring > Background Jobs > Queues and wait until the "geo" queues drop to 0, ideally.
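If you would rather script the wait than refresh the page, a generic polling loop is one option. This is a sketch under assumptions: `wait_for_zero` is a hypothetical helper, and the command that reports the queue size (shown as `geo_queue_size` in the usage comment) is something you must supply yourself; this document does not define one.

```bash
#!/usr/bin/env bash
# Hypothetical helper: runs the given command repeatedly until it prints 0.
# The command must print a single non-negative integer on stdout.
wait_for_zero() {
  local interval="$1"; shift
  local n
  while true; do
    n="$("$@")" || return 1          # give up if the size cannot be read
    [ "$n" -eq 0 ] && return 0       # queue drained
    echo "still ${n} jobs queued; checking again in ${interval}s" >&2
    sleep "$interval"
  done
}

# Usage (geo_queue_size is a placeholder for your own command):
#   wait_for_zero 30 geo_queue_size && echo "geo queues are empty"
```

The helper returns non-zero when the size command itself fails, so a monitoring error is not mistaken for an empty queue.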
### Promote the secondary
1. Finally, follow [GitLab Geo Disaster Recovery](disaster-recovery.md) to promote the secondary to a primary.