Commit fa82064f authored by Achilleas Pipinellis

Refactor DR docs

parent 70ed52c8
# Bring a demoted primary node back online
After a failover, it is possible to fail back to the demoted primary to
restore your original configuration. This process consists of two steps:
1. Making the old primary a secondary
1. Promoting a secondary to a primary
## Configure the former primary to be a secondary
Since the former primary will be out of sync with the current primary, the first
step is to bring the former primary up to date. There is one downside though:
some uploads and repositories that have been deleted during an idle period of
the primary node will not be deleted from the disk, but the overall sync will
be much faster. As an alternative, you can set up a
[GitLab instance from scratch](../replication/index.md#setup-instructions) to
work around this downside.
To bring the former primary up to date:
1. SSH into the former primary that has fallen behind
1. Make sure all the services are up:
```bash
sudo gitlab-ctl start
```
NOTE: **Note:** If you [disabled the primary permanently](index.md#step-2-permanently-disable-the-primary),
you need to undo those steps now. For Debian/Ubuntu you just need to run
`sudo systemctl enable gitlab-runsvdir` (see the example after this list).
For CentOS 6, you need to install the GitLab instance from scratch and set it
up as a secondary node by following the
[setup instructions](../replication/index.md#setup-instructions). In this case
you don't need to follow the next step.
1. [Setup database replication](../replication/database.md). Note that in this
case, primary refers to the current primary, and secondary refers to the
former primary.
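
If you had [disabled the primary permanently](index.md#step-2-permanently-disable-the-primary)
on a Debian/Ubuntu node, a minimal sketch of the undo described in the note
above might look like this (the final status check is an extra verification
step, not part of the original instructions):

```bash
# Re-enable and start the Omnibus service supervisor that was disabled
# when the node was taken offline (Debian/Ubuntu with systemd assumed)
sudo systemctl enable gitlab-runsvdir
sudo systemctl start gitlab-runsvdir

# Start all GitLab services and confirm they are running
sudo gitlab-ctl start
sudo gitlab-ctl status
```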
If you have lost your original primary, follow the
[setup instructions](../replication/index.md#setup-instructions) to set up a new secondary.
## Promote the secondary to primary
When the initial replication is complete and the primary and secondary are
closely in sync, you can do a [planned failover](planned_failover.md).
## Restore the secondary node
If your objective is to have two nodes again, you need to bring your secondary
node back online as well by repeating the first step
([configure the former primary to be a secondary](#configure-the-former-primary-to-be-a-secondary))
for the secondary node.
# Disaster Recovery
Geo replicates your database and your Git repositories. We will
support and replicate more data in the future, which will enable you to
fail over with minimal effort in a disaster situation.
See [Geo current limitations](../replication/index.md#current-limitations)
for more information.
## Promoting secondary Geo replica in single-secondary configurations
We don't currently provide an automated way to promote a Geo replica and do a
failover, but you can do it manually if you have `root` access to the machine.
This process promotes a secondary Geo replica to a primary. To regain
geographical redundancy as quickly as possible, you should add a new secondary
immediately after following these instructions.
### Step 1. Allow replication to finish if possible
If the secondary is still replicating data from the primary, follow
[the planned failover docs](planned_failover.md) as closely as possible in
order to avoid unnecessary data loss.
### Step 2. Permanently disable the primary
CAUTION: **Warning:**
If a primary goes offline, there may be data saved on the primary
that has not been replicated to the secondary. This data should be treated
as lost if you proceed.
If an outage on your primary happens, you should do everything possible to
avoid a split-brain situation where writes can occur to two different GitLab
instances, complicating recovery efforts. So to prepare for the failover, we
must disable the primary.
1. SSH into your **primary** to stop and disable GitLab, if possible:
```bash
sudo gitlab-ctl stop
```
1. If you do not have SSH access to your primary, take the machine offline and
prevent it from rebooting by any means at your disposal.
Since there are many ways you may prefer to accomplish this, we will avoid a
single recommendation. You may need to:
- Reconfigure the load balancers
- Change DNS records (e.g., point the primary DNS record to the secondary node in order to stop usage of the primary)
- Stop the virtual servers
- Block traffic through a firewall
- Revoke object storage permissions from the primary
- Physically disconnect a machine
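
For the first option above, where SSH access to the primary is still
available, a minimal sketch of stopping GitLab and keeping it down across
reboots, assuming an Omnibus installation on a systemd-based distribution:

```bash
# Stop all GitLab services on the primary
sudo gitlab-ctl stop

# Prevent the Omnibus service supervisor from bringing GitLab back on reboot
sudo systemctl disable gitlab-runsvdir
```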
### Step 3. Promoting a secondary Geo replica
1. SSH into your **secondary** and login as root:

```bash
sudo -i
```
1. Edit `/etc/gitlab/gitlab.rb` to reflect its new status as primary by
removing the following line:
```ruby
## REMOVE THIS LINE
geo_secondary_role['enable'] = true
```

1. Verify you can connect to the newly promoted primary using the URL used
previously for the secondary.
1. Success! The secondary has now been promoted to primary.
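
Because part of this step list is elided in the diff above, here is an
illustrative recap of the promotion on an Omnibus node; the
`geo:set_secondary_as_primary` Rake task name is assumed from the Geo
documentation of this era rather than quoted from this page:

```bash
# After removing the secondary role line (for example
# `geo_secondary_role['enable'] = true`) from /etc/gitlab/gitlab.rb,
# apply the configuration change
sudo gitlab-ctl reconfigure

# Flag this node as the primary (assumed Rake task name)
sudo gitlab-rake geo:set_secondary_as_primary
```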
### Step 4. (Optional) Updating the primary domain DNS record
Updating the DNS records for the primary domain to point to the secondary
will prevent the need to update all references to the primary domain to the
secondary domain, like changing Git remotes and API URLs.

1. SSH into your **secondary** and login as root:

```bash
sudo -i
```
1. Update the primary domain's DNS record. After updating the primary domain's
DNS records to point to the secondary, edit `/etc/gitlab/gitlab.rb` on the
secondary to reflect the new URL:
```ruby
# Change the existing external_url configuration
external_url 'https://gitlab.example.com'
```
1. Verify you can connect to the newly promoted primary using the primary URL.
If you updated the DNS records for the primary domain, these changes may
not have yet propagated depending on the previous DNS records TTL.
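
After changing `external_url`, a quick way to apply the change and check from
a client machine whether the DNS update has propagated and the promoted
primary answers on the new URL (the hostname below is a placeholder):

```bash
# On the promoted node: apply the external_url change
sudo gitlab-ctl reconfigure

# From a client machine: check the DNS record and the sign-in page
dig +short gitlab.example.com
curl --head https://gitlab.example.com/users/sign_in
```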
### Step 5. (Optional) Add secondary Geo replicas to a promoted primary
To bring a new secondary online, follow the
[Geo setup instructions](../replication/index.md#setup-instructions).
## Promoting secondary Geo replica in multi-secondary configurations
CAUTION: **Caution:**
Disaster Recovery for multi-secondary configurations is in
**Alpha** development. Do not use this as your only Disaster Recovery
strategy as you may lose data.
Disaster Recovery does not yet support systems with multiple
secondary Geo replicas (e.g., one primary and two or more secondaries). We are
working on it, see [#4284](https://gitlab.com/gitlab-org/gitlab-ee/issues/4284)
for details.
The setup instructions for Geo prior to 10.5 failed to replicate the
`otp_key_base` secret, which is used to encrypt the two-factor authentication
secrets stored in the database. If it differs between primary and secondary
nodes, users with two-factor authentication enabled won't be able to log in
after a failover.
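
To check whether two nodes are affected, you can compare the secret on both;
on Omnibus installations it lives in `/etc/gitlab/gitlab-secrets.json`. This
check is an illustration, not a step quoted from this page:

```bash
# Run on both the primary and the secondary, then compare the output
sudo grep otp_key_base /etc/gitlab/gitlab-secrets.json
```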
If you still have access to the old primary node, you can follow the
instructions in the
# Disaster recovery for planned failover
A planned failover is similar to a disaster recovery scenario, except you are able
to notify users of the maintenance window, and allow data to finish replicating to
secondaries.
Please read this entire document as well as [Disaster Recovery](index.md)
before proceeding.
## Notify users of scheduled maintenance
On the primary, navigate to **Admin Area > Messages**, add a broadcast message.
You can check under **Admin Area > Geo Nodes** to estimate how long it will
take to finish syncing. An example message would be:
> A scheduled maintenance will take place at XX:XX UTC. We expect it to take
> less than 1 hour.
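
If you prefer to script this step, broadcast messages can also be created
through the REST API (`POST /broadcast_messages`); the token and hostname
below are placeholders:

```bash
# Requires an administrator's personal access token
curl --request POST \
  --header "PRIVATE-TOKEN: <admin-access-token>" \
  --data "message=A scheduled maintenance will take place at XX:XX UTC." \
  "https://primary.example.com/api/v4/broadcast_messages"
```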
On the secondary, you may need to clear the cache for the broadcast message
to show up.
## Block primary traffic
At the scheduled time, using your cloud provider or your node's firewall, block
HTTP and SSH traffic to/from the primary except for your IP and the secondary's
IP.
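
As one possible sketch of the firewall approach, using `iptables` with
placeholder addresses (`203.0.113.10` for your own IP and `203.0.113.20` for
the secondary):

```bash
# Allow your IP and the secondary's IP, then drop all other HTTP, HTTPS,
# and SSH traffic to the primary
sudo iptables -A INPUT -p tcp -s 203.0.113.10 -m multiport --dports 22,80,443 -j ACCEPT
sudo iptables -A INPUT -p tcp -s 203.0.113.20 -m multiport --dports 22,80,443 -j ACCEPT
sudo iptables -A INPUT -p tcp -m multiport --dports 22,80,443 -j DROP
```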
## Allow replication to finish as much as possible
1. On the secondary, navigate to **Admin Area > Geo Nodes** and wait until all
replication progress is 100% on the secondary "Current node".
1. Navigate to **Admin Area > Monitoring > Background Jobs > Queues** and wait
until the "geo" queues drop ideally to 0.
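
The same progress can also be checked from the command line on the secondary,
assuming an Omnibus installation:

```bash
# Prints replication status counters (repositories, LFS objects, attachments)
# for this secondary node
sudo gitlab-rake geo:status
```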
## Promote the secondary
Finally, follow the [Disaster Recovery docs](index.md) to promote the secondary
to a primary.
- Overcomes slow connections between distant offices, saving time by
  improving speed for distributed teams
- Helps reduce the loading time for automated tasks,
  custom integrations and internal workflows
- Quickly fail over to a Geo secondary in a
[Disaster Recovery](../disaster_recovery/index.md) scenario
- Allows [planned failover](../disaster_recovery/planned_failover.md) to a Geo secondary
## Architecture
The following diagram illustrates the underlying architecture of Geo
([source diagram](https://docs.google.com/drawings/d/1Abw0P_H0Ew1-2Lj_xPDRWP87clGIke-1fil7_KQqrtE/edit)).
![Geo architecture](img/geo_architecture.png)
In this diagram, there is one Geo primary node and one secondary. The
secondary clones repositories via git over HTTPS. Attachments, LFS objects, and
other files are downloaded via HTTPS using the GitLab API to authenticate,
This document was moved to [another location](../administration/geo/disaster_recovery/planned_failover.md).