@@ -5,27 +5,27 @@ info: To determine the technical writer assigned to the Stage/Group associated w
...
@@ -5,27 +5,27 @@ info: To determine the technical writer assigned to the Stage/Group associated w
type: howto
type: howto
---
---
# Bring a demoted primary node back online **(PREMIUM SELF)**
# Bring a demoted primary site back online **(PREMIUM SELF)**
After a failover, it is possible to fail back to the demoted **primary** node to
After a failover, it is possible to fail back to the demoted **primary** site to
restore your original configuration. This process consists of two steps:
restore your original configuration. This process consists of two steps:
1. Making the old **primary** node a **secondary** node.
1. Making the old **primary** site a **secondary** site.
1. Promoting a **secondary** node to a **primary** node.
1. Promoting a **secondary** site to a **primary** site.
WARNING:
WARNING:
If you have any doubts about the consistency of the data on this node, we recommend setting it up from scratch.
If you have any doubts about the consistency of the data on this site, we recommend setting it up from scratch.
## Configure the former **primary** node to be a **secondary** node
## Configure the former **primary** site to be a **secondary** site
Since the former **primary** node will be out of sync with the current **primary** node, the first step is to bring the former **primary** node up to date. Note, deletion of data stored on disk like
Since the former **primary** site will be out of sync with the current **primary** site, the first step is to bring the former **primary** site up to date. Note, deletion of data stored on disk like
repositories and uploads will not be replayed when bringing the former **primary** node back
repositories and uploads will not be replayed when bringing the former **primary** site back
into sync, which may result in increased disk usage.
into sync, which may result in increased disk usage.
Alternatively, you can [set up a new **secondary** GitLab instance](../setup/index.md) to avoid this.
Alternatively, you can [set up a new **secondary** GitLab instance](../setup/index.md) to avoid this.
To bring the former **primary** node up to date:
To bring the former **primary** site up to date:
1. SSH into the former **primary** node that has fallen behind.
1. SSH into the former **primary** site that has fallen behind.
1. Make sure all the services are up:
1. Make sure all the services are up:
```shell
```shell
...
@@ -33,36 +33,36 @@ To bring the former **primary** node up to date:
...
@@ -33,36 +33,36 @@ To bring the former **primary** node up to date:
```
```
NOTE:
NOTE:
If you [disabled the **primary** node permanently](index.md#step-2-permanently-disable-the-primary-node),
If you [disabled the **primary** site permanently](index.md#step-2-permanently-disable-the-primary-site),
you need to undo those steps now. For Debian/Ubuntu you just need to run
you need to undo those steps now. For Debian/Ubuntu you just need to run
`sudo systemctl enable gitlab-runsvdir`. For CentOS 6, you need to install
`sudo systemctl enable gitlab-runsvdir`. For CentOS 6, you need to install
the GitLab instance from scratch and set it up as a **secondary** node by
the GitLab instance from scratch and set it up as a **secondary** site by
following [Setup instructions](../setup/index.md). In this case, you don't need to follow the next step.
following [Setup instructions](../setup/index.md). In this case, you don't need to follow the next step.
NOTE:
NOTE:
If you [changed the DNS records](index.md#step-4-optional-updating-the-primary-domain-dns-record)
If you [changed the DNS records](index.md#step-4-optional-updating-the-primary-domain-dns-record)
for this node during disaster recovery procedure you may need to [block
for this site during disaster recovery procedure you may need to [block
all the writes to this node](planned_failover.md#prevent-updates-to-the-primary-node)
all the writes to this site](planned_failover.md#prevent-updates-to-the-primary-node)
during this procedure.
during this procedure.
1. [Set up database replication](../setup/database.md). In this case, the **secondary** node
1. [Set up database replication](../setup/database.md). In this case, the **secondary** site
refers to the former **primary** node.
refers to the former **primary** site.
1. If [PgBouncer](../../postgresql/pgbouncer.md) was enabled on the **current secondary** node
1. If [PgBouncer](../../postgresql/pgbouncer.md) was enabled on the **current secondary** site
(when it was a primary node) disable it by editing `/etc/gitlab/gitlab.rb`
(when it was a primary site) disable it by editing `/etc/gitlab/gitlab.rb`
and running `sudo gitlab-ctl reconfigure`.
and running `sudo gitlab-ctl reconfigure`.
1. You can then set up database replication on the **secondary** node.
1. You can then set up database replication on the **secondary** site.
If you have lost your original **primary** node, follow the
If you have lost your original **primary** site, follow the
[setup instructions](../setup/index.md) to set up a new **secondary** node.
[setup instructions](../setup/index.md) to set up a new **secondary** site.
## Promote the **secondary** node to **primary** node
## Promote the **secondary** site to **primary** site
When the initial replication is complete and the **primary** node and **secondary** node are
When the initial replication is complete and the **primary** site and **secondary** site are
closely in sync, you can do a [planned failover](planned_failover.md).
closely in sync, you can do a [planned failover](planned_failover.md).
## Restore the **secondary** node
## Restore the **secondary** site
If your objective is to have two nodes again, you need to bring your **secondary**
If your objective is to have two sites again, you need to bring your **secondary**
node back online as well by repeating the first step
site back online as well by repeating the first step
([configure the former **primary** node to be a **secondary** node](#configure-the-former-primary-node-to-be-a-secondary-node))
([configure the former **primary** site to be a **secondary** site](#configure-the-former-primary-site-to-be-a-secondary-site))
@@ -16,36 +16,36 @@ For the latest updates, check the [Disaster Recovery epic for complete maturity]
...
@@ -16,36 +16,36 @@ For the latest updates, check the [Disaster Recovery epic for complete maturity]
Multi-secondary configurations require the complete re-synchronization and re-configuration of all non-promoted secondaries and
Multi-secondary configurations require the complete re-synchronization and re-configuration of all non-promoted secondaries and
cause downtime.
cause downtime.
## Promoting a **secondary** Geo node in single-secondary configurations
## Promoting a **secondary** Geo site in single-secondary configurations
We don't currently provide an automated way to promote a Geo replica and do a
We don't currently provide an automated way to promote a Geo replica and do a
failover, but you can do it manually if you have `root` access to the machine.
failover, but you can do it manually if you have `root` access to the machine.
This process promotes a **secondary** Geo node to a **primary** node. To regain
This process promotes a **secondary** Geo site to a **primary** site. To regain
geographic redundancy as quickly as possible, you should add a new **secondary** node
geographic redundancy as quickly as possible, you should add a new **secondary** site
immediately after following these instructions.
immediately after following these instructions.
### Step 1. Allow replication to finish if possible
### Step 1. Allow replication to finish if possible
If the **secondary** node is still replicating data from the **primary** node, follow
If the **secondary** site is still replicating data from the **primary** site, follow
[the planned failover docs](planned_failover.md) as closely as possible in
[the planned failover docs](planned_failover.md) as closely as possible in
order to avoid unnecessary data loss.
order to avoid unnecessary data loss.
### Step 2. Permanently disable the **primary** node
### Step 2. Permanently disable the **primary** site
WARNING:
WARNING:
If the **primary** node goes offline, there may be data saved on the **primary** node
If the **primary** site goes offline, there may be data saved on the **primary** site
that have not been replicated to the **secondary** node. This data should be treated
that have not been replicated to the **secondary** site. This data should be treated
as lost if you proceed.
as lost if you proceed.
If an outage on the **primary** node happens, you should do everything possible to
If an outage on the **primary** site happens, you should do everything possible to
avoid a split-brain situation where writes can occur in two different GitLab
avoid a split-brain situation where writes can occur in two different GitLab
instances, complicating recovery efforts. So to prepare for the failover, we
instances, complicating recovery efforts. So to prepare for the failover, we
must disable the **primary** node.
must disable the **primary** site.
- If you have SSH access:
- If you have SSH access:
1. SSH into the **primary** node to stop and disable GitLab:
1. SSH into the **primary** site to stop and disable GitLab:
```shell
```shell
sudo gitlab-ctl stop
sudo gitlab-ctl stop
...
@@ -57,35 +57,35 @@ must disable the **primary** node.
...
@@ -57,35 +57,35 @@ must disable the **primary** node.
sudo systemctl disable gitlab-runsvdir
sudo systemctl disable gitlab-runsvdir
```
```
- If you do not have SSH access to the **primary** node, take the machine offline and
- If you do not have SSH access to the **primary** site, take the machine offline and
prevent it from rebooting by any means at your disposal.
prevent it from rebooting by any means at your disposal.
You might need to:
You might need to:
- Reconfigure the load balancers.
- Reconfigure the load balancers.
- Change DNS records (for example, point the primary DNS record to the
- Change DNS records (for example, point the primary DNS record to the
**secondary** node to stop usage of the **primary** node).
**secondary** site to stop usage of the **primary** site).
- Stop the virtual servers.
- Stop the virtual servers.
- Block traffic through a firewall.
- Block traffic through a firewall.
- Revoke object storage permissions from the **primary** node.
- Revoke object storage permissions from the **primary** site.
- Physically disconnect a machine.
- Physically disconnect a machine.
If you plan to [update the primary domain DNS record](#step-4-optional-updating-the-primary-domain-dns-record),
If you plan to [update the primary domain DNS record](#step-4-optional-updating-the-primary-domain-dns-record),
you may wish to lower the TTL now to speed up propagation.
you may wish to lower the TTL now to speed up propagation.
### Step 3. Promoting a **secondary** node
### Step 3. Promoting a **secondary** site
WARNING:
WARNING:
In GitLab 13.2 and 13.3, promoting a secondary node to a primary while the
In GitLab 13.2 and 13.3, promoting a secondary site to a primary while the
secondary is paused fails. Do not pause replication before promoting a
secondary is paused fails. Do not pause replication before promoting a
secondary. If the node is paused, be sure to resume before promoting.
secondary. If the secondary site is paused, be sure to resume before promoting.
This issue has been fixed in GitLab 13.4 and later.
This issue has been fixed in GitLab 13.4 and later.
Note the following when promoting a secondary:
Note the following when promoting a secondary:
- If replication was paused on the secondary node (for example as a part of
- If replication was paused on the secondary site (for example as a part of
upgrading, while you were running a version of GitLab earlier than 13.4), you
upgrading, while you were running a version of GitLab earlier than 13.4), you
_must_ [enable the node by using the database](../replication/troubleshooting.md#message-activerecordrecordinvalid-validation-failed-enabled-geo-primary-node-cannot-be-disabled)
_must_ [enable the site by using the database](../replication/troubleshooting.md#message-activerecordrecordinvalid-validation-failed-enabled-geo-primary-node-cannot-be-disabled)
before proceeding. If the secondary node
before proceeding. If the secondary site
[has been paused](../../geo/index.md#pausing-and-resuming-replication), the promotion
[has been paused](../../geo/index.md#pausing-and-resuming-replication), the promotion
performs a point-in-time recovery to the last known state.
performs a point-in-time recovery to the last known state.
Data that was created on the primary while the secondary was paused is lost.
Data that was created on the primary while the secondary was paused is lost.
...
@@ -99,7 +99,32 @@ Note the following when promoting a secondary:
...
@@ -99,7 +99,32 @@ Note the following when promoting a secondary:
@@ -352,12 +526,12 @@ secondary domain, like changing Git remotes and API URLs.
...
@@ -352,12 +526,12 @@ secondary domain, like changing Git remotes and API URLs.
If you updated the DNS records for the primary domain, these changes may
If you updated the DNS records for the primary domain, these changes may
not have yet propagated depending on the previous DNS records TTL.
not have yet propagated depending on the previous DNS records TTL.
### Step 5. (Optional) Add **secondary** Geo node to a promoted **primary** node
### Step 5. (Optional) Add **secondary** Geo site to a promoted **primary** site
Promoting a **secondary** node to **primary** node using the process above does not enable
Promoting a **secondary** site to **primary** site using the process above does not enable
Geo on the new **primary** node.
Geo on the new **primary** site.
To bring a new **secondary** node online, follow the [Geo setup instructions](../index.md#setup-instructions).
To bring a new **secondary** site online, follow the [Geo setup instructions](../index.md#setup-instructions).
### Step 6. (Optional) Removing the secondary's tracking database
### Step 6. (Optional) Removing the secondary's tracking database
...
@@ -376,13 +550,13 @@ for the changes to take effect.
...
@@ -376,13 +550,13 @@ for the changes to take effect.
## Promoting secondary Geo replica in multi-secondary configurations
## Promoting secondary Geo replica in multi-secondary configurations
If you have more than one **secondary** node and you need to promote one of them, we suggest you follow
If you have more than one **secondary** site and you need to promote one of them, we suggest you follow
[Promoting a **secondary** Geo node in single-secondary configurations](#promoting-a-secondary-geo-node-in-single-secondary-configurations)
[Promoting a **secondary** Geo site in single-secondary configurations](#promoting-a-secondary-geo-site-in-single-secondary-configurations)
and after that you also need two extra steps.
and after that you also need two extra steps.
### Step 1. Prepare the new **primary** node to serve one or more **secondary** nodes
### Step 1. Prepare the new **primary** site to serve one or more **secondary** sites
1. SSH into the new **primary** node and log in as root:
1. SSH into the new **primary** site and log in as root:
```shell
```shell
sudo -i
sudo -i
...
@@ -442,13 +616,13 @@ and after that you also need two extra steps.
...
@@ -442,13 +616,13 @@ and after that you also need two extra steps.
### Step 2. Initiate the replication process
### Step 2. Initiate the replication process
Now we need to make each **secondary** node listen to changes on the new **primary** node. To do that you need
Now we need to make each **secondary** site listen to changes on the new **primary** site. To do that you need
to [initiate the replication process](../setup/database.md#step-3-initiate-the-replication-process) again but this time
to [initiate the replication process](../setup/database.md#step-3-initiate-the-replication-process) again but this time
for another **primary** node. All the old replication settings are overwritten.
for another **primary** site. All the old replication settings are overwritten.
## Promoting a secondary Geo cluster in GitLab Cloud Native Helm Charts
## Promoting a secondary Geo cluster in GitLab Cloud Native Helm Charts
When updating a Cloud Native Geo deployment, the process for updating any node that is external to the secondary Kubernetes cluster does not differ from the non Cloud Native approach. As such, you can always defer to [Promoting a secondary Geo node in single-secondary configurations](#promoting-a-secondary-geo-node-in-single-secondary-configurations) for more information.
When updating a Cloud Native Geo deployment, the process for updating any node that is external to the secondary Kubernetes cluster does not differ from the non Cloud Native approach. As such, you can always defer to [Promoting a secondary Geo site in single-secondary configurations](#promoting-a-secondary-geo-site-in-single-secondary-configurations) for more information.
The following sections assume you are using the `gitlab` namespace. If you used a different namespace when setting up your cluster, you should also replace `--namespace gitlab` with your namespace.
The following sections assume you are using the `gitlab` namespace. If you used a different namespace when setting up your cluster, you should also replace `--namespace gitlab` with your namespace.
...
@@ -489,13 +663,45 @@ must disable the **primary** site:
...
@@ -489,13 +663,45 @@ must disable the **primary** site:
- Revoke object storage permissions from the **primary** site.
- Revoke object storage permissions from the **primary** site.
- Physically disconnect a machine.
- Physically disconnect a machine.
### Step 2. Promote all **secondary** nodes external to the cluster
### Step 2. Promote all **secondary** sites external to the cluster
WARNING:
WARNING:
If the secondary site [has been paused](../../geo/index.md#pausing-and-resuming-replication), this performs
If the secondary site [has been paused](../../geo/index.md#pausing-and-resuming-replication), this performs
a point-in-time recovery to the last known state.
a point-in-time recovery to the last known state.
Data that was created on the primary while the secondary was paused is lost.
Data that was created on the primary while the secondary was paused is lost.
If you are running GitLab 14.5 and later:
1. SSH to every Sidekiq, PostgreSQL, and Gitaly node in the **secondary** site and run one of the following commands:
- To promote the secondary node to primary:
```shell
sudo gitlab-ctl geo promote
```
- To promote the secondary node to primary **without any further confirmation**:
```shell
sudo gitlab-ctl geo promote --force
```
1. SSH into each Rails node on your **secondary** site and run one of the following commands:
- To promote the secondary node to primary:
```shell
sudo gitlab-ctl geo promote
```
- To promote the secondary node to primary **without any further confirmation**:
```shell
sudo gitlab-ctl geo promote --force
```
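The version split in this step can be sketched as a small Bash helper. This is only an illustration, not part of GitLab: the function name `promotion_path` and the `manual` marker are hypothetical, and the input is assumed to be a `MAJOR.MINOR.PATCH` version string. It prints the `gitlab-ctl geo promote` command for GitLab 14.5 and later, and a marker for earlier versions, where the manual per-node procedure applies.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with GitLab): given a GitLab version
# string in MAJOR.MINOR.PATCH form, print which promotion path applies.
promotion_path() {
  local version="$1" major minor
  major="${version%%.*}"   # text before the first dot
  minor="${version#*.}"    # drop "MAJOR."
  minor="${minor%%.*}"     # keep only MINOR
  if [ "$major" -gt 14 ] || { [ "$major" -eq 14 ] && [ "$minor" -ge 5 ]; }; then
    # GitLab 14.5 and later: single promote command on each node.
    echo "sudo gitlab-ctl geo promote"
  else
    # GitLab 14.4 and earlier: follow the manual per-node steps instead.
    echo "manual"
  fi
}
```

For example, `promotion_path 14.5.2` prints `sudo gitlab-ctl geo promote`, while `promotion_path 14.4.7` prints `manual`. The helper only prints a hint; it does not run anything on the node.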
If you are running GitLab 14.4 and earlier:
1. SSH in to the database node in the **secondary** and trigger PostgreSQL to
1. SSH in to the database node in the **secondary** and trigger PostgreSQL to
promote to read-write:
promote to read-write:
...
@@ -522,8 +728,6 @@ Data that was created on the primary while the secondary was paused is lost.
...
@@ -522,8 +728,6 @@ Data that was created on the primary while the secondary was paused is lost.
After making these changes, [reconfigure GitLab](../../restart_gitlab.md#omnibus-gitlab-reconfigure) on the database node.
After making these changes, [reconfigure GitLab](../../restart_gitlab.md#omnibus-gitlab-reconfigure) on the database node.
### Step 3. Promote the **secondary** cluster
1. Find the task runner pod:
1. Find the task runner pod:
```shell
```shell
...
@@ -536,6 +740,8 @@ Data that was created on the primary while the secondary was paused is lost.
...
@@ -536,6 +740,8 @@ Data that was created on the primary while the secondary was paused is lost.
@@ -66,13 +66,13 @@ promote a Geo replica and perform a failover.
...
@@ -66,13 +66,13 @@ promote a Geo replica and perform a failover.
NOTE:
NOTE:
GitLab 13.9 through GitLab 14.3 are affected by a bug in which the Geo secondary site statuses will appear to stop updating and become unhealthy. For more information, see [Geo Admin Area shows 'Unhealthy' after enabling Maintenance Mode](../../replication/troubleshooting.md#geo-admin-area-shows-unhealthy-after-enabling-maintenance-mode).
GitLab 13.9 through GitLab 14.3 are affected by a bug in which the Geo secondary site statuses will appear to stop updating and become unhealthy. For more information, see [Geo Admin Area shows 'Unhealthy' after enabling Maintenance Mode](../../replication/troubleshooting.md#geo-admin-area-shows-unhealthy-after-enabling-maintenance-mode).
On the **secondary** node:
On the **secondary** site:
1. On the top bar, select **Menu > Admin**.
1. On the top bar, select **Menu > Admin**.
1. On the left sidebar, select **Geo > Nodes** to see its status.
1. On the left sidebar, select **Geo > Nodes** to see its status.
Replicated objects (shown in green) should be close to 100%,
Replicated objects (shown in green) should be close to 100%,
and there should be no failures (shown in red). If a large proportion of
and there should be no failures (shown in red). If a large proportion of
objects aren't yet replicated (shown in gray), consider giving the node more
objects aren't yet replicated (shown in gray), consider giving the site more
@@ -54,10 +54,10 @@ promote a Geo replica and perform a failover.
...
@@ -54,10 +54,10 @@ promote a Geo replica and perform a failover.
NOTE:
NOTE:
GitLab 13.9 through GitLab 14.3 are affected by a bug in which the Geo secondary site statuses will appear to stop updating and become unhealthy. For more information, see [Geo Admin Area shows 'Unhealthy' after enabling Maintenance Mode](../../replication/troubleshooting.md#geo-admin-area-shows-unhealthy-after-enabling-maintenance-mode).
GitLab 13.9 through GitLab 14.3 are affected by a bug in which the Geo secondary site statuses will appear to stop updating and become unhealthy. For more information, see [Geo Admin Area shows 'Unhealthy' after enabling Maintenance Mode](../../replication/troubleshooting.md#geo-admin-area-shows-unhealthy-after-enabling-maintenance-mode).
On the **secondary** node, navigate to the **Admin Area > Geo** dashboard to
On the **secondary** site, navigate to the **Admin Area > Geo** dashboard to
review its status. Replicated objects (shown in green) should be close to 100%,
review its status. Replicated objects (shown in green) should be close to 100%,
and there should be no failures (shown in red). If a large proportion of
and there should be no failures (shown in red). If a large proportion of
objects aren't yet replicated (shown in gray), consider giving the node more
objects aren't yet replicated (shown in gray), consider giving the site more
@@ -683,7 +683,7 @@ when promoting a secondary to a primary node with strategies to resolve them.
...
@@ -683,7 +683,7 @@ when promoting a secondary to a primary node with strategies to resolve them.
### Message: ActiveRecord::RecordInvalid: Validation failed: Name has already been taken
### Message: ActiveRecord::RecordInvalid: Validation failed: Name has already been taken
When [promoting a **secondary** node](../disaster_recovery/index.md#step-3-promoting-a-secondary-node),
When [promoting a **secondary** site](../disaster_recovery/index.md#step-3-promoting-a-secondary-site),
you might encounter the following error:
you might encounter the following error:
```plaintext
```plaintext
...
@@ -751,7 +751,7 @@ This can be fixed in the database.
...
@@ -751,7 +751,7 @@ This can be fixed in the database.
### Message: ``NoMethodError: undefined method `secondary?' for nil:NilClass``
### Message: ``NoMethodError: undefined method `secondary?' for nil:NilClass``
When [promoting a **secondary** node](../disaster_recovery/index.md#step-3-promoting-a-secondary-node),
When [promoting a **secondary** site](../disaster_recovery/index.md#step-3-promoting-a-secondary-site),
you might encounter the following error:
you might encounter the following error:
```plaintext
```plaintext
...
@@ -767,13 +767,13 @@ Tasks: TOP => geo:set_secondary_as_primary
...
@@ -767,13 +767,13 @@ Tasks: TOP => geo:set_secondary_as_primary
(See full trace by running task with --trace)
(See full trace by running task with --trace)
```
```
This command is intended to be executed on a secondary node only, and this error
This command is intended to be executed on a secondary site only, and this error
is displayed if you attempt to run this command on a primary node.
is displayed if you attempt to run this command on a primary site.
### Message: `sudo: gitlab-pg-ctl: command not found`
### Message: `sudo: gitlab-pg-ctl: command not found`
When
When
[promoting a **secondary** node with multiple servers](../disaster_recovery/index.md#promoting-a-secondary-node-with-multiple-servers),
[promoting a **secondary** site with multiple nodes](../disaster_recovery/index.md#promoting-a-secondary-site-with-multiple-nodes-running-gitlab-144-and-earlier),
you need to run the `gitlab-pg-ctl` command to promote the PostgreSQL
you need to run the `gitlab-pg-ctl` command to promote the PostgreSQL