Refactor the Consul docs

Update the Consul docs and prepare them to be used in the reference architectures docs.

Commit a8292c2c authored May 12, 2020 by Achilleas Pipinellis
2 changed files
--- a/doc/administration/high_availability/consul.md
+++ b/doc/administration/high_availability/consul.md
@@ -2,134 +2,159 @@
 type: reference
 ---

-# Working with the bundled Consul service **(PREMIUM ONLY)**
+# How to set up Consul **(PREMIUM ONLY)**

-As part of its High Availability stack, GitLab Premium includes a bundled version of [Consul](https://www.consul.io/) that can be managed through `/etc/gitlab/gitlab.rb`. Consul is a service networking solution. When it comes to [GitLab Architecture](../../development/architecture.md), Consul utilization is supported for configuring:
+GitLab Premium includes a bundled version of [Consul](https://www.consul.io/),
+a service networking solution that you can manage by using `/etc/gitlab/gitlab.rb`.

-1. [Monitoring in Scaled and Highly Available environments](monitoring_node.md)
-1. [PostgreSQL High Availability with Omnibus](../postgresql/replication_and_failover.md)
+A Consul cluster consists of both
+[server and client agents](https://www.consul.io/docs/agent).
+The servers run on their own nodes and the clients run on other nodes that, in turn, communicate with
+the servers.

-A Consul cluster consists of multiple server agents, as well as client agents that run on other nodes which need to talk to the Consul cluster.
+## Configure the Consul nodes

-## Prerequisites
+> - `consul_role` was introduced in GitLab 10.3.

-First, make sure to [download/install](https://about.gitlab.com/install/)
-Omnibus GitLab **on each node**.
+NOTE: **Important:**
+Before proceeding, refer to the
+[available reference architectures](../reference_architectures/index.md#available-reference-architectures)
+to find out how many Consul server nodes you should have.

-Choose an installation method, then make sure you complete steps:
+On **each** Consul server node perform the following:

-1. Install and configure the necessary dependencies.
-1. Add the GitLab package repository and install the package.
-
-When installing the GitLab package, do not supply `EXTERNAL_URL` value.
-
-## Configuring the Consul nodes
-
-On each Consul node perform the following:
-
-1. Make sure you collect [`CONSUL_SERVER_NODES`](../postgresql/replication_and_failover.md#consul-information), which are the IP addresses or DNS records of the Consul server nodes, for the next step, before executing the next step.
-
-1. Edit `/etc/gitlab/gitlab.rb` replacing values noted in the `# START user configuration` section:
+1. Follow the instructions to [install](https://about.gitlab.com/install/)
+   GitLab by choosing your preferred platform, but do not supply the
+   `EXTERNAL_URL` value when asked.
+1. Edit `/etc/gitlab/gitlab.rb`, and add the following, replacing the values
+   noted in the `retry_join` section. In the example below, there are three
+   nodes: two denoted with their IP addresses and one with its FQDN. You can use
+   either notation:

   ```ruby
   # Disable all components except Consul
   roles ['consul_role']

-   # START user configuration
-   # Replace placeholders:
-   #
-   # Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z
-   # with the addresses gathered for CONSUL_SERVER_NODES
+   # Consul nodes: can be FQDN or IP, separated by whitespace
   consul['configuration'] = {
     server: true,
-     retry_join: %w(Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z)
+     retry_join: %w(10.10.10.1 consul1.gitlab.example.com 10.10.10.2)
   }

   # Disable auto migrations
   gitlab_rails['auto_migrate'] = false
-   #
-   # END user configuration
   ```

-   > `consul_role` was introduced with GitLab 10.3
-
 1. [Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes
   to take effect.
+1. Run the following command to verify that Consul is configured correctly and
+   that all server nodes are communicating:

-### Consul checkpoint
+   ```shell
+   sudo /opt/gitlab/embedded/bin/consul members
+   ```

-Before moving on, make sure Consul is configured correctly. Run the following
-command to verify all server nodes are communicating:
+   The output should be similar to:

-```shell
-/opt/gitlab/embedded/bin/consul members
-```
+   ```plaintext
+   Node                 Address               Status  Type    Build  Protocol  DC
+   CONSUL_NODE_ONE      XXX.XXX.XXX.YYY:8301  alive   server  0.9.2  2         gitlab_consul
+   CONSUL_NODE_TWO      XXX.XXX.XXX.YYY:8301  alive   server  0.9.2  2         gitlab_consul
+   CONSUL_NODE_THREE    XXX.XXX.XXX.YYY:8301  alive   server  0.9.2  2         gitlab_consul
+   ```

-The output should be similar to:
+   If the results display any nodes with a status that isn't `alive`, or if any
+   of the three nodes are missing, see the [Troubleshooting section](#troubleshooting-consul).

-```plaintext
-Node                 Address               Status  Type    Build  Protocol  DC
-CONSUL_NODE_ONE      XXX.XXX.XXX.YYY:8301  alive   server  0.9.2  2         gitlab_consul
-CONSUL_NODE_TWO      XXX.XXX.XXX.YYY:8301  alive   server  0.9.2  2         gitlab_consul
-CONSUL_NODE_THREE    XXX.XXX.XXX.YYY:8301  alive   server  0.9.2  2         gitlab_consul
-```
+## Upgrade the Consul nodes

-If any of the nodes isn't `alive` or if any of the three nodes are missing,
-check the [Troubleshooting section](#troubleshooting) before proceeding.
+To upgrade your Consul nodes, upgrade the GitLab package.

-## Operations
+Nodes should be:

-### Checking cluster membership
+- Members of a healthy cluster prior to upgrading the Omnibus GitLab package.
+- Upgraded one node at a time.

-To see which nodes are part of the cluster, run the following on any member in the cluster
+Identify any existing health issues in the cluster by running the following command
+on each node. The command will return an empty array if the cluster is healthy:

 ```shell
-$ /opt/gitlab/embedded/bin/consul members
-Node            Address               Status  Type    Build  Protocol  DC
-consul-b        XX.XX.X.Y:8301        alive   server  0.9.0  2         gitlab_consul
-consul-c        XX.XX.X.Y:8301        alive   server  0.9.0  2         gitlab_consul
-consul-c        XX.XX.X.Y:8301        alive   server  0.9.0  2         gitlab_consul
-db-a            XX.XX.X.Y:8301        alive   client  0.9.0  2         gitlab_consul
-db-b            XX.XX.X.Y:8301        alive   client  0.9.0  2         gitlab_consul
+curl http://127.0.0.1:8500/v1/health/state/critical
 ```

-Ideally all nodes will have a `Status` of `alive`.
+Consul nodes communicate using the Raft protocol. If the current leader goes
+offline, there needs to be a leader election. A leader node must exist to facilitate
+synchronization across the cluster. If too many nodes go offline at the same time,
+the cluster will lose quorum and not elect a leader due to
+[broken consensus](https://www.consul.io/docs/internals/consensus.html).

-### Restarting the server cluster
+Consult the [troubleshooting section](#troubleshooting-consul) if the cluster is not
+able to recover after the upgrade. The [outage recovery](#outage-recovery) section may
+be of particular interest.

 NOTE: **Note:**
-This section only applies to server agents. It is safe to restart client agents whenever needed.
+GitLab uses Consul to store only transient data that is easily regenerated. If
+the bundled Consul was not used by any process other than GitLab itself, then
+[rebuilding the cluster from scratch](#recreate-from-scratch) is fine.

-If it is necessary to restart the server cluster, it is important to do this in a controlled fashion in order to maintain quorum. If quorum is lost, you will need to follow the Consul [outage recovery](#outage-recovery) process to recover the cluster.
+## Troubleshooting Consul

-To be safe, we recommend you only restart one server agent at a time to ensure the cluster remains intact.
+Below are some useful operations should you need to debug any issues.
+You can see any error logs by running:

-For larger clusters, it is possible to restart multiple agents at a time. See the [Consul consensus document](https://www.consul.io/docs/internals/consensus.html#deployment-table) for how many failures it can tolerate. This will be the number of simultaneous restarts it can sustain.
+```shell
+sudo gitlab-ctl tail consul
+```

-## Upgrades for bundled Consul
+### Check the cluster membership

-Nodes running GitLab-bundled Consul should be:
+To determine which nodes are part of the cluster, run the following on any member in the cluster:

- Members of a healthy cluster prior to upgrading the Omnibus GitLab package.
- Upgraded one node at a time.
+```shell
+sudo /opt/gitlab/embedded/bin/consul members
+```

-NOTE: **Note:**
-Running `curl http://127.0.0.1:8500/v1/health/state/critical` from any Consul node will identify existing health issues in the cluster. The command will return an empty array if the cluster is healthy.
+The output should be similar to:
+
+```plaintext
+Node            Address               Status  Type    Build  Protocol  DC
+consul-b        XX.XX.X.Y:8301        alive   server  0.9.0  2         gitlab_consul
+consul-c        XX.XX.X.Y:8301        alive   server  0.9.0  2         gitlab_consul
+consul-c        XX.XX.X.Y:8301        alive   server  0.9.0  2         gitlab_consul
+db-a            XX.XX.X.Y:8301        alive   client  0.9.0  2         gitlab_consul
+db-b            XX.XX.X.Y:8301        alive   client  0.9.0  2         gitlab_consul
+```
+
+Ideally all nodes will have a `Status` of `alive`.

-Consul clusters communicate using the raft protocol. If the current leader goes offline, there needs to be a leader election. A leader node must exist to facilitate synchronization across the cluster. If too many nodes go offline at the same time, the cluster will lose quorum and not elect a leader due to [broken consensus](https://www.consul.io/docs/internals/consensus.html).
+### Restart Consul

-Consult the [troubleshooting section](#troubleshooting) if the cluster is not able to recover after the upgrade. The [outage recovery](#outage-recovery) may be of particular interest.
+If it is necessary to restart Consul, it is important to do this in
+a controlled manner to maintain quorum. If quorum is lost, to recover the cluster,
+you will need to follow the Consul [outage recovery](#outage-recovery) process.

-NOTE: **Note:**
-GitLab only uses Consul to store transient data that is easily regenerated. If the bundled Consul was not used by any process other than GitLab itself, then [rebuilding the cluster from scratch](#recreate-from-scratch) is fine.
+To be safe, it's recommended that you only restart Consul on one node at a time to
+ensure the cluster remains intact. For larger clusters, it is possible to restart
+multiple nodes at a time. See the
+[Consul consensus document](https://www.consul.io/docs/internals/consensus.html#deployment-table)
+for how many failures the cluster can tolerate. This will be the number of
+simultaneous restarts it can sustain.
+
+To restart Consul:

-## Troubleshooting
+```shell
+sudo gitlab-ctl restart consul
+```

-### Consul server agents unable to communicate
+### Consul nodes unable to communicate

-By default, the server agents will attempt to [bind](https://www.consul.io/docs/agent/options.html#_bind) to '0.0.0.0', but they will advertise the first private IP address on the node for other agents to communicate with them. If the other nodes cannot communicate with a node on this address, then the cluster will have a failed status.
+By default, Consul will attempt to
+[bind](https://www.consul.io/docs/agent/options.html#_bind) to `0.0.0.0`, but
+it will advertise the first private IP address on the node for other Consul nodes
+to communicate with it. If the other nodes cannot communicate with a node on
+this address, then the cluster will have a failed status.

-You will see messages like the following in `gitlab-ctl tail consul` output if you are running into this issue:
+If you are running into this issue, you will see messages like the following in `gitlab-ctl tail consul` output:

 ```plaintext
 2017-09-25_19:53:39.90821     2017/09/25 19:53:39 [WARN] raft: no known peers, aborting election
@@ -148,15 +173,21 @@ To fix this:
   }
   ```

-1. Run `gitlab-ctl reconfigure`
+1. Reconfigure GitLab:
+
+   ```shell
+   gitlab-ctl reconfigure
+   ```

-If you still see the errors, you may have to [erase the Consul database and reinitialize](#recreate-from-scratch) on the affected node.
+If you still see the errors, you may have to
+[erase the Consul database and reinitialize](#recreate-from-scratch) on the affected node.

-### Consul agents do not start - Multiple private IPs
+### Consul does not start - multiple private IPs

-In the case that a node has multiple private IPs the agent be confused as to which of the private addresses to advertise, and then immediately exit on start.
+If a node has multiple private IPs, Consul cannot determine which of the
+private addresses to advertise, and it exits immediately on start.

-You will see messages like the following in `gitlab-ctl tail consul` output if you are running into this issue:
+You will see messages like the following in `gitlab-ctl tail consul` output:

 ```plaintext
 2017-11-09_17:41:45.52876 ==> Starting Consul agent...
@@ -175,24 +206,36 @@ To fix this:
   }
   ```

-1. Run `gitlab-ctl reconfigure`
+1. Reconfigure GitLab:
+
+   ```shell
+   gitlab-ctl reconfigure
+   ```

 ### Outage recovery

-If you lost enough server agents in the cluster to break quorum, then the cluster is considered failed, and it will not function without manual intervention.
+If you lost enough Consul nodes in the cluster to break quorum, then the cluster
+is considered failed, and it will not function without manual intervention.
+In that case, you can either recreate the nodes from scratch or attempt a
+recovery.

 #### Recreate from scratch

-By default, GitLab does not store anything in the Consul cluster that cannot be recreated. To erase the Consul database and reinitialize
+By default, GitLab does not store anything in the Consul node that cannot be
+recreated. To erase the Consul database and reinitialize:

 ```shell
-gitlab-ctl stop consul
-rm -rf /var/opt/gitlab/consul/data
-gitlab-ctl start consul
+sudo gitlab-ctl stop consul
+sudo rm -rf /var/opt/gitlab/consul/data
+sudo gitlab-ctl start consul
 ```

-After this, the cluster should start back up, and the server agents rejoin. Shortly after that, the client agents should rejoin as well.
+After this, the node should start back up, and the rest of the server agents should rejoin.
+Shortly after that, the client agents should rejoin as well.

-#### Recover a failed cluster
+#### Recover a failed node

-If you have taken advantage of Consul to store other data, and want to restore the failed cluster, please follow the [Consul guide](https://learn.hashicorp.com/consul/day-2-operations/outage) to recover a failed cluster.
+If you have taken advantage of Consul to store other data and want to restore
+the failed node, follow the
+[Consul guide](https://learn.hashicorp.com/consul/day-2-operations/outage)
+to recover a failed cluster.
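
The pre-upgrade health check described in the new "Upgrade the Consul nodes" section could be scripted roughly as follows. This is only a sketch: `is_cluster_healthy` is a hypothetical helper name, not part of GitLab or Consul, and it relies solely on the documented behavior that the `/v1/health/state/critical` endpoint returns an empty array when the cluster is healthy.

```shell
# Hypothetical helper: succeeds (exit status 0) when the JSON body passed as
# the first argument is an empty array, meaning no checks are critical.
is_cluster_healthy() {
  # Strip all whitespace so "[]", "[ ]" and "[]" plus a newline compare equal.
  body=$(printf '%s' "$1" | tr -d '[:space:]')
  [ "$body" = "[]" ]
}

# Usage on a Consul node (assumes the local agent answers on 127.0.0.1:8500,
# as in the curl command from the documentation):
#   if is_cluster_healthy "$(curl --silent http://127.0.0.1:8500/v1/health/state/critical)"; then
#     echo "No critical checks reported; proceed with upgrading this node."
#   else
#     echo "Critical checks reported; fix them before upgrading." >&2
#   fi
```

Keeping the comparison in a function that takes the response body as an argument makes the check usable without a live agent, for example against saved output.
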
--- a/doc/administration/postgresql/replication_and_failover.md
+++ b/doc/administration/postgresql/replication_and_failover.md
@@ -1087,7 +1087,7 @@ To restart either service, run `gitlab-ctl restart SERVICE`

 For PostgreSQL, it is usually safe to restart the master node by default. Automatic failover defaults to a 1 minute timeout. Provided the database returns before then, nothing else needs to be done. To be safe, you can stop `repmgrd` on the standby nodes first with `gitlab-ctl stop repmgrd`, then start afterwards with `gitlab-ctl start repmgrd`.

-On the Consul server nodes, it is important to restart the Consul service in a controlled fashion. Read our [Consul documentation](../high_availability/consul.md#restarting-the-server-cluster) for instructions on how to restart the service.
+On the Consul server nodes, it is important to [restart the Consul service](../high_availability/consul.md#restart-consul) in a controlled manner.

 ### `gitlab-ctl repmgr-check-master` command produces errors

@@ -1136,7 +1136,7 @@ postgresql['trust_auth_cidr_addresses'] = %w(123.123.123.123/32 <other_cidrs>)

 If you're running into an issue with a component not outlined here, be sure to check the troubleshooting section of their specific documentation page.

- [Consul](../high_availability/consul.md#troubleshooting)
+- [Consul](../high_availability/consul.md#troubleshooting-consul)
 - [PostgreSQL](https://docs.gitlab.com/omnibus/settings/database.html#troubleshooting)
 - [GitLab application](../high_availability/gitlab.md#troubleshooting)
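
To make "in a controlled manner" concrete, the number of Consul server nodes that can be restarted at the same time can be estimated from the majority-quorum rule in the Consul consensus document linked from `consul.md`. The sketch below assumes that rule (quorum is a majority of the server nodes); `restart_budget` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: prints how many server nodes can be offline at once
# without losing quorum, given the total number of server nodes.
restart_budget() {
  servers=$1
  quorum=$(( servers / 2 + 1 ))   # a majority of the server nodes
  echo $(( servers - quorum ))
}

restart_budget 3   # prints 1
restart_budget 5   # prints 2
```

With three server nodes the budget is one, which matches the recommendation to restart Consul on one node at a time.
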