Commit b3196928 authored by Achilleas Pipinellis

Merge branch 'gy-further-ha-updates' into 'master'

HA Docs Object Storage Updates

See merge request gitlab-org/gitlab!18866
parents 1b47c8a6 279aa7e3
@@ -559,6 +559,9 @@ a few things that you need to do:
 including [incremental logging](../job_logs.md#new-incremental-logging-architecture).
 1. Configure [object storage for LFS objects](../lfs/lfs_administration.md#storing-lfs-objects-in-remote-object-storage).
 1. Configure [object storage for uploads](../uploads.md#using-object-storage-core-only).
+1. Configure [object storage for Merge Request Diffs](../merge_request_diffs.md#using-object-storage).
+1. Configure [object storage for Packages](../packages/index.md#using-object-storage) (Optional Feature).
+1. Configure [object storage for Dependency Proxy](../packages/dependency_proxy.md#using-object-storage) (Optional Feature).
 NOTE: **Note:**
 One current feature of GitLab that still requires a shared directory (NFS) is
......
@@ -38,14 +38,17 @@ The following components need to be considered for a scaled or highly-available
 environment. In many cases, components can be combined on the same nodes to reduce
 complexity.
-- Unicorn/Workhorse - Web-requests (UI, API, Git over HTTP)
+- GitLab application nodes (Unicorn / Puma, Workhorse) - Web-requests (UI, API, Git over HTTP)
 - Sidekiq - Asynchronous/Background jobs
 - PostgreSQL - Database
 - Consul - Database service discovery and health checks/failover
 - PgBouncer - Database pool manager
 - Redis - Key/Value store (User sessions, cache, queue for Sidekiq)
 - Sentinel - Redis health check/failover manager
-- Gitaly - Provides high-level RPC access to Git repositories
+- Gitaly - Provides high-level storage and RPC access to Git repositories
+- S3 Object Storage service[^3] and / or NFS storage servers[^4] for entities such as Uploads, Artifacts, LFS Objects, etc...
+- Load Balancer[^2] - Main entry point and handles load balancing for the GitLab application nodes.
+- Monitor - Prometheus and Grafana monitoring with auto discovery.
 ## Scalable Architecture Examples
@@ -67,8 +70,10 @@ larger one.
 - 1 PostgreSQL node
 - 1 Redis node
-- 1 NFS/Gitaly storage server
-- 2 or more GitLab application nodes (Unicorn, Workhorse, Sidekiq)
+- 1 Gitaly node
+- 1 or more Object Storage services[^3] and / or NFS storage server[^4]
+- 2 or more GitLab application nodes (Unicorn / Puma, Workhorse, Sidekiq)
+- 1 or more Load Balancer nodes[^2]
 - 1 Monitoring node (Prometheus, Grafana)
 #### Installation Instructions
@@ -79,8 +84,10 @@ you can continue with the next step.
 1. [PostgreSQL](database.md#postgresql-in-a-scaled-environment)
 1. [Redis](redis.md#redis-in-a-scaled-environment)
-1. [Gitaly](gitaly.md) (recommended) or [NFS](nfs.md)
+1. [Gitaly](gitaly.md) (recommended) and / or [NFS](nfs.md)[^4]
 1. [GitLab application nodes](gitlab.md)
+- With [Object Storage service enabled](../gitaly/index.md#eliminating-nfs-altogether)[^3]
+1. [Load Balancer](load_balancer.md)[^2]
 1. [Monitoring node (Prometheus and Grafana)](monitoring_node.md)
 ### Full Scaling
@@ -91,11 +98,13 @@ is split into separate Sidekiq and Unicorn/Workhorse nodes. One indication that
 this architecture is required is if Sidekiq queues begin to periodically increase
 in size, indicating that there is contention or there are not enough resources.
-- 1 PostgreSQL node
-- 1 Redis node
-- 2 or more NFS/Gitaly storage servers
+- 1 or more PostgreSQL node
+- 1 or more Redis node
+- 1 or more Gitaly storage servers
+- 1 or more Object Storage services[^3] and / or NFS storage server[^4]
 - 2 or more Sidekiq nodes
-- 2 or more GitLab application nodes (Unicorn, Workhorse)
+- 2 or more GitLab application nodes (Unicorn / Puma, Workhorse, Sidekiq)
+- 1 or more Load Balancer nodes[^2]
 - 1 Monitoring node (Prometheus, Grafana)
 ## High Availability Architecture Examples
@@ -114,10 +123,10 @@ This may lead to the other nodes believing a failure has occurred and initiating
 automated failover. Isolating Redis and Consul from the services they monitor
 reduces the chances of a false positive that a failure has occurred.
-The examples below do not really address high availability of NFS. Some enterprises
-have access to NFS appliances that manage availability. This is the best case
-scenario. In the future, GitLab may offer a more user-friendly solution to
-[GitLab HA Storage](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2472).
+The examples below do not address high availability of NFS for objects. We recommend a
+S3 Object Storage service[^3] is used where possible over NFS but it's still required in
+certain cases[^4]. Where NFS is to be used some enterprises have access to NFS appliances
+that manage availability and this would be best case scenario.
 There are many options in between each of these examples. Work with GitLab Support
 to understand the best starting point for your workload and adapt from there.
@@ -138,8 +147,10 @@ the contention.
 - 3 PostgreSQL nodes
 - 2 Redis nodes
 - 3 Consul/Sentinel nodes
-- 2 or more GitLab application nodes (Unicorn, Workhorse, Sidekiq, PgBouncer)
-- 1 NFS/Gitaly server
+- 2 or more GitLab application nodes (Unicorn / Puma, Workhorse, Sidekiq)
+- 1 Gitaly storage servers
+- 1 Object Storage service[^3] and / or NFS storage server[^4]
+- 1 or more Load Balancer nodes[^2]
 - 1 Monitoring node (Prometheus, Grafana)
 ![Horizontal architecture diagram](img/horizontal.png)
@@ -156,8 +167,10 @@ contention due to certain workloads.
 - 2 Redis nodes
 - 3 Consul/Sentinel nodes
 - 2 or more Sidekiq nodes
-- 2 or more GitLab application nodes (Unicorn, Workhorse)
-- 1 or more NFS/Gitaly servers
+- 2 or more GitLab application nodes (Unicorn / Puma, Workhorse, Sidekiq)
+- 1 Gitaly storage servers
+- 1 Object Storage service[^3] and / or NFS storage server[^4]
+- 1 or more Load Balancer nodes[^2]
 - 1 Monitoring node (Prometheus, Grafana)
 ![Hybrid architecture diagram](img/hybrid.png)
@@ -177,45 +190,40 @@ with the added complexity of many more nodes to configure, manage, and monitor.
 - 2 or more Git nodes (Git over SSH/Git over HTTP)
 - 2 or more API nodes (All requests to `/api`)
 - 2 or more Web nodes (All other web requests)
-- 2 or more NFS/Gitaly servers
+- 2 or more Gitaly storage servers
+- 1 or more Object Storage services[^3] and / or NFS storage servers[^4]
+- 1 or more Load Balancer nodes[^2]
 - 1 Monitoring node (Prometheus, Grafana)
 ![Fully Distributed architecture diagram](img/fully-distributed.png)
-The following pages outline the steps necessary to configure each component
-separately:
-1. [Configure the database](database.md)
-1. [Configure Redis](redis.md)
-1. [Configure Redis for GitLab source installations](redis_source.md)
-1. [Configure NFS](nfs.md)
-1. [NFS Client and Host setup](nfs_host_client_setup.md)
-1. [Configure the GitLab application servers](gitlab.md)
-1. [Configure the load balancers](load_balancer.md)
-1. [Monitoring node (Prometheus and Grafana)](monitoring_node.md)
-## Reference Architecture Examples
-These reference architecture examples rely on the general rule that approximately 2 requests per second (RPS) of load is generated for every 100 users.
-The specifications here were performance tested against a specific coded
-workload. Your exact needs may be more, depending on your workload. Your
+## Reference Architecture Examples
+The Support and Quality teams build, performance test, and validate Reference
+Architectures that support set large numbers of users. The specifications below are a
+representation of this work so far and may be adjusted in the future based on
+additional testing and iteration.
+The architectures have been tested with specific coded workloads. The throughputs
+used for testing are calculated based on sample customer data. We test each endpoint
+type with the following number of requests per second (RPS) per 1000 users:
+- API: 20 RPS
+- Web: 2 RPS
+- Git: 2 RPS
+Note that your exact needs may be more, depending on your workload. Your
 workload is influenced by factors such as - but not limited to - how active your
 users are, how much automation you use, mirroring, and repo/change size.
 ### 10,000 User Configuration
 - **Supported Users (approximate):** 10,000
-- **RPS:** 200 requests per second
+- **Test RPS Rates:** API: 200 RPS, Web: 20 RPS, Git: 20 RPS
 - **Known Issues:** While validating the reference architecture, slow API endpoints
 were discovered. For details, see the related issues list in
 [this issue](https://gitlab.com/gitlab-org/gitlab-foss/issues/64335).
-The Support and Quality teams built, performance tested, and validated an
-environment that supports about 10,000 users. The specifications below are a
-representation of the work so far. The specifications may be adjusted in the
-future based on additional testing and iteration.
 | Service | Configuration | GCP type |
 | ------------------------------|-------------------------|----------------|
 | 3 GitLab Rails <br> - Puma workers on each node set to 90% of available CPUs with 16 threads | 32 vCPU, 28.8GB Memory | n1-highcpu-32 |
@@ -226,30 +234,23 @@ future based on additional testing and iteration.
 | 3 Redis Persistent + Sentinel | 4 vCPU, 15GB Memory | n1-standard-4 |
 | 4 Sidekiq | 4 vCPU, 15GB Memory | n1-standard-4 |
 | 3 Consul | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
-| 1 NFS Server | 16 vCPU, 14.4GB Memory | n1-highcpu-16 |
+| 1 NFS Server | 4 CPU, 3.6GB Memory | n1-highcpu-4 |
+| X S3 Object Storage[^3] | - | - |
 | 1 Monitoring node | 4 CPU, 3.6GB Memory | n1-highcpu-4 |
-| 1 Load Balancing node[^2] . | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+| 1 Load Balancing node[^2] | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+NOTE: **Note:** Memory values are given directly by GCP machine sizes. On different cloud
+vendors a best effort like for like can be used.
 ### 25,000 User Configuration
 - **Supported Users (approximate):** 25,000
-- **RPS:** 500 requests per second
+- **Test RPS Rates:** API: 500 RPS, Web: 50 RPS, Git: 50 RPS
 - **Known Issues:** The slow API endpoints that were discovered during testing
 the 10,000 user architecture also affect the 25,000 user architecture. For
 details, see the related issues list in
 [this issue](https://gitlab.com/gitlab-org/gitlab-foss/issues/64335).
-The GitLab Support and Quality teams built, performance tested, and validated an
-environment that supports around 25,000 users. The specifications below are a
-representation of the work so far. The specifications may be adjusted in the
-future based on additional testing and iteration.
-NOTE: **Note:** The specifications here were performance tested against a
-specific coded workload. Your exact needs may be more, depending on your
-workload. Your workload is influenced by factors such as - but not limited to -
-how active your users are, how much automation you use, mirroring, and
-repo/change size.
 | Service | Configuration | GCP type |
 | ------------------------------|-------------------------|----------------|
 | 7 GitLab Rails <br> - Puma workers on each node set to 90% of available CPUs with 16 threads | 32 vCPU, 28.8GB Memory | n1-highcpu-32 |
@@ -260,23 +261,24 @@ repo/change size.
 | 3 Redis Persistent + Sentinel | 4 vCPU, 15GB Memory | n1-standard-4 |
 | 4 Sidekiq | 4 vCPU, 15GB Memory | n1-standard-4 |
 | 3 Consul | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
-| 1 NFS Server | 16 vCPU, 14.4GB Memory | n1-highcpu-16 |
+| 1 NFS Server | 4 CPU, 3.6GB Memory | n1-highcpu-4 |
+| X S3 Object Storage[^4] | - | - |
 | 1 Monitoring node | 4 CPU, 3.6GB Memory | n1-highcpu-4 |
-| 1 Load Balancing node[^2] . | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+| 1 Load Balancing node[^2] | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+NOTE: **Note:** Memory values are given directly by GCP machine sizes. On different cloud
+vendors a best effort like for like can be used.
 ### 50,000 User Configuration
 - **Supported Users (approximate):** 50,000
-- **RPS:** 1,000 requests per second
+- **Test RPS Rates:** API: 1000 RPS, Web: 100 RPS, Git: 100 RPS
 - **Status:** Work-in-progress
 - **Related Issue:** See the [related issue](https://gitlab.com/gitlab-org/quality/performance/issues/66) for more information.
-The Support and Quality teams are in the process of building and performance
-testing an environment that will support around 50,000 users. The specifications
-below are a very rough work-in-progress representation of the work so far. The
-Quality team will be certifying this environment in late 2019. The
-specifications may be adjusted prior to certification based on performance
-testing.
+NOTE: **Note:** This architecture is a work-in-progress of the work so far. The
+Quality team will be certifying this environment in late 2019. The specifications
+may be adjusted prior to certification based on performance testing.
 | Service | Configuration | GCP type |
 | ------------------------------|-------------------------|----------------|
@@ -288,9 +290,13 @@ testing.
 | 3 Redis Persistent + Sentinel | 4 vCPU, 15GB Memory | n1-standard-4 |
 | 4 Sidekiq | 4 vCPU, 15GB Memory | n1-standard-4 |
 | 3 Consul | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
-| 1 NFS Server | 16 vCPU, 14.4GB Memory | n1-highcpu-16 |
+| 1 NFS Server | 4 CPU, 3.6GB Memory | n1-highcpu-4 |
+| X S3 Object Storage[^3] | - | - |
 | 1 Monitoring node | 4 CPU, 3.6GB Memory | n1-highcpu-4 |
-| 1 Load Balancing node[^2] . | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+| 1 Load Balancing node[^2] | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+NOTE: **Note:** Memory values are given directly by GCP machine sizes. On different cloud
+vendors a best effort like for like can be used.
 [^1]: Gitaly node requirements are dependent on customer data. We recommend 2
 nodes as an absolute minimum for performance at the 10,000 and 25,000 user
@@ -298,5 +304,19 @@ testing.
 additional nodes should be considered in conjunction with a review of
 project counts and sizes.
-[^2]: HAProxy is the only tested and recommended load balancer. Additional
-options may be supported in the future.
+[^2]: Our architectures have been tested and validated with [HAProxy](https://www.haproxy.org/)
+as the load balancer. However other reputable load balancers with similar feature sets
+should also work here but be aware these aren't validated.
+[^3]: For data objects such as LFS, Uploads, Artifacts, etc... We recommend a S3 Object Storage
+where possible over NFS due to better performance and availability. Several types of objects
+are supported for S3 storage - [Job artifacts](../job_artifacts.md#using-object-storage),
+[LFS](../lfs/lfs_administration.md#storing-lfs-objects-in-remote-object-storage),
+[Uploads](../uploads.md#using-object-storage-core-only),
+[Merge Request Diffs](../merge_request_diffs.md#using-object-storage),
+[Packages](../packages/index.md#using-object-storage) (Optional Feature),
+[Dependency Proxy](../packages/dependency_proxy.md#using-object-storage) (Optional Feature).
+[^4]: NFS storage server is still required for [GitLab Pages](https://gitlab.com/gitlab-org/gitlab-pages/issues/196)
+and optionally for CI Job Incremental Logging
+([can be switched to use Redis instead](https://docs.gitlab.com/ee/administration/job_logs.html#new-incremental-logging-architecture)).
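
Reviewer note: the "Test RPS Rates" quoted for the 10,000, 25,000 and 50,000 user configurations in this change follow from the per-1000-user rates it introduces (API: 20 RPS, Web: 2 RPS, Git: 2 RPS). The sketch below reproduces that arithmetic. It assumes the rates scale linearly with user count, which is inferred from the three documented tiers rather than stated as a rule. The names `RPS_PER_1000_USERS` and `target_rps` are hypothetical and not part of GitLab.

```python
# Per-1000-user test rates as quoted in the changed documentation.
RPS_PER_1000_USERS = {"api": 20, "web": 2, "git": 2}


def target_rps(users: int) -> dict:
    """Scale the per-1000-user rates to a user count.

    Assumption: linear scaling, inferred from the documented tiers.
    """
    if users < 0:
        raise ValueError("users must be non-negative")
    return {
        endpoint: rate * users / 1000
        for endpoint, rate in RPS_PER_1000_USERS.items()
    }


if __name__ == "__main__":
    # Print the test rates for the three tiers named in the document.
    for tier in (10_000, 25_000, 50_000):
        print(tier, target_rps(tier))
```

These are the throughputs the architectures were tested at, not a sizing guarantee. As the changed text notes, real needs may be higher depending on workload.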