Commit 7641233f authored by Achilleas Pipinellis's avatar Achilleas Pipinellis

Merge branch 'DarwinJS-update-aws-implementation-patterns' into 'master'

Repurpose Gitaly SRE Subpage to GitLab AWS SRE Subpage

See merge request gitlab-org/gitlab!69250
parents fe12ce19 b79ae243
......@@ -148,11 +148,11 @@ If EKS node autoscaling is employed, it is likely that your average loading will
| **Bastion Host (Quick Start)** | 1 HA instance in ASG | **t2.micro** for prod, **m4.2xlarge** for perf. testing | | |
| **PostgreSQL**<br />AWS Aurora RDS Nodes Configuration (GPT tested) | 2vCPU, 7.5 GB<br />Tested with Graviton ARM | **db.r6g.large** x 3 nodes <br />(6vCPU, 48 GB) | 3 nodes x $0.26 = $0.78/hr | 3 nodes x $0.26 = $0.78/hr<br />(Aurora is always 3) |
| **Redis** | 1vCPU, 3.75GB<br />(across 12 nodes for Redis Cache, Redis Queues/Shared State, Sentinel Cache, Sentinel Queues/Shared State) | **cache.m6g.large** x 3 nodes<br />(6vCPU, 19GB) | 3 nodes x $0.15 = $0.45/hr | 2 nodes x $0.15 = $0.30/hr |
| **<u>Gitaly Cluster</u>** [Details](gitaly_on_aws.md) | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitaly_on_aws.md#gitaly-and-praefect-elections) | | | |
| **<u>Gitaly Cluster</u>** [Details](gitlab_sre_for_aws.md#gitaly-sre-considerations) | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitlab_sre_for_aws.md#gitaly-and-praefect-elections) | | | |
| Gitaly Instances (in ASG) | 12 vCPU, 45GB<br />(across 3 nodes) | **m5.xlarge** x 3 nodes<br />(48 vCPU, 180 GB) | $0.192 x 3 = $0.58/hr | $0.192 x 3 = $0.58/hr |
| | The GitLab Reference architecture for 2K is not Highly Available and therefore has a single Gitaly no Praefect. AWS Quick Starts MUST be HA, so it implements Prafect from the 3K Ref Architecture to meet that requirement | | | |
| Praefect (Instances in ASG with load balancer) | 6 vCPU, 10 GB<br />(across 3 nodes) | **c5.large** x 3 nodes<br />(6 vCPU, 12 GB) | $0.09 x 3 = $0.21/hr | $0.09 x 3 = $0.21/hr |
| Praefect PostgreSQL(1) (AWS RDS) | 6 vCPU, 5.4 GB<br />(across 3 nodes) | N/A Reuses GitLab PostgreSQL | $0 | $0 |
| Praefect (Instances in ASG with load balancer) | 6 vCPU, 10 GB<br />([across 3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections)) | **c5.large** x 3 nodes<br />(6 vCPU, 12 GB) | $0.09 x 3 = $0.21/hr | $0.09 x 3 = $0.21/hr |
| Praefect PostgreSQL(1) (AWS RDS) | 6 vCPU, 5.4 GB<br />([across 3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections)) | N/A Reuses GitLab PostgreSQL | $0 | $0 |
| Internal Load Balancing Node | 2 vCPU, 1.8 GB | AWS ELB | $0.10/hr | $0.10/hr |
### 3K Cloud Native Hybrid on EKS
......@@ -200,10 +200,10 @@ If EKS node autoscaling is employed, it is likely that your average loading will
| **Bastion Host (Quick Start)** | 1 HA instance in ASG | **t2.micro** for prod, **m4.2xlarge** for perf. testing | | |
| **PostgreSQL**<br />AWS Aurora RDS Nodes Configuration (GPT tested) | 18vCPU, 36 GB <br />(across 9 nodes for PostgreSQL, PgBouncer, Consul)<br />Tested with Graviton ARM | **db.r6g.xlarge** x 3 nodes <br />(12vCPU, 96 GB) | 3 nodes x $0.52 = $1.56/hr | 3 nodes x $0.52 = $1.56/hr<br />(Aurora is always 3) |
| **Redis** | 6vCPU, 18GB<br />(across 6 nodes for Redis Cache, Sentinel) | **cache.m6g.large** x 3 nodes<br />(6vCPU, 19GB) | 3 nodes x $0.15 = $0.45/hr | 2 nodes x $0.15 = $0.30/hr |
| **<u>Gitaly Cluster</u>** [Details](gitaly_on_aws.md) | | | | |
| Gitaly Instances (in ASG) | 12 vCPU, 45GB<br />(across 3 nodes) | **m5.large** x 3 nodes<br />(12 vCPU, 48 GB) | $0.192 x 3 = $0.58/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitaly_on_aws.md#gitaly-and-praefect-elections) |
| Praefect (Instances in ASG with load balancer) | 6 vCPU, 5.4 GB<br />(across 3 nodes) | **c5.large** x 3 nodes<br />(6 vCPU, 12 GB) | $0.09 x 3 = $0.21/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitaly_on_aws.md#gitaly-and-praefect-elections) |
| Praefect PostgreSQL(1) (AWS RDS) | 6 vCPU, 5.4 GB<br />(across 3 nodes) | N/A Reuses GitLab PostgreSQL | $0 | |
| **<u>Gitaly Cluster</u>** [Details](gitlab_sre_for_aws.md#gitaly-sre-considerations) | | | | |
| Gitaly Instances (in ASG) | 12 vCPU, 45GB<br />([across 3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections)) | **m5.large** x 3 nodes<br />(12 vCPU, 48 GB) | $0.192 x 3 = $0.58/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitlab_sre_for_aws.md#gitaly-and-praefect-elections) |
| Praefect (Instances in ASG with load balancer) | 6 vCPU, 5.4 GB<br />([across 3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections)) | **c5.large** x 3 nodes<br />(6 vCPU, 12 GB) | $0.09 x 3 = $0.21/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitlab_sre_for_aws.md#gitaly-and-praefect-elections) |
| Praefect PostgreSQL(1) (AWS RDS) | 6 vCPU, 5.4 GB<br />([across 3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections)) | N/A Reuses GitLab PostgreSQL | $0 | |
| Internal Load Balancing Node | 2 vCPU, 1.8 GB | AWS ELB | $0.10/hr | $0.10/hr |
### 5K Cloud Native Hybrid on EKS
......@@ -251,10 +251,10 @@ If EKS node autoscaling is employed, it is likely that your average loading will
| **Bastion Host (Quick Start)** | 1 HA instance in ASG | **t2.micro** for prod, **m4.2xlarge** for perf. testing | | |
| **PostgreSQL**<br />AWS Aurora RDS Nodes Configuration (GPT tested) | 21vCPU, 51 GB <br />(across 9 nodes for PostgreSQL, PgBouncer, Consul)<br />Tested with Graviton ARM | **db.r6g.2xlarge** x 3 nodes <br />(24vCPU, 192 GB) | 3 nodes x $1.04 = $3.12/hr | 3 nodes x $1.04 = $3.12/hr<br />(Aurora is always 3) |
| **Redis** | 9vCPU, 27GB<br />(across 6 nodes for Redis, Sentinel) | **cache.m6g.xlarge** x 3 nodes<br />(12vCPU, 39GB) | 3 nodes x $0.30 = $0.90/hr | 2 nodes x $0.30 = $0.60/hr |
| **<u>Gitaly Cluster</u>** [Details](gitaly_on_aws.md) | | | | |
| Gitaly Instances (in ASG) | 24 vCPU, 90GB<br />(across 3 nodes) | **m5.2xlarge** x 3 nodes<br />(24 vCPU, 96GB) | $0.384 x 3 = $1.15/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitaly_on_aws.md#gitaly-and-praefect-elections) |
| Praefect (Instances in ASG with load balancer) | 6 vCPU, 5.4 GB<br />(across 3 nodes) | **c5.large** x 3 nodes<br />(6 vCPU, 12 GB) | $0.09 x 3 = $0.21/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitaly_on_aws.md#gitaly-and-praefect-elections) |
| Praefect PostgreSQL(1) (AWS RDS) | 6 vCPU, 5.4 GB<br />(across 3 nodes) | N/A Reuses GitLab PostgreSQL | $0 | |
| **<u>Gitaly Cluster</u>** [Details](gitlab_sre_for_aws.md#gitaly-sre-considerations) | | | | |
| Gitaly Instances (in ASG) | 24 vCPU, 90GB<br />([across 3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections)) | **m5.2xlarge** x 3 nodes<br />(24 vCPU, 96GB) | $0.384 x 3 = $1.15/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitlab_sre_for_aws.md#gitaly-and-praefect-elections) |
| Praefect (Instances in ASG with load balancer) | 6 vCPU, 5.4 GB<br />([across 3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections)) | **c5.large** x 3 nodes<br />(6 vCPU, 12 GB) | $0.09 x 3 = $0.21/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitlab_sre_for_aws.md#gitaly-and-praefect-elections) |
| Praefect PostgreSQL(1) (AWS RDS) | 6 vCPU, 5.4 GB<br />([across 3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections)) | N/A Reuses GitLab PostgreSQL | $0 | |
| Internal Load Balancing Node | 2 vCPU, 1.8 GB | AWS ELB | $0.10/hr | $0.10/hr |
### 10K Cloud Native Hybrid on EKS
......@@ -301,10 +301,10 @@ If EKS node autoscaling is employed, it is likely that your average loading will
| **Bastion Host (Quick Start)** | 1 HA instance in ASG | **t2.micro** for prod, **m4.2xlarge** for perf. testing | | |
| **PostgreSQL**<br />AWS Aurora RDS Nodes Configuration (GPT tested) | 36vCPU, 102 GB <br />(across 9 nodes for PostgreSQL, PgBouncer, Consul) | **db.r6g.2xlarge** x 3 nodes <br />(24vCPU, 192 GB) | 3 nodes x $1.04 = $3.12/hr | 3 nodes x $1.04 = $3.12/hr<br />(Aurora is always 3) |
| **Redis** | 30vCPU, 114GB<br />(across 12 nodes for Redis Cache, Redis Queues/Shared State, Sentinel Cache, Sentinel Queues/Shared State) | **cache.m5.2xlarge** x 3 nodes<br />(24vCPU, 78GB) | 3 nodes x $0.62 = $1.86/hr | 2 nodes x $0.62 = $1.24/hr |
| **<u>Gitaly Cluster</u>** [Details](gitaly_on_aws.md) | | | | |
| Gitaly Instances (in ASG) | 48 vCPU, 180GB<br />(across 3 nodes) | **m5.4xlarge** x 3 nodes<br />(48 vCPU, 180 GB) | $0.77 x 3 = $2.31/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitaly_on_aws.md#gitaly-and-praefect-elections) |
| Praefect (Instances in ASG with load balancer) | 6 vCPU, 5.4 GB<br />(across 3 nodes) | **c5.large** x 3 nodes<br />(6 vCPU, 12 GB) | $0.09 x 3 = $0.21/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitaly_on_aws.md#gitaly-and-praefect-elections) |
| Praefect PostgreSQL(1) (AWS RDS) | 6 vCPU, 5.4 GB<br />(across 3 nodes) | N/A Reuses GitLab PostgreSQL | $0 | |
| **<u>Gitaly Cluster</u>** [Details](gitlab_sre_for_aws.md#gitaly-sre-considerations) | | | | |
| Gitaly Instances (in ASG) | 48 vCPU, 180GB<br />([across 3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections)) | **m5.4xlarge** x 3 nodes<br />(48 vCPU, 180 GB) | $0.77 x 3 = $2.31/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitlab_sre_for_aws.md#gitaly-and-praefect-elections) |
| Praefect (Instances in ASG with load balancer) | 6 vCPU, 5.4 GB<br />([across 3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections)) | **c5.large** x 3 nodes<br />(6 vCPU, 12 GB) | $0.09 x 3 = $0.21/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitlab_sre_for_aws.md#gitaly-and-praefect-elections) |
| Praefect PostgreSQL(1) (AWS RDS) | 6 vCPU, 5.4 GB<br />([across 3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections)) | N/A Reuses GitLab PostgreSQL | $0 | |
| Internal Load Balancing Node | 2 vCPU, 1.8 GB | AWS ELB | $0.10/hr | $0.10/hr |
### 50K Cloud Native Hybrid on EKS
......@@ -351,10 +351,10 @@ If EKS node autoscaling is employed, it is likely that your average loading will
| **Bastion Host (Quick Start)** | 1 HA instance in ASG | **t2.micro** for prod, **m4.2xlarge** for perf. testing | | |
| **PostgreSQL**<br />AWS Aurora RDS Nodes Configuration (GPT tested) | 96vCPU, 360 GB <br />(across 3 nodes) | **db.r6g.8xlarge** x 3 nodes <br />(96vCPU, 768 GB total) | 3 nodes x $4.15 = $12.45/hr | 3 nodes x $4.15 = $12.45/hr<br />(Aurora is always 3) |
| **Redis** | 30vCPU, 114GB<br />(across 12 nodes for Redis Cache, Redis Queues/Shared State, Sentinel Cache, Sentinel Queues/Shared State) | **cache.m6g.2xlarge** x 3 nodes<br />(24vCPU, 78GB total) | 3 nodes x $0.60 = $1.80/hr | 2 nodes x $0.60 = $1.20/hr |
| **<u>Gitaly Cluster</u>** [Details](gitaly_on_aws.md) | | | | |
| Gitaly Instances (in ASG) | 64 vCPU, 240GB x 3 nodes | **m5.16xlarge** x 3 nodes<br />(64 vCPU, 256 GB each) | $3.07 x 3 = $9.21/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitaly_on_aws.md#gitaly-and-praefect-elections) |
| Praefect (Instances in ASG with load balancer) | 4 vCPU, 3.6 GB x 3 nodes | **c5.xlarge** x 3 nodes<br />(4 vCPU, 8 GB each) | $0.17 x 3 = $0.51/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitaly_on_aws.md#gitaly-and-praefect-elections) |
| Praefect PostgreSQL(1) (AWS RDS) | 2 vCPU, 1.8 GB x 3 nodes | N/A Reuses GitLab PostgreSQL | $0 | |
| **<u>Gitaly Cluster</u>** [Details](gitlab_sre_for_aws.md#gitaly-sre-considerations) | | | | |
| Gitaly Instances (in ASG) | 64 vCPU, 240GB x [3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections) | **m5.16xlarge** x 3 nodes<br />(64 vCPU, 256 GB each) | $3.07 x 3 = $9.21/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitlab_sre_for_aws.md#gitaly-and-praefect-elections) |
| Praefect (Instances in ASG with load balancer) | 4 vCPU, 3.6 GB x [3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections) | **c5.xlarge** x 3 nodes<br />(4 vCPU, 8 GB each) | $0.17 x 3 = $0.51/hr | [Gitaly & Praefect Must Have an Uneven Node Count for HA](gitlab_sre_for_aws.md#gitaly-and-praefect-elections) |
| Praefect PostgreSQL(1) (AWS RDS) | 2 vCPU, 1.8 GB x [3 nodes](gitlab_sre_for_aws.md#gitaly-and-praefect-elections) | N/A Reuses GitLab PostgreSQL | $0 | |
| Internal Load Balancing Node | 2 vCPU, 1.8 GB | AWS ELB | $0.10/hr | $0.10/hr |
## Helpful Resources
......
---
type: reference, concepts
stage: Enablement
group: Alliances
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
comments: false
description: Doing SRE for GitLab instances and runners on AWS.
type: index
---
# Gitaly SRE Considerations
# GitLab Site Reliability Engineering for AWS
## Known issues list
Known issues are gathered from within GitLab and from customer reported issues. Customers successfully implement GitLab with a variety of "as a Service" components that GitLab has not specifically been designed for, nor has ongoing testing for. While GitLab does take partner technologies very seriously, the highlighting of known issues here is a convenience for implementers and it does not imply that GitLab has targeted compatibility with, nor carries any type of guarantee of running on the partner technology where the issues occur. Please consult individual issues to understand GitLabs stance and plans on any given known issue.
See the [GitLab AWS known issues list](https://gitlab.com/gitlab-com/alliances/aws/public-tracker/-/issues?label_name%5B%5D=AWS+Known+Issue) for a complete list.
## Gitaly SRE considerations
Gitaly and Gitaly Cluster have been engineered by GitLab to overcome fundamental challenges with horizontal scaling of the open source Git binaries. Here is indepth technical reading on the topic:
## Why Gitaly was built
### Why Gitaly was built
Below are some links to better understand why Gitaly was built:
......@@ -18,17 +28,17 @@ Below are some links to better understand why Gitaly was built:
- [Affects on horizontal compute architecture](https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/DESIGN.md#affects-on-horizontal-compute-architecture)
- [Evidence to back building a new horizontal layer to scale Git](https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/DESIGN.md#evidence-to-back-building-a-new-horizontal-layer-to-scale-git)
## Gitaly and Praefect elections
### Gitaly and Praefect elections
As part of Gitaly cluster consistency, Praefect nodes will occasionally need to vote on what data copy is the most accurate. This requires an uneven number of Praefect nodes to avoid stalemates. This means that for HA, Gitaly and Praefect require a minimum of three nodes.
## Gitaly performance monitoring
### Gitaly performance monitoring
Complete performance metrics should be collected for Gitaly instances for identification of bottlenecks, as they could have to do with disk IO, network IO or memory.
Gitaly must be implemented on instance compute.
## Gitaly EBS volume sizing guidelines
### Gitaly EBS volume sizing guidelines
Gitaly storage is expected to be local (not NFS of any type including EFS).
Gitaly servers also need disk space for building and caching Git pack files.
......@@ -40,10 +50,10 @@ Background:
- Use Amazon Linux 2 to ensure the best disk and memory optimizations (for example, ENA network adapters and drivers).
- If GitLab backup scripts are used, they need a temporary space location large enough to hold 2 times the current size of the Git File system. If that will be done on Gitaly servers, separate volumes should be used.
## Gitaly HA in EKS quick start
### Gitaly HA in EKS quick start
The AWS EKS quick start for GitLab Cloud Native implements Gitaly as a multi-zone, self-healing infrastructure. It has specific code for reestablishing a Gitaly node when one fails, including AZ failure.
The [AWS GitLab Cloud Native Hybrid on EKS Quick Start](gitlab_hybrid_on_aws.md#available-infrastructure-as-code-for-gitlab-cloud-native-hybrid) for GitLab Cloud Native implements Gitaly as a multi-zone, self-healing infrastructure. It has specific code for reestablishing a Gitaly node when one fails, including AZ failure.
## Gitaly long term management
### Gitaly long term management
Gitaly node disk sizes will need to be monitored and increased to accommodate Git repository growth and Gitaly temporary and caching storage needs. The storage configuration on all nodes should be kept identical.
......@@ -23,14 +23,18 @@ Implementation patterns are built on the foundational information and testing do
[Omnibus GitLab on AWS EC2 (HA)](manual_install_aws.md) - instructions for installing GitLab on EC2 instances. Manual instructions from which you may build out a GitLab instance or create your own Infrastructure as Code (IaC).
### Gitaly SRE considerations for AWS
### GitLab Site Reliability Engineering (SRE) for AWS
[Gitaly SRE Considerations for AWS](gitaly_on_aws.md) - important information for implementing and managing GitLab Gitaly on AWS.
[GitLab SRE Considerations for AWS](gitlab_sre_for_aws.md) - important information and known issues for planning, implementing, upgrading and long term management of GitLab instances and runners on AWS.
### EKS cluster provisioning best practices
[EKS Cluster Provisioning Patterns](eks_clusters_aws.md) - considerations for setting up EKS cluster for runners and for integrating.
### Scaling HA GitLab Runner on AWS EC2 ASG
The following repository is self-contained in regard to enabling this pattern: [GitLab HA Scaling Runner Vending Machine for AWS EC2 ASG](https://gitlab.com/guided-explorations/aws/gitlab-runner-autoscaling-aws-asg/). The [feature list for this implementation pattern](https://gitlab.com/guided-explorations/aws/gitlab-runner-autoscaling-aws-asg/-/blob/main/FEATURES.md) is good to review to understand the complete value it can deliver.
## Additional details on implementation patterns
GitLab implementation patterns build upon [GitLab Reference Architectures](../../administration/reference_architectures/index.md) in the following ways.
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment