Geo: Document how our Git push proxy works

f4eb7f77 · Gabriel Mazetto · aa672175 · f4eb7f77
Commit f4eb7f77 authored Apr 27, 2019 by Gabriel Mazetto
Show whitespace changes
Inline Side-by-side

Showing with 73 additions and 27 deletions

doc/development/geo.md doc/development/geo.md +73 -27

No files found.
--- a/doc/development/geo.md
+++ b/doc/development/geo.md
@@ -2,10 +2,48 @@

 Geo connects GitLab instances together. One GitLab instance is
 designated as a **primary** node and can be run with multiple
-**secondary** nodes. Geo orchestrates quite a few components that are
-described in more detail below.
+**secondary** nodes. Geo orchestrates quite a few components that can be seen on
+the diagram below and are described in more detail within this document.

-## Database replication
+![Geo Architecture Diagram](../administration/geo/replication/img/geo_architecture.png)
+
+## Replication layer
+
+Geo handles replication for different components:
+- [Database](#database-replication): includes the entire application, except cache and jobs.
+- [Git repositories](#repository-replication): includes both projects and wikis.
+- [Uploaded blobs](#uploads-replication): includes anything from images attached on issues
+to raw logs and assets from CI.
+
+With the exception of the Database replication, on a *secondary* node, everything is coordinated
+by the [Geo Log Cursor](#geo-log-cursor). 
+
+### Geo Log Cursor daemon
+
+The [Geo Log Cursor daemon](#geo-log-cursor-daemon) is a separate process running on
+each **secondary** node. It monitors the [Geo Event Log](#geo-event-log)
+for new events and creates background jobs for each specific event type.
+
+For example when a repository is updated, the Geo **primary** node creates
+a Geo event with an associated repository updated event. The Geo Log Cursor daemon
+picks the event up and schedules a `Geo::ProjectSyncWorker` job which will
+use the `Geo::RepositorySyncService` and `Geo::WikiSyncService` classes
+to update the repository and the wiki respectively.
+
+The Geo Log Cursor daemon can operate in High Availability mode automatically. 
+The daemon will try to acquire a lock from time to time and once acquired, it 
+will behave as the *active* daemon.
+
+Any additional running daemons on the same node, will be in standby
+mode, ready to resume work if the *active* daemon releases its lock.
+
+We use the [`ExclusiveLease`](https://www.rubydoc.info/github/gitlabhq/gitlabhq/Gitlab/ExclusiveLease) lock type with a small TTL, that is renewed at every
+pooling cycle. That allows us to implement this global lock with a timeout.
+
+At the end of the pooling cycle, if the daemon can't renew and/or reacquire
+the lock, it switches to standby mode.
+
+### Database replication

 Geo uses [streaming replication](#streaming-replication) to replicate
 the database from the **primary** to the **secondary** nodes. This
@@ -13,7 +51,7 @@ replication gives the **secondary** nodes access to all the data saved
 in the database. So users can log in on the **secondary** and read all
 the issues, merge requests, etc. on the **secondary** node.

-## Repository replication
+### Repository replication

 Geo also replicates repositories. Each **secondary** node keeps track of
 the state of every repository in the [tracking database](#tracking-database).
@@ -23,7 +61,7 @@ There are a few ways a repository gets replicated by the:
 - [Repository Sync worker](#repository-sync-worker).
 - [Geo Log Cursor](#geo-log-cursor).

-### Project Registry
+#### Project Registry

 The `Geo::ProjectRegistry` class defines the model used to track the
 state of repository replication. For each project in the main
@@ -32,15 +70,15 @@ database, one record in the tracking database is kept.
 It records the following about repositories:

 - The last time they were synced.
- The last time they were synced successfully.
+- The last time they were successfully synced.
 - If they need to be resynced.
- When retry should be attempted.
+- When a retry should be attempted.
 - The number of retries.
- If and when the they were verified.
+- If and when they were verified.

 It also stores these attributes for project wikis in dedicated columns.

-### Repository Sync worker
+#### Repository Sync worker

 The `Geo::RepositorySyncWorker` class runs periodically in the
 background and it searches the `Geo::ProjectRegistry` model for
@@ -59,26 +97,12 @@ times, Geo does a so-called _redownload_. It will do a clean clone
 into the `@geo-temporary` directory in the root of the storage. When
 it's successful, we replace the main repo with the newly cloned one.

-### Geo Log Cursor
-
-The [Geo Log Cursor](#geo-log-cursor) is a separate process running on
-each **secondary** node. It monitors the [Geo Event Log](#geo-event-log)
-and handles all of the events. When it sees an unhandled event, it
-starts a background worker to handle that event, depending on the type
-of event.
-
-When a repository receives an update, the Geo **primary** node creates
-a Geo event with an associated repository updated event. The cursor
-picks that up, and schedules a `Geo::ProjectSyncWorker` job which will
-use the `Geo::RepositorySyncService` class and `Geo::WikiSyncService`
-class to update the repository and the wiki.
-
-## Uploads replication
+### Uploads replication

 File uploads are also being replicated to the **secondary** node. To
 track the state of syncing, the `Geo::FileRegistry` model is used.

-### File Registry
+#### File Registry

 Similar to the [Project Registry](#project-registry), there is a
 `Geo::FileRegistry` model that tracks the synced uploads.
@@ -86,7 +110,7 @@ Similar to the [Project Registry](#project-registry), there is a
 CI Job Artifacts are synced in a similar way as uploads or LFS
 objects, but they are tracked by `Geo::JobArtifactRegistry` model.

-### File Download Dispatch worker
+#### File Download Dispatch worker

 Also similar to the [Repository Sync worker](#repository-sync-worker),
 there is a `Geo::FileDownloadDispatchWorker` class that is run
@@ -113,7 +137,7 @@ Authorization: GL-Geo <access_key>:<JWT payload>
 ```

 The **primary** node uses the `access_key` field to look up the
-corresponding Geo **secondary** node and decrypts the JWT payload,
+corresponding **secondary** node and decrypts the JWT payload,
 which contains additional information to identify the file
 request. This ensures that the **secondary** node downloads the right
 file for the right database ID. For example, for an LFS object, the
@@ -133,6 +157,28 @@ NOTE: **Note:**
 JWT requires synchronized clocks between the machines
 involved, otherwise it may fail with an encryption error.

+## Git Push to Geo secondary
+
+The Git Push Proxy exists as a functionality built inside the `gitlab-shell` component.
+It is active on a **secondary** node only. It allows the user that has cloned a repository
+from the secondary node to push to the same URL.
+
+Git `push` requests directed to a **secondary** node will be sent over to the **primary** node, 
+while `pull` requests will continue to be served by the **secondary** node for maximum efficiency.
+
+HTTPS and SSH requests are handled differently:
+
+- With HTTPS, we will give the user a `HTTP 302 Redirect` pointing to the project on the **primary** node.
+The git client is wise enough to understand that status code and process the redirection.
+- With SSH, because there is no equivalent way to perform a redirect, we have to proxy the request.
+This is done inside [`gitlab-shell`](https://gitlab.com/gitlab-org/gitlab-shell), by first translating the request 
+to the HTTP protocol, and then proxying it to the **primary** node.
+
+The [`gitlab-shell`](https://gitlab.com/gitlab-org/gitlab-shell) daemon knows when to proxy based on the response 
+from `/api/v4/allowed`. A special `HTTP 300` status code is returned and we execute a "custom action", 
+specified in the response body. The response contains additional data that allows the proxied `push` operation 
+to happen on the **primary** node.
+
 ## Using the Tracking Database

 Along with the main database that is replicated, a Geo **secondary**