Commit fa0923dd authored by Stan Hu's avatar Stan Hu

Merge branch '42896-hashed-storage-beta-docs' into 'master'

Resolve "Hashed storage is now beta"

Closes #42896

See merge request gitlab-org/gitlab-ce!17015
parents c5ad5a08 4e3ad326
...@@ -4,50 +4,63 @@ ...@@ -4,50 +4,63 @@
## Legacy Storage ## Legacy Storage
Legacy Storage is the storage behavior prior to version 10.0. For historical reasons, GitLab replicated the same Legacy Storage is the storage behavior prior to version 10.0. For historical
mapping structure from the projects URLs: reasons, GitLab replicated the same mapping structure from the projects URLs:
* Project's repository: `#{namespace}/#{project_name}.git` * Project's repository: `#{namespace}/#{project_name}.git`
* Project's wiki: `#{namespace}/#{project_name}.wiki.git` * Project's wiki: `#{namespace}/#{project_name}.wiki.git`
This structure made simple to migrate from existing solutions to GitLab and easy for Administrators to find where the This structure made it simple to migrate from existing solutions to GitLab and
repository is stored. easy for Administrators to find where the repository is stored.
On the other hand this has some drawbacks: On the other hand this has some drawbacks:
Storage location will concentrate huge amount of top-level namespaces. The impact can be reduced by the introduction of [multiple storage paths][storage-paths]. Storage location will concentrate huge amount of top-level namespaces. The
impact can be reduced by the introduction of [multiple storage
paths][storage-paths].
Because Backups are a snapshot of the same URL mapping, if you try to recover a very old backup, you need to verify Because backups are a snapshot of the same URL mapping, if you try to recover a
if any project has taken the place of an old removed project sharing the same URL. This means that `mygroup/myproject` very old backup, you need to verify whether any project has taken the place of
from your backup may not be the same original project that is today in the same URL. an old removed or renamed project sharing the same URL. This means that
`mygroup/myproject` from your backup may not be the same original project that
is at that same URL today.
Any change in the URL will need to be reflected on disk (when groups / users or projects are renamed). This can add a lot Any change in the URL will need to be reflected on disk (when groups / users or
of load in big installations, and can be even worst if they are using any type of network based filesystem. projects are renamed). This can add a lot of load in big installations,
especially if using any type of network based filesystem.
Last, for GitLab Geo, this storage type means we have to synchronize the disk state, replicate renames in the correct For GitLab Geo in particular: Geo does work with legacy storage, but in some
order or we may end-up with wrong repository or missing data temporarily. edge cases due to race conditions it can lead to errors when a project is
renamed multiple times in short succession, or a project is deleted and
recreated under the same name very quickly. We expect these race events to be
rare, and we have not observed a race condition side-effect happening yet.
This pattern also exists in other objects stored in GitLab, like issue Attachments, GitLab Pages artifacts, This pattern also exists in other objects stored in GitLab, like issue
Docker Containers for the integrated Registry, etc. Attachments, GitLab Pages artifacts, Docker Containers for the integrated
Registry, etc.
## Hashed Storage ## Hashed Storage
Hashed Storage is the new storage behavior we are rolling out with 10.0. It's not enabled by default yet, but we > **Warning:** Hashed storage is in **Beta**. For the latest updates, check the
encourage everyone to try-it and take the time to fix any script you may have that depends on the old behavior. > associated [issue](https://gitlab.com/gitlab-com/infrastructure/issues/2821)
> and please report any problems you encounter.
Instead of coupling project URL and the folder structure where the repository will be stored on disk, we are coupling Hashed Storage is the new storage behavior we are rolling out with 10.0. Instead
a hash, based on the project's ID. of coupling project URL and the folder structure where the repository will be
stored on disk, we are coupling a hash, based on the project's ID. This makes
the folder structure immutable, and therefore eliminates any requirement to
synchronize state from URLs to disk structure. This means that renaming a group,
user, or project will cost only the database transaction, and will take effect
immediately.
This makes the folder structure immutable, and therefore eliminates any requirement to synchronize state from URLs to The hash also helps to spread the repositories more evenly on the disk, so the
disk structure. This means that renaming a group, user or project will cost only the database transaction, and will take top-level directory will contain less folders than the total amount of top-level
effect immediately. namespaces.
The hash also helps to spread the repositories more evenly on the disk, so the top-level directory will contain less The hash format is based on the hexadecimal representation of SHA256:
folders than the total amount of top-level namespaces. `SHA256(project.id)`. The top-level folder uses the first 2 characters, followed
by another folder with the next 2 characters. They are both stored in a special
Hash format is based on hexadecimal representation of SHA256: `SHA256(project.id)`. `@hashed` folder, to be able to co-exist with existing Legacy Storage projects:
Top-level folder uses first 2 characters, followed by another folder with the next 2 characters. They are both stored in
a special folder `@hashed`, to co-exist with existing Legacy projects:
```ruby ```ruby
# Project's repository: # Project's repository:
...@@ -57,15 +70,13 @@ a special folder `@hashed`, to co-exist with existing Legacy projects: ...@@ -57,15 +70,13 @@ a special folder `@hashed`, to co-exist with existing Legacy projects:
"@hashed/#{hash[0..1]}/#{hash[2..3]}/#{hash}.wiki.git" "@hashed/#{hash[0..1]}/#{hash[2..3]}/#{hash}.wiki.git"
``` ```
This new format also makes possible to restore backups with confidence, as when restoring a repository from the backup,
you will never mistakenly restore a repository in the wrong project (considering the backup is made after the migration).
### How to migrate to Hashed Storage ### How to migrate to Hashed Storage
In GitLab, go to **Admin > Settings**, find the **Repository Storage** section and select In GitLab, go to **Admin > Settings**, find the **Repository Storage** section
"_Create new projects using hashed storage paths_". and select "_Create new projects using hashed storage paths_".
To migrate your existing projects to the new storage type, check the specific [rake tasks]. To migrate your existing projects to the new storage type, check the specific
[rake tasks].
[ce-28283]: https://gitlab.com/gitlab-org/gitlab-ce/issues/28283 [ce-28283]: https://gitlab.com/gitlab-org/gitlab-ce/issues/28283
[rake tasks]: raketasks/storage.md#migrate-existing-projects-to-hashed-storage [rake tasks]: raketasks/storage.md#migrate-existing-projects-to-hashed-storage
...@@ -73,11 +84,13 @@ To migrate your existing projects to the new storage type, check the specific [r ...@@ -73,11 +84,13 @@ To migrate your existing projects to the new storage type, check the specific [r
### Hashed Storage coverage ### Hashed Storage coverage
We are incrementally moving every storable object in GitLab to the Hashed Storage pattern. You can check the current We are incrementally moving every storable object in GitLab to the Hashed
coverage status below. Storage pattern. You can check the current coverage status below (and also see
the [issue](https://gitlab.com/gitlab-com/infrastructure/issues/2821)).
Note that things stored in an S3 compatible endpoint will not have the downsides mentioned earlier, if they are not Note that things stored in an S3 compatible endpoint will not have the downsides
prefixed with `#{namespace}/#{project_name}`, which is true for CI Cache and LFS Objects. mentioned earlier, if they are not prefixed with `#{namespace}/#{project_name}`,
which is true for CI Cache and LFS Objects.
| Storable Object | Legacy Storage | Hashed Storage | S3 Compatible | GitLab Version | | Storable Object | Legacy Storage | Hashed Storage | S3 Compatible | GitLab Version |
| --------------- | -------------- | -------------- | ------------- | -------------- | | --------------- | -------------- | -------------- | ------------- | -------------- |
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment