# Repository Storage Types

> [Introduced][ce-28283] in GitLab 10.0.

GitLab can store repositories on disk using two different storage layouts.
This document describes both layouts and their characteristics.

GitLab can be configured to use one or multiple repository shard locations
that can be:

- Mounted to the local disk
- Exposed as an NFS shared volume
- Accessed via [gitaly] on its own machine

In GitLab, this is configured in `/etc/gitlab/gitlab.rb` by the `git_data_dirs({})`
configuration hash. The storage layouts discussed here apply to any shard
defined in it.

The `default` repository shard, available in any installation that has not
customized it, points to the local folder `/var/opt/gitlab/git-data`.
Everything discussed below is expected to be located inside that folder.

## Legacy Storage

Legacy Storage is the storage behavior prior to version 10.0. For historical
reasons, GitLab replicated the mapping structure of the project URLs on disk:

- Project's repository: `#{namespace}/#{project_name}.git`
- Project's wiki: `#{namespace}/#{project_name}.wiki.git`
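
As a minimal sketch of this mapping, the Ruby helper below builds both paths from a namespace and a project name. The method name `legacy_disk_paths` and the sample values are hypothetical, chosen only for illustration; this is not GitLab's actual implementation.

```ruby
# Hypothetical helper (not part of GitLab): builds the Legacy Storage paths,
# relative to the repository shard root, from the URL components.
def legacy_disk_paths(namespace, project_name)
  base = "#{namespace}/#{project_name}"
  { repository: "#{base}.git", wiki: "#{base}.wiki.git" }
end

paths = legacy_disk_paths("mygroup", "myproject")
puts paths[:repository] # mygroup/myproject.git
puts paths[:wiki]       # mygroup/myproject.wiki.git
```

Because both paths are derived from the URL components, renaming `mygroup` or `myproject` changes them.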

This structure made it simple to migrate from existing solutions to GitLab and
easy for Administrators to find where the repository is stored.

On the other hand, this has some drawbacks:

The storage location will contain a huge number of top-level namespaces. The
impact can be reduced by the introduction of [multiple storage
paths][storage-paths].

Because backups are a snapshot of the same URL mapping, if you try to recover a
very old backup, you need to verify whether any project has taken the place of
an old removed or renamed project sharing the same URL. This means that
`mygroup/myproject` from your backup may not be the same original project that
is at that same URL today.

Any change in the URL will need to be reflected on disk (when groups, users, or
projects are renamed). This can add a lot of load in big installations,
especially if using any type of network-based filesystem.

For GitLab Geo in particular: Geo does work with legacy storage, but in some
edge cases due to race conditions it can lead to errors when a project is
renamed multiple times in short succession, or a project is deleted and
recreated under the same name very quickly. We expect these race events to be
rare, and we have not observed a race condition side-effect happening yet.

This pattern also exists in other objects stored in GitLab, like issue
attachments, GitLab Pages artifacts, and Docker containers for the integrated
Registry.

## Hashed Storage

Hashed Storage is the new storage behavior we rolled out with 10.0. Instead
of coupling the project URL to the folder structure where the repository is
stored on disk, we couple the folder structure to a hash based on the project's
ID. This makes the folder structure immutable, and therefore eliminates any
requirement to synchronize state from URLs to the disk structure. This means
that renaming a group, user, or project costs only the database transaction,
and takes effect immediately.

The hash also helps to spread the repositories more evenly on the disk, so the
top-level directory will contain fewer folders than the total number of top-level
namespaces.

The hash format is based on the hexadecimal representation of SHA256:
`SHA256(project.id)`. The top-level folder uses the first 2 characters, followed
by another folder with the next 2 characters. They are both stored in a special
`@hashed` folder, to be able to co-exist with existing Legacy Storage projects:

```ruby
# Project's repository:
"@hashed/#{hash[0..1]}/#{hash[2..3]}/#{hash}.git"

# Wiki's repository:
"@hashed/#{hash[0..1]}/#{hash[2..3]}/#{hash}.wiki.git"
```
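
The path construction above can be sketched in plain Ruby. This is an illustration, not GitLab's actual code: the method name `hashed_disk_path` is hypothetical, and hashing the decimal string form of the ID is an assumption about how `SHA256(project.id)` is computed.

```ruby
require "digest"

# Hypothetical helper (not GitLab's implementation).
# Assumption: the hash input is the project ID in its decimal string form.
def hashed_disk_path(project_id)
  hash = Digest::SHA256.hexdigest(project_id.to_s)
  "@hashed/#{hash[0..1]}/#{hash[2..3]}/#{hash}"
end

base = hashed_disk_path(1)
puts "#{base}.git"      # project's repository
puts "#{base}.wiki.git" # wiki's repository
```

Under these assumptions, project ID 1 maps to `@hashed/6b/86/6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b.git`. The path depends only on the ID, so renaming the project or its namespace leaves it unchanged.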

### Hashed object pools

For deduplication of public forks and their parent repository, objects are pooled
in an object pool. These object pools are a third repository where shared objects
are stored.

```ruby
# object pool paths
"@pools/#{hash[0..1]}/#{hash[2..3]}/#{hash}.git"
```

The object pool feature is behind the `object_pools` feature flag, and can be
enabled for individual projects by executing
`Feature.enable(:object_pools, Project.find(<id>))`. Note that the project has to
be on hashed storage, should not be a fork itself, and hashed storage should be
enabled for all new projects.

### How to migrate to Hashed Storage

To start a migration, enable Hashed Storage for new projects:
 
1. Go to **Admin > Settings** and expand the **Repository Storage** section.
2. Select the **Use hashed storage paths for newly created and renamed projects** checkbox.

Check whether the change breaks any existing integration that either runs on
the same machine where your repositories are located, or logs in to that machine
to access data (for example, a remote backup solution).

To schedule a complete rollout, see the
[rake task documentation for storage migration][rake/migrate-to-hashed] for instructions.

If you do have any existing integration, you may want to do a small rollout first,
to validate. You can do so by specifying a range with the operation.

This is an example of how to limit the rollout to Project IDs 50 to 100, running in
an Omnibus GitLab installation:

```bash
sudo gitlab-rake gitlab:storage:migrate_to_hashed ID_FROM=50 ID_TO=100
```

Check the [documentation][rake/migrate-to-hashed] for additional information and
instructions for source-based installations.

#### Rollback

Similar to the migration, to disable Hashed Storage for new
projects:

1. Go to **Admin > Settings** and expand the **Repository Storage** section.
2. Uncheck the **Use hashed storage paths for newly created and renamed projects** checkbox.

To schedule a complete rollback, see the 
[rake task documentation for storage rollback][rake/rollback-to-legacy] for instructions.

The rollback task also supports specifying a range of Project IDs. Here is an example
of limiting the rollback to Project IDs 50 to 100, in an Omnibus GitLab installation:
 
```bash
sudo gitlab-rake gitlab:storage:rollback_to_legacy ID_FROM=50 ID_TO=100
```

If you have a Geo setup, note that the rollback is not reflected automatically
on the **secondary** node. You may need to wait for a backfill operation to kick in,
and then manually remove the remaining repositories from the special `@hashed/` folder.

### Hashed Storage coverage

We are incrementally moving every storable object in GitLab to the Hashed
Storage pattern. You can check the current coverage status below (and also see
the [issue][ce-2821]).

Note that objects stored in an S3-compatible endpoint do not have the downsides
mentioned earlier, as long as they are not prefixed with `#{namespace}/#{project_name}`,
which is true for CI Cache and LFS Objects.

| Storable Object  | Legacy Storage | Hashed Storage | S3 Compatible | GitLab Version |
| ---------------- | -------------- | -------------- | ------------- | -------------- |
| Repository       | Yes            | Yes            | -             | 10.0           |
| Attachments      | Yes            | Yes            | -             | 10.2           |
| Avatars          | Yes            | No             | -             | -              |
| Pages            | Yes            | No             | -             | -              |
| Docker Registry  | Yes            | No             | -             | -              |
| CI Build Logs    | No             | No             | -             | -              |
| CI Artifacts     | No             | No             | Yes           | 9.4 / 10.6     |
| CI Cache         | No             | No             | Yes           | -              |
| LFS Objects      | Yes            | Similar        | Yes           | 10.0 / 10.7    |
| Repository pools | No             | Yes            | -             | 11.6           |

#### Implementation Details

##### Avatars

Each file is stored in a folder named after its `id` from the database. The filename is always `avatar.png` for user avatars.
When an avatar is replaced, the `Upload` model is destroyed and a new one takes its place with a different `id`.

##### CI Artifacts

CI Artifacts are S3 compatible since **9.4** (GitLab Premium), and available in GitLab Core since **10.6**.

##### LFS Objects

LFS Objects implement a similar storage pattern, using two levels of folders named after 2 characters each, following Git's own implementation:

```ruby
"shared/lfs-objects/#{oid[0..1]}/#{oid[2..3]}/#{oid[4..-1]}"

# Based on object `oid`: `8909029eb962194cfb326259411b22ae3f4a814b5be4f80651735aeef9f3229c`, path will be:
"shared/lfs-objects/89/09/029eb962194cfb326259411b22ae3f4a814b5be4f80651735aeef9f3229c"
```
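
The same split can be sketched as a small Ruby helper. The method name `lfs_object_path` is hypothetical and the `shared/lfs-objects` prefix is taken from the example above; treat it as an illustration rather than GitLab's implementation.

```ruby
# Hypothetical helper (not GitLab's implementation): splits an LFS object's
# `oid` into two 2-character folder levels followed by the remaining characters.
def lfs_object_path(oid)
  File.join("shared/lfs-objects", oid[0..1], oid[2..3], oid[4..-1])
end

oid = "8909029eb962194cfb326259411b22ae3f4a814b5be4f80651735aeef9f3229c"
puts lfs_object_path(oid)
```

Since the path is derived only from the object's `oid`, it does not change when the project or namespace is renamed.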

They are also S3 compatible since **10.0** (GitLab Premium), and available in GitLab Core since **10.7**.

[ce-2821]: https://gitlab.com/gitlab-com/infrastructure/issues/2821
[ce-28283]: https://gitlab.com/gitlab-org/gitlab-ce/issues/28283
[rake/migrate-to-hashed]: raketasks/storage.md#migrate-existing-projects-to-hashed-storage
[rake/rollback-to-legacy]: raketasks/storage.md#rollback
[storage-paths]: repository_storage_types.md
[gitaly]: gitaly/index.md