# Elasticsearch knowledge **(STARTER ONLY)**

This area is to maintain a compendium of useful information when working with Elasticsearch.

Information on how to enable Elasticsearch and perform the initial indexing is in
the [Elasticsearch integration documentation](../integration/elasticsearch.md#enabling-elasticsearch).

## Deep Dive

In June 2019, Mario de la Ossa hosted a Deep Dive (GitLab team members only: `https://gitlab.com/gitlab-org/create-stage/issues/1`) on GitLab's [Elasticsearch integration](../integration/elasticsearch.md) to share his domain specific knowledge with anyone who may work in this part of the code base in the future. You can find the [recording on YouTube](https://www.youtube.com/watch?v=vrvl-tN2EaA), and the slides on [Google Slides](https://docs.google.com/presentation/d/1H-pCzI_LNrgrL5pJAIQgvLX8Ji0-jIKOg1QeJQzChug/edit) and in [PDF](https://gitlab.com/gitlab-org/create-stage/uploads/c5aa32b6b07476fa8b597004899ec538/Elasticsearch_Deep_Dive.pdf). Everything covered in this deep dive was accurate as of GitLab 12.0, and while specific details may have changed since then, it should still serve as a good introduction.

## Supported Versions

See [Version Requirements](../integration/elasticsearch.md#version-requirements).

Developers making significant changes to Elasticsearch queries should test their features against all our supported versions.

## Setting up development environment

See the [Elasticsearch GDK setup instructions](https://gitlab.com/gitlab-org/gitlab-development-kit/blob/master/doc/howto/elasticsearch.md).

## Helpful Rake tasks

- `gitlab:elastic:test:index_size`: Tells you how much space the current index is using, as well as how many documents are in the index.
- `gitlab:elastic:test:index_size_change`: Outputs index size, reindexes, and outputs index size again. Useful when testing improvements to indexing size.

Additionally, if you need large repositories or multiple forks for testing, please consider [following these instructions](rake_tasks.md#extra-project-seed-options).

## How does it work?

The Elasticsearch integration depends on an external indexer. We ship an [indexer written in Go](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer). The user must trigger the initial indexing via a Rake task, but after this is done, GitLab itself triggers reindexing when required via `after_` callbacks on create, update, and destroy that are inherited from [/ee/app/models/concerns/elastic/application_versioned_search.rb](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/models/concerns/elastic/application_versioned_search.rb).

After initial indexing is complete, create, update, and delete operations for all models except projects (see [#207494](https://gitlab.com/gitlab-org/gitlab/-/issues/207494)) are tracked in a Redis [`ZSET`](https://redis.io/topics/data-types#sorted-sets). A regular `sidekiq-cron` `ElasticIndexBulkCronWorker` processes this queue, updating many Elasticsearch documents at a time with the [Bulk Request API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html).

Search queries are generated by the concerns found in [ee/app/models/concerns/elastic](https://gitlab.com/gitlab-org/gitlab/tree/master/ee/app/models/concerns/elastic). These concerns are also in charge of access control, and have been a historic source of security bugs, so please pay close attention to them!
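
The queued-update flow described above can be sketched in plain Ruby. This is an illustrative model only: the class and method names (`BulkQueue`, `track`, `bulk_payload`) are hypothetical, and a Hash stands in for the real Redis `ZSET` used by `ElasticIndexBulkCronWorker`.

```ruby
require "json"

# Hypothetical sketch of the queue-then-bulk flow; not the actual GitLab classes.
class BulkQueue
  def initialize
    @zset = {} # member => score, mimicking a Redis sorted set
  end

  # Called from the after_ callbacks: record that a document changed.
  def track(ref, score)
    @zset[ref] = score
  end

  # Called from the cron worker: drain refs in score order and build the
  # newline-delimited JSON body for the Bulk Request API.
  def bulk_payload(limit: 1_000)
    refs = @zset.sort_by { |_, score| score }.first(limit).map(&:first)
    refs.flat_map do |ref|
      [{ index: { _id: ref } }.to_json, { ref: ref }.to_json]
    end.join("\n")
  end
end

queue = BulkQueue.new
queue.track("Issue 1", 100)
queue.track("Issue 2", 101)
payload = queue.bulk_payload
```

Batching many document updates into a single bulk request is what keeps indexing load manageable compared to one request per model callback.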

## Existing Analyzers/Tokenizers/Filters

These are all defined in [ee/lib/elastic/latest/config.rb](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/lib/elastic/latest/config.rb).

### Analyzers

#### `path_analyzer`

Used when indexing blobs' paths. Uses the `path_tokenizer` and the `lowercase` and `asciifolding` filters.

Please see the `path_tokenizer` explanation below for an example.

#### `sha_analyzer`

Used in blobs and commits. Uses the `sha_tokenizer` and the `lowercase` and `asciifolding` filters.

Please see the `sha_tokenizer` explanation below for an example.

#### `code_analyzer`

Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the filters: [`code`](#code), `lowercase`, and `asciifolding`.

The `whitespace` tokenizer was selected in order to have more control over how tokens are split. For example, the string `Foo::bar(4)` needs to generate tokens like `Foo` and `bar(4)` in order to be properly searched.

Please see the `code` filter for an explanation of how tokens are split.

NOTE: **Known Issues**:
Currently the [Elasticsearch code_analyzer doesn't account for all code cases](../integration/elasticsearch.md#known-issues).

#### `code_search_analyzer`

Not directly used for indexing, but rather used to transform a search input. Uses the `whitespace` tokenizer and the `lowercase` and `asciifolding` filters.

### Tokenizers

#### `sha_tokenizer`

This is a custom tokenizer that uses the [`edgeNGram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenizer.html) to allow SHAs to be searchable by any subset of them (minimum of 5 characters).

Example:

`240c29dc7e` becomes:

- `240c2`
- `240c29`
- `240c29d`
- `240c29dc`
- `240c29dc7`
- `240c29dc7e`
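
A minimal sketch of how an edge n-gram tokenizer with a minimum gram length of 5 produces the sub-strings above (the helper name `edge_ngrams` is ours, not an Elasticsearch API):

```ruby
# Sketch of edge n-gram generation with min_gram: 5; illustrative only.
def edge_ngrams(term, min_gram: 5, max_gram: 64)
  (min_gram..[term.length, max_gram].min).map { |len| term[0, len] }
end

edge_ngrams("240c29dc7e")
# => ["240c2", "240c29", "240c29d", "240c29dc", "240c29dc7", "240c29dc7e"]
```

Because every prefix of at least 5 characters is stored as its own token, searching for any such prefix of a SHA matches the indexed document.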

#### `path_tokenizer`

This is a custom tokenizer that uses the [`path_hierarchy` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pathhierarchy-tokenizer.html) with `reverse: true` in order to allow searches to find paths no matter how much or how little of the path is given as input.

Example:

`'/some/path/application.js'` becomes:

- `'/some/path/application.js'`
- `'some/path/application.js'`
- `'path/application.js'`
- `'application.js'`
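
The reversed hierarchy split can be sketched as follows (the helper name `reverse_path_hierarchy` is ours, not an Elasticsearch API):

```ruby
# Sketch of a reversed path-hierarchy split; illustrative only.
def reverse_path_hierarchy(path, delimiter: "/")
  parts = path.split(delimiter)
  (0...parts.length).map { |i| parts[i..-1].join(delimiter) }
end

reverse_path_hierarchy("/some/path/application.js")
# => ["/some/path/application.js", "some/path/application.js",
#     "path/application.js", "application.js"]
```

Because every suffix of the path is emitted as a token, a search for just `application.js` still matches the full path.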

### Filters

#### `code`

Uses a [Pattern Capture token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-capture-tokenfilter.html) to split tokens into more easily searched versions of themselves.

Patterns:

- `"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)"`: captures CamelCased and lowerCamelCased strings as separate tokens
- `"(\\d+)"`: extracts digits
- `"(?=([\\p{Lu}]+[\\p{L}]+))"`: captures CamelCased strings recursively. For example: `ThisIsATest` => `[ThisIsATest, IsATest, ATest, Test]`
- `'"((?:\\"|[^"]|\\")*)"'`: captures terms inside quotes, removing the quotes
- `"'((?:\\'|[^']|\\')*)'"`: same as above, for single quotes
- `'\.([^.]+)(?=\.|\s|\Z)'`: separates terms with periods in-between
- `'\/?([^\/]+)(?=\/|\b)'`: separates path terms `like/this/one`
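
The recursive CamelCase pattern can be tried directly in Ruby by scanning with the lookahead group (illustrative only; the real splitting happens inside Elasticsearch's pattern capture filter):

```ruby
# Because the capture lives inside a zero-width lookahead, overlapping
# matches are all emitted, which is what makes the split recursive.
tokens = "ThisIsATest".scan(/(?=([\p{Lu}]+[\p{L}]+))/).flatten
# => ["ThisIsATest", "IsATest", "ATest", "Test"]
```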

#### `edgeNGram_filter`

Uses an [Edge NGram token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenfilter.html) to allow inputs with only parts of a token to find the token. For example, it would turn `glasses` into permutations starting with `gl` and ending with `glasses`, which would allow a search for "`glass`" to find the original token `glasses`.
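
The filter's behavior for `glasses` can be sketched like this, assuming a minimum gram length of 2 as described above (the helper name `edge_ngram_filter` is ours, not an Elasticsearch API):

```ruby
# Sketch of an edge n-gram token filter with an assumed min_gram of 2.
def edge_ngram_filter(token, min_gram: 2)
  (min_gram..token.length).map { |len| token[0, len] }
end

edge_ngram_filter("glasses")
# => ["gl", "gla", "glas", "glass", "glasse", "glasses"]
```

`glass` appears among the emitted grams, which is why that search input finds the original token.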

## Gotchas

- Searches can have their own analyzers. Remember to check when editing analyzers.
- `Character` filters (as opposed to token filters) always replace the original character, so they're not a good choice as they can hinder exact searches.

## Zero downtime reindexing with multiple indices

Currently, GitLab can only handle a single version of settings. Any setting or schema change would require reindexing everything from scratch. Since reindexing can take a long time, this can cause search functionality downtime.

To avoid downtime, GitLab is working to support multiple indices that
can function at the same time. Whenever the schema changes, the admin
will be able to create a new index and reindex to it, while searches
continue to go to the older, stable index. Any data updates will be
forwarded to both indices. Once the new index is ready, an admin can
mark it active, which will direct all searches to it, and remove the old
index.

This is also helpful for migrating to new servers, for example moving to or from AWS.

Currently, we are in the process of migrating to this new design. Everything is hardwired to work with a single version for now.

### Architecture

The traditional setup, provided by `elasticsearch-rails`, is to communicate through its internal proxy classes. Developers would write model-specific logic in a module for the model to include (e.g. `SnippetsSearch`). The `__elasticsearch__` methods would return a proxy object, e.g.:

- `Issue.__elasticsearch__` returns an instance of `Elasticsearch::Model::Proxy::ClassMethodsProxy`
- `Issue.first.__elasticsearch__` returns an instance of `Elasticsearch::Model::Proxy::InstanceMethodsProxy`.

These proxy objects would talk to the Elasticsearch server directly (see the top half of the diagram).

![Elasticsearch Architecture](img/elasticsearch_architecture.svg)

In the planned new design, each model would have a pair of corresponding subclassed proxy objects, in which model-specific logic is located. For example, `Snippet` would have `SnippetClassProxy` and `SnippetInstanceProxy` (subclasses of `Elasticsearch::Model::Proxy::ClassMethodsProxy` and `Elasticsearch::Model::Proxy::InstanceMethodsProxy`, respectively).

`__elasticsearch__` would represent another layer of proxy object, keeping track of multiple actual proxy objects. It would forward method calls to the appropriate index. For example:

- `model.__elasticsearch__.search` would be forwarded to the one stable index, since it is a read operation.
- `model.__elasticsearch__.update_document` would be forwarded to all indices, to keep all indices up-to-date.
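
The forwarding behavior could look roughly like this. The class names (`MultiIndexProxy`, `RecordingProxy`) are hypothetical stand-ins, not the real implementation:

```ruby
# Hypothetical sketch of the forwarding layer described above: reads go to
# the stable index's proxy, writes fan out to every index's proxy.
class MultiIndexProxy
  def initialize(stable:, all:)
    @stable = stable # proxy for the active (searchable) index
    @all = all       # proxies for all indices, including ones being built
  end

  def search(query)
    @stable.search(query) # read operation: stable index only
  end

  def update_document(doc)
    @all.map { |proxy| proxy.update_document(doc) } # write: keep all in sync
  end
end

# Stand-in for a per-index proxy, recording what it receives.
class RecordingProxy < Struct.new(:name, :received)
  def search(query)
    "#{name}: results for #{query}"
  end

  def update_document(doc)
    (self.received ||= []) << doc
  end
end

stable = RecordingProxy.new("v12p1")
candidate = RecordingProxy.new("v12p3")
multi = MultiIndexProxy.new(stable: stable, all: [stable, candidate])

multi.search("foo")          # served by v12p1 only
multi.update_document("doc") # recorded by both proxies
```

Keeping writes fanned out while reads stay pinned to one index is what lets the new index catch up in the background without affecting search results.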

The global configurations per version are now in the `Elastic::(Version)::Config` class. You can change mappings there.

### Creating new version of schema

NOTE: **Note:** This is not applicable yet, as the multiple indices functionality is not fully implemented.

Folders like `ee/lib/elastic/v12p1` contain snapshots of search logic from different versions. To keep a continuous Git history, the latest version lives under `ee/lib/elastic/latest`, but its classes are aliased under an actual version (e.g. `ee/lib/elastic/v12p3`). When referencing these classes, never use the `Latest` namespace directly, but use the actual version (e.g. `V12p3`).

The version name basically follows GitLab's release version. If a setting is changed in 12.3, we create a new namespace called `V12p3` (`p` stands for "point"). Raise an issue if there is a need to name a version differently.

If the current version is `v12p1`, and we need to create a new version for `v12p3`, the steps are as follows:

1. Copy the entire folder of `v12p1` as `v12p3`
1. Change the namespace for files under `v12p3` folder from `V12p1` to `V12p3` (which are still aliased to `Latest`)
1. Delete `v12p1` folder
1. Copy the entire folder of `latest` as `v12p1`
1. Change the namespace for files under `v12p1` folder from `Latest` to `V12p1`
1. Make changes to files under the `latest` folder as needed
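
The folder shuffle above can be sketched against a scratch directory. This is illustrative only: the paths and file contents are stand-ins, and in the real tree the folders live under `ee/lib/elastic`.

```ruby
require "fileutils"
require "tmpdir"

# Set up a scratch tree with stand-in files for v12p1 and latest.
root = Dir.mktmpdir
FileUtils.mkdir_p("#{root}/v12p1")
FileUtils.mkdir_p("#{root}/latest")
File.write("#{root}/v12p1/config.rb", "module Elastic::V12p1; end")
File.write("#{root}/latest/config.rb", "module Elastic::Latest; end")

# Rewrite a namespace in every Ruby file of a folder.
renamespace = lambda do |dir, from, to|
  Dir.glob("#{dir}/*.rb").each { |f| File.write(f, File.read(f).gsub(from, to)) }
end

FileUtils.cp_r("#{root}/v12p1", "#{root}/v12p3")     # 1. copy v12p1 as v12p3
renamespace.call("#{root}/v12p3", "V12p1", "V12p3")  # 2. V12p1 -> V12p3
FileUtils.rm_rf("#{root}/v12p1")                     # 3. delete v12p1
FileUtils.cp_r("#{root}/latest", "#{root}/v12p1")    # 4. copy latest as v12p1
renamespace.call("#{root}/v12p1", "Latest", "V12p1") # 5. Latest -> V12p1
```

After these steps, `latest` is free to receive the new schema changes while `v12p1` preserves the previous version's logic.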

## Troubleshooting

### Getting `flood stage disk watermark [95%] exceeded`

You might get an error such as:

```plaintext
[2018-10-31T15:54:19,762][WARN ][o.e.c.r.a.DiskThresholdMonitor] [pval5Ct]
   flood stage disk watermark [95%] exceeded on
   [pval5Ct7SieH90t5MykM5w][pval5Ct][/usr/local/var/lib/elasticsearch/nodes/0] free: 56.2gb[3%],
   all indices on this node will be marked read-only
```

This is because you've exceeded the disk space threshold. Elasticsearch thinks you don't have enough disk space left, based on the default 95% threshold.

In addition, the `read_only_allow_delete` setting will be set to `true`, which blocks indexing, `forcemerge`, and so on. You can check the current index settings with:

```shell
curl "http://localhost:9200/gitlab-development/_settings?pretty"
```

Add this to your `elasticsearch.yml` file:

```yaml
# turn off the disk allocator
cluster.routing.allocation.disk.threshold_enabled: false
```

_or_

```yaml
# set your own limits
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.flood_stage: 5gb   # ES 6.x only
cluster.routing.allocation.disk.watermark.low: 15gb
cluster.routing.allocation.disk.watermark.high: 10gb
```

Restart Elasticsearch, and the `read_only_allow_delete` setting will clear on its own.

_from "Disk-based Shard Allocation | Elasticsearch Reference" [5.6](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/disk-allocator.html#disk-allocator) and [6.x](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/disk-allocator.html)_