Commit d35a199f authored by Dylan Griffith's avatar Dylan Griffith Committed by Dmitry Gruzd

Speed up Advanced global search regex for file path segments

From
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2318#note_367588644
we can see that this regex is capable of [catastrophic
backtracking](https://www.regular-expressions.info/catastrophic.html)
which we believe may be the cause of
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2318 .

This regex will replace the concept of trying to find sections of paths
with a simpler idea of just matching common characters used in file
paths and hopefully should be sufficient for most cases. It's not as
sophisticated as the approach which finds the sections between slashes
ending in a word boundary but it didn't seem easy to refactor that logic
without the backtracking.

We are also adding the
[`remove_duplicates`](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-remove-duplicates-tokenfilter.html)
filter since this new pattern will have lots of overlap with other
patterns and as such we need to remove the duplicates otherwise we'll
have many wasteful tokens in the index. There doesn't seem to be any
real cost to adding this filter since it only removes duplicate tokens
at the identical position so it seems to be sensible in all cases.
parent 6f52498e
---
title: Speed up Advanced global search regex for file path segments
merge_request: 35292
author:
type: performance
......@@ -111,7 +111,7 @@ Patterns:
- `'"((?:\\"|[^"]|\\")*)"'`: captures terms inside quotes, removing the quotes
- `"'((?:\\'|[^']|\\')*)'"`: same as above, for single-quotes
- `'\.([^.]+)(?=\.|\s|\Z)'`: separate terms with periods in-between
- `'\/?([^\/]+)(?=\/|\b)'`: separate path terms `like/this/one`
- `'([\p{L}_.-]+)'` : some common chars in file names to keep the whole filename intact (eg. `my_file-ñame.txt`)
#### `edgeNGram_filter`
......
......@@ -38,7 +38,7 @@ module Elastic
code_analyzer: {
type: 'custom',
tokenizer: 'whitespace',
filter: %w(code lowercase asciifolding)
filter: %w(code lowercase asciifolding remove_duplicates)
},
code_search_analyzer: {
type: 'custom',
......@@ -61,7 +61,7 @@ module Elastic
'"((?:\\"|[^"]|\\")*)"', # capture terms inside quotes, removing the quotes
"'((?:\\'|[^']|\\')*)'", # same as above, for single quotes
'\.([^.]+)(?=\.|\s|\Z)', # separate terms on periods
'\/?([^\/]+)(?=\/|\b)' # separate path terms (like/this/one)
'([\p{L}_.-]+)' # some common chars in file names to keep the whole filename intact (eg. my_file-name.txt)
]
}
},
......
......@@ -622,6 +622,10 @@ RSpec.describe Gitlab::Elastic::SearchResults, :elastic, :sidekiq_might_not_need
Foo.bar(x)
include "bikes-3.4"
/a/longer/file-path/absolute_with_specials.txt
another/file-path/relative-with-specials.txt
/file-path/components-within-slashes/
another/file-path/differeñt-lønguage.txt
us-east-2
bye
......@@ -651,6 +655,22 @@ RSpec.describe Gitlab::Elastic::SearchResults, :elastic, :sidekiq_might_not_need
expect(search_for('"and;colons:too$"')).to include(file_name)
expect(search_for('bar\(x\)')).to include(file_name)
end
it 'finds absolute file paths with slashes and other special chars' do
expect(search_for('"absolute_with_specials.txt"')).to include(file_name)
end
it 'finds relative file paths with slashes and other special chars' do
expect(search_for('"relative-with-specials.txt"')).to include(file_name)
end
it 'finds file path components within slashes for directories' do
expect(search_for('"components-within-slashes"')).to include(file_name)
end
it 'finds file paths for various languages' do
expect(search_for('"differeñt-lønguage.txt"')).to include(file_name)
end
end
end
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment