Speed up Advanced global search regex for file path segments
From https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2318#note_367588644 we can see that this regex is capable of [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html), which we believe may be the cause of https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2318. This change replaces the approach of finding sections of paths with a simpler regex that just matches the characters commonly used in file paths, which should hopefully be sufficient for most cases. It's not as sophisticated as the previous approach, which finds the sections between slashes ending in a word boundary, but it didn't seem easy to refactor that logic without the backtracking.

We are also adding the [`remove_duplicates`](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-remove-duplicates-tokenfilter.html) filter. The new pattern will overlap heavily with the other patterns, so without this filter we would end up with many wasteful duplicate tokens in the index. There doesn't seem to be any real cost to adding it, since it only removes duplicate tokens at the same position, so it seems sensible in all cases.
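
To illustrate why overlapping patterns produce duplicates, here is a minimal Python sketch of the idea. It is not the analyzer code: the two patterns, `capture_tokens`, and the local `remove_duplicates` function are hypothetical stand-ins, and the sketch assumes every captured term is emitted at the same position as the token it came from.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the
# patterns used in the actual analyzer configuration.
PATTERNS = [
    re.compile(r"([\w./-]+)"),  # runs of characters common in file paths
    re.compile(r"(\w+)"),       # plain word runs
]


def capture_tokens(token, position):
    """Emit (position, term) for the original token and every pattern match.

    Assumption: all captures share the position of the source token.
    """
    emitted = [(position, token)]
    for pattern in PATTERNS:
        for match in pattern.finditer(token):
            emitted.append((position, match.group(1)))
    return emitted


def remove_duplicates(tokens):
    """Drop a token if the same term was already seen at the same position."""
    seen = set()
    unique = []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            unique.append(tok)
    return unique


raw = capture_tokens("app/models/user.rb", 0)
deduped = remove_duplicates(raw)
print(len(raw), len(deduped))
```

In this sketch the path-character pattern matches the whole input, so it repeats the original token, and the deduplication step drops that repeat. Both hypothetical patterns are single character classes with one quantifier, so they have no nested repetition for the regex engine to backtrack over.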