    Change more Elasticsearch indexes to keyword type · 5f33219d
    Dylan Griffith authored
    Related to https://gitlab.com/gitlab-org/gitlab/-/issues/213035 .
    
    The [Elasticsearch keyword type](
    https://www.elastic.co/guide/en/elasticsearch/reference/7.10/keyword.html)
    "is used for structured content such as IDs, email addresses, hostnames,
    status codes, zip codes, or tags". This type is preferred over the
    current [text type](
    https://www.elastic.co/guide/en/elasticsearch/reference/7.10/text.html)
    because the text type takes up more storage.
    
    The `text` type analyzes the value as though it were human-readable text
    (i.e. splitting it into words) and indexes each word separately in the
    inverted index. As such, the `text` type will usually take up more space
    in the inverted index and should only be used when you need to search
    for individual words in the text.
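
The difference in what lands in the inverted index can be sketched in plain Ruby. This is only an illustration: `text_terms` and `keyword_terms` are hypothetical helpers, and the splitting rule is a deliberately crude stand-in, not the analyzer Elasticsearch actually applies.

```ruby
# Crude stand-in for a text analyzer (an assumption, not Elasticsearch's
# real behaviour): lowercase, then split on anything that is not a
# letter or digit.
def text_terms(value)
  value.downcase.split(/[^a-z0-9]+/).reject(&:empty?)
end

# A keyword field indexes the whole value as a single term.
def keyword_terms(value)
  [value]
end

branch = "feature/add-keyword-mapping" # made-up branch name
puts text_terms(branch).inspect    # several terms, one per word
puts keyword_terms(branch).inspect # exactly one term
```

Under this simplification the branch name produces four index terms as `text` and one as `keyword`, which is where the storage difference comes from.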
    
    For each of the fields below, the `text` type adds no value and possibly
    makes certain searches incorrect. After testing locally, this change
    appears to save `4%` of disk storage.
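
As a rough picture of the kind of change involved, the mapping for a few of the affected fields might look like the following before and after. The hash layout follows the general shape of an Elasticsearch mapping body; the constant names are hypothetical and this is not the actual contents of `config.rb`.

```ruby
# Hypothetical before/after mapping fragments for illustration only.
FIELDS = %i[state merge_status target_branch source_branch].freeze

# Before: every field is analyzed as free text.
MAPPING_BEFORE = {
  properties: FIELDS.to_h { |name| [name, { type: "text" }] }
}.freeze

# After: every field is stored as a single exact-match term.
MAPPING_AFTER = {
  properties: FIELDS.to_h { |name| [name, { type: "keyword" }] }
}.freeze

puts MAPPING_AFTER[:properties][:target_branch].inspect
```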
    
    As per
    https://gitlab.com/gitlab-org/gitlab/-/issues/213035#note_439629162 here
    is the reasoning on a per field basis:
    
    1. `state/merge_status` => We only do exact matches against this for
    filtering. It's only 1 word so changing to keyword won't make any
    difference
    2. `target_branch/source_branch` => these are not used in any searches
    today so there is no risk to changing the index options. Changing this
    to keyword should have a decent storage improvement as these can be
    quite long and composed of many words
    3. `merge_status` => this is not used in any searches today so there is
    no risk to changing the index options. The values appear to be things like
    `can_be_merged/cannot_be_merged/unchecked`, which implies to me that it
    should be a keyword anyway: splitting this by word would produce wrong
    results if we ever did filter on it, and keyword will save some storage.
    4. `commit.(committer/author).email` => this is used in commit searches
    today and it's hard to know exactly how this might be used by our
    current users. Users will lose some behaviour, though, if they were
    searching for partial email addresses before. For example you can
    [search for `dyl.griffith`](
    https://gitlab.com/search?scope=commits&repository_ref=&search=dyl.griffith&group_id=9970&project_id=278964)
    and you will find commits authored by my email address which starts with
    `dyl.griffith`. After this change to use keyword you'd need to search
    for the entire exact email address, or you could use the prefix search
    `dyl.griffith*`. However, since wildcards (prefix searches)
    can only be used at the end of the word, you will not be able to search
    for `griffith` alone after this change.
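
The email behaviour described in item 4 can be modelled with a small Ruby sketch. `keyword_match?` is a hypothetical helper that mimics only the rules stated above (exact match, or a prefix match when the query ends in `*`); it is not how Elasticsearch evaluates queries internally, and the email address is a placeholder.

```ruby
# Hypothetical model of matching against a value indexed as one keyword
# term: a trailing "*" means prefix match, anything else must be exact.
def keyword_match?(indexed_value, query)
  if query.end_with?("*")
    indexed_value.start_with?(query.chomp("*"))
  else
    indexed_value == query
  end
end

email = "dyl.griffith@example.com" # placeholder, not a real address

puts keyword_match?(email, "dyl.griffith@example.com") # exact address
puts keyword_match?(email, "dyl.griffith*")            # prefix search
puts keyword_match?(email, "dyl.griffith")             # partial, no wildcard
puts keyword_match?(email, "griffith")                 # inner word only
```

The first two queries match and the last two do not, which is the behaviour change users of partial email searches would notice.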
config.rb 9.23 KB