1. 02 Nov, 2020 1 commit
    • Change more Elasticsearch indexes to keyword type · 5f33219d
      Dylan Griffith authored
      Related to https://gitlab.com/gitlab-org/gitlab/-/issues/213035 .
      
      The [Elasticsearch keyword type](
      https://www.elastic.co/guide/en/elasticsearch/reference/7.10/keyword.html)
      "is used for structured content such as IDs, email addresses, hostnames,
      status codes, zip codes, or tags". This type is preferred over the
      current [text type](
      https://www.elastic.co/guide/en/elasticsearch/reference/7.10/text.html)
      as the text type takes up more storage.
      
      The `text` type splits up the text as though it were human-readable text
      (i.e. splitting words apart) and indexes each word separately in the
      inverted index. As such the `text` type will usually take up more space
      in the inverted index and should only be used when you need to search
      for individual words in the text.
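
      As a rough illustration of the difference, the sketch below builds a toy
      inverted index both ways. The tokenizer is a simplified stand-in (lowercase,
      split on non-word characters), not Elasticsearch's actual analyzer, and the
      function names and sample branch names are hypothetical.

```python
import re
from collections import defaultdict


def analyze_text(value):
    """Toy stand-in for a `text` analyzer: lowercase, split on non-word chars."""
    return [term for term in re.split(r"\W+", value.lower()) if term]


def analyze_keyword(value):
    """A `keyword` field indexes the whole value as one term."""
    return [value]


def build_inverted_index(docs, analyze):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for term in analyze(value):
            index[term].add(doc_id)
    return dict(index)


# Hypothetical branch names, used only for illustration.
docs = {
    1: "feature/add-keyword-mapping",
    2: "feature/fix-keyword-search",
}

text_index = build_inverted_index(docs, analyze_text)
keyword_index = build_inverted_index(docs, analyze_keyword)

print(sorted(text_index))     # individual words from both branch names
print(sorted(keyword_index))  # the two full branch names
```

      In this toy case the `text`-style index holds six distinct terms while the
      `keyword`-style index holds two, one per whole value.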
      
      For each of the fields below, the `text` type adds no value and may make
      certain searches incorrect. In local testing, this change appears to
      save `4%` of disk storage.
      
      As per
      https://gitlab.com/gitlab-org/gitlab/-/issues/213035#note_439629162 here
      is the reasoning on a per field basis:
      
      1. `state/merge_status` => We only do exact matches against this for
      filtering. It's only 1 word, so changing to keyword won't make any
      difference.
      2. `target_branch/source_branch` => these are not used in any searches
      today, so there is no risk in changing the index options. Changing them
      to keyword should give a decent storage improvement, as these can be
      quite long and composed of many words.
      3. `merge_status` => this is not used in any searches today, so there is
      no risk in changing the index options. The values appear to be things like
      `can_be_merged/cannot_be_merged/unchecked`, which implies to me that it
      should be a keyword anyway: splitting this by word would produce wrong
      results if we ever did filter on it, and keyword will save some storage.
      4. `commit.(committer/author).email` => this is used in commit searches
      today and it's hard to know exactly how this might be used by our
      current users. Users will lose some behaviour, though, if they were
      searching for partial email addresses before. For example you can
      [search for `dyl.griffith`](
      https://gitlab.com/search?scope=commits&repository_ref=&search=dyl.griffith&group_id=9970&project_id=278964)
      and you will find commits authored by my email address which starts with
      `dyl.griffith`. After this change to use keyword you'd need to search
      for the entire exact email address, or you could use the prefix search
      `dyl.griffith*` instead. However, since prefix searches (wildcards)
      can only be used at the end of the word, you will not be able to search
      for `griffith` alone after this change.
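
      The search behaviour described in item 4 can be sketched as follows. This
      models only exact matching and a trailing `*` wildcard against a single
      keyword term; it is not the Elasticsearch query parser. The function name
      and the `example.com` address are hypothetical.

```python
def keyword_matches(indexed_value, query):
    """Match a query against one keyword term.

    Assumption: only exact matches and a single trailing `*` are supported,
    mirroring the behaviour described in the commit message.
    """
    if query.endswith("*"):
        return indexed_value.startswith(query[:-1])
    return indexed_value == query


# Hypothetical address; the real one is not given in the commit message.
email = "dyl.griffith@example.com"

print(keyword_matches(email, email))            # full address
print(keyword_matches(email, "dyl.griffith*"))  # prefix search
print(keyword_matches(email, "dyl.griffith"))   # partial, no wildcard
print(keyword_matches(email, "griffith"))       # middle of the value
```

      Under these assumptions only the first two queries match; the partial
      queries without a matching prefix do not.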
  2. 01 Nov, 2020 3 commits
  3. 31 Oct, 2020 10 commits
  4. 30 Oct, 2020 26 commits