info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Efficient `IN` operator queries
This document describes a technique for building efficient ordered database queries with the `IN`
SQL operator and the usage of a GitLab utility module to help apply the technique.
-> Bitmap Index Scan on index_namespaces_on_traversal_ids (cost=0.00..141.55 rows=274 width=0) (actual time=1.897..1.898 rows=265 loops=1)
Index Cond: (traversal_ids @> '{9970}'::integer[])
Buffers: shared hit=91
-> Index Only Scan using index_projects_on_namespace_id_and_id on projects (cost=0.44..2.40 rows=20 width=8) (actual time=0.004..0.006 rows=6 loops=265)
Index Cond: (namespace_id = (namespaces.traversal_ids)[array_length(namespaces.traversal_ids, 1)])
Heap Fetches: 51
Buffers: shared hit=2263
-> Index Scan using index_issues_on_project_id_and_iid on issues (cost=0.57..10.57 rows=544 width=1329) (actual time=0.114..0.484 rows=158 loops=1528)
Index Cond: (project_id = projects.id)
Buffers: shared hit=236524 read=3060
I/O Timings: read=336.879
Planning Time: 7.750 ms
Execution Time: 967.973 ms
(36 rows)
</code></pre>
</details>
The performance of the query depends on the number of rows in the database.
On average, we can say the following:
- Number of groups in the group-hierarchy: less than 1 000
- Number of projects: less than 5 000
- Number of issues: less than 100 000
From the list, it's apparent that the number of `issues` records has
the largest impact on performance.
In normal usage, the number of `issues` records also grows
faster than the number of `namespaces` and `projects` records.
This problem affects most of our group-level features where records are listed
in a specific order, such as group-level issues, merge requests pages, and APIs.
For very large groups the database queries can easily time out, causing HTTP 500 errors.
## Optimizing ordered `IN` queries
In the talk
["How to teach an elephant to dance rock'n'roll"](https://www.youtube.com/watch?v=Ha38lcjVyhQ),
Maxim Boguk demonstrated a technique to optimize a special class of ordered `IN` queries,
such as our ordered group-level queries.
A typical ordered `IN` query may look like this:
```sql
SELECT t.* FROM t
WHERE t.fkey IN (value_set)
ORDER BY t.pkey
LIMIT N;
```
Here's the key insight used in the technique: we need at most `|value_set| + N` record lookups,
rather than retrieving all records satisfying the condition `t.fkey IN value_set` (`|value_set|`
is the number of values in `value_set`).
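The bound can be illustrated with a small in-memory simulation. This is only a sketch of the idea, not the GitLab implementation: the `INDEX` hash is a hypothetical stand-in for a database index on `(fkey, pkey)`, and `ordered_in_query` is an illustrative helper name. Each array access on the hash counts as one record lookup.

```ruby
# Hypothetical in-memory stand-in for an index on (fkey, pkey):
# each fkey value maps to its pkeys in ascending order.
INDEX = {
  1 => [3, 8, 15, 21],
  2 => [1, 9, 10],
  3 => [],
  4 => [2, 4, 30]
}.freeze

# Returns the first `limit` pkeys across `value_set` in ascending order,
# together with the number of index lookups performed.
def ordered_in_query(index, value_set, limit)
  lookups = 0

  # Seed step: one lookup per value fetches the smallest pkey for that value.
  cursors = value_set.filter_map do |fkey|
    lookups += 1
    pkey = index.fetch(fkey, []).first
    { fkey: fkey, pos: 0, pkey: pkey } if pkey
  end

  result = []
  while result.size < limit && cursors.any?
    # Pick the cursor that currently points at the smallest pkey.
    smallest = cursors.min_by { |cursor| cursor[:pkey] }
    result << smallest[:pkey]
    break if result.size == limit

    # Advance only that cursor: one more lookup for the next row of the same fkey.
    lookups += 1
    smallest[:pos] += 1
    next_pkey = index.fetch(smallest[:fkey])[smallest[:pos]]
    if next_pkey
      smallest[:pkey] = next_pkey
    else
      cursors.delete(smallest)
    end
  end

  [result, lookups]
end

rows, lookups = ordered_in_query(INDEX, [1, 2, 3, 4], 5)
```

In this sketch, the seed step costs `|value_set|` lookups and every emitted row except the last costs one more, so the total stays within `|value_set| + N` regardless of how many rows match the `IN` condition.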
We adopted and generalized the technique for use in GitLab by implementing utilities in the
`Gitlab::Pagination::Keyset::InOperatorOptimization` class to facilitate building efficient `IN`
queries.
### Requirements
The technique is not a drop-in replacement for the existing group-level queries using the `IN` operator.
The technique can only optimize `IN` queries that satisfy the following requirements:
- `LIMIT` is present, which usually means that the query is paginated
(offset or keyset pagination).
- The column used with the `IN` query and the columns in the `ORDER BY`
clause are covered with a database index. The columns in the index must be
in the following order: `column_for_the_in_query`, `order by column 1`, and
`order by column 2`.
- The columns in the `ORDER BY` clause are distinct
(the combination of the columns uniquely identifies one particular row in the table).
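To see why the distinct `ORDER BY` requirement matters for paginated queries, consider the following minimal sketch of a keyset cursor over in-memory rows. The `Row` struct, the sample data, and the `paginate` helper are all hypothetical and only illustrate the general behavior; they are not part of the GitLab utilities.

```ruby
Row = Struct.new(:id, :created_at)

# Hypothetical rows; two of them share the same created_at value.
ROWS = [
  Row.new(1, 100),
  Row.new(2, 200),
  Row.new(3, 200),
  Row.new(4, 300)
].freeze

# Keyset pagination: fetch the next page strictly after `cursor`, where the
# cursor and the sort key are both produced by the `key` callable.
def paginate(rows, per_page, key)
  pages = []
  cursor = nil
  loop do
    page = rows
      .select { |row| cursor.nil? || (key.call(row) <=> cursor) == 1 }
      .sort_by { |row| key.call(row) }
      .first(per_page)
    break if page.empty?

    pages << page.map(&:id)
    cursor = key.call(page.last)
  end
  pages
end

# Ordering by created_at alone is not distinct, so a row can be skipped.
non_distinct = paginate(ROWS, 2, ->(row) { [row.created_at] })

# Adding id as a tie-breaker makes the ordering distinct.
distinct = paginate(ROWS, 2, ->(row) { [row.created_at, row.id] })
```

With the non-distinct ordering, the cursor lands on a `created_at` value shared by two rows, and the row that did not fit on the first page is never returned. With the `id` tie-breaker, every row appears exactly once.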
WARNING:
This technique does not improve the performance of `COUNT(*)` queries.
## The `InOperatorOptimization` module
> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/67352) in GitLab 14.3.
The `Gitlab::Pagination::Keyset::InOperatorOptimization` module implements utilities for applying a generalized version of
the efficient `IN` query technique described in the previous section.
To build optimized, ordered `IN` queries that meet [the requirements](#requirements),
use the utility class `QueryBuilder` from the module.
NOTE:
The generic keyset pagination module introduced in the merge request
.from("UNNEST(#{list(order_by_columns.array_aggregated_column_names)}) WITH ORDINALITY AS u(#{list(order_by_columns.original_column_names)}, position)")
order_by_columns.each { |column| q.where(Arel.sql(column.original_column_name).not_eq(nil)) } # ignore rows where all columns are NULL