Commit e4273c58 authored by Vincent Pelletier

CMFActivity: Simplify validation queries further.

The query planner does not seem to notice that we only want to know
whether any row exists matching each of a set of dependency values, and
it keeps scanning multiple rows for each value, which is unproductive.
So split dependency queries from (pseudo-code)
  WHERE <column{,s}> IN <values{, pairs}>
to unions of
  WHERE <column{,s}> = <value{, pair}> LIMIT 1
which produces query plans that stop as soon as a candidate row is
found.
On a serialization_tag query with 40 values and real-world indexations,
this reduces the number of rows scanned by MariaDB from 500 (<10%
efficiency) to 40 (100% efficiency).
The produced SQL is significantly larger (~3x, around 500kB on
real-world sample data, but may vary a lot depending on value length),
but if this has any effect, it is more than compensated by the improved
query plan efficiency.
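
As a rough illustration of the two query shapes described above, here is a minimal single-column sketch. It is not the CMFActivity code: `quote`, `in_query`, `limit_query` and the `ignored_state` parameter are hypothetical names, and the quoting helper is deliberately naive.

```python
def quote(value):
    """Naive SQL string quoting, for illustration only (hypothetical helper)."""
    return "'" + value.replace("'", "''") + "'"


def in_query(table_name_list, column, value_list, ignored_state):
    """Old shape: one SELECT DISTINCT per table, with an IN list."""
    condition = '%s IN (%s)' % (
        column,
        ','.join(quote(value) for value in value_list),
    )
    return ' UNION '.join(
        '(SELECT DISTINCT %s FROM %s WHERE processing_node > %i AND (%s))' % (
            column, table_name, ignored_state, condition,
        )
        for table_name in table_name_list
    )


def limit_query(table_name_list, column, value_list, ignored_state):
    """New shape: one SELECT ... LIMIT 1 per (table, value) pair."""
    return ' UNION '.join(
        '(SELECT %s FROM %s WHERE processing_node > %i AND (%s = %s) LIMIT 1)' % (
            column, table_name, ignored_state, column, quote(value),
        )
        for table_name in table_name_list
        for value in value_list
    )


if __name__ == '__main__':
    tables = ['message', 'message_queue']
    tags = ['foo', 'bar']
    # -3 is an arbitrary placeholder, not the real constant value.
    print(in_query(tables, 'serialization_tag', tags, -3))
    print(limit_query(tables, 'serialization_tag', tags, -3))
```

The second form emits one subquery per table and value, which is why the generated SQL grows with the number of values, as noted above.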
parent 25085d7b
Pipeline #13576 failed with stage
in 0 seconds
@@ -550,37 +550,26 @@ CREATE TABLE %s (
       # No more non-blocked message for this dependency, skip it.
       continue
     column_list, to_sql = dependency_tester_dict[dependency_name]
-    if len(column_list) == 1:
-      row2key = _ITEMGETTER0
-      dependency_sql = to_sql(dependency_value_dict.keys(), quote)
-    else:
-      row2key = _IDENTITY
-      # XXX: generated SQL could be simpler: for example, a dependency input
-      # as
-      # ('foo', ('bar', 'baz'))
-      # will become
-      # (... = 'foo' AND ... = 'bar') OR (... = 'foo' AND ... = 'baz')
-      # This is the correct condition, but it could be expressed with shorter
-      # SQL. But I'm not sure this makes much of a difference for the query
-      # planner, it would likely increase the complexity here a lot, and
-      # anyway these multi-column dependencies should rather be replaced with
-      # tags (as it often possible and produces better overall activity
-      # behaviour).
-      dependency_sql = ' OR '.join(
-        '(' + to_sql(dependency_value, quote) + ')'
-        for dependency_value in dependency_value_dict
-      )
-    base_sql_prefix = '(SELECT DISTINCT %s FROM ' % (
-      ','.join(column_list),
+    row2key = (
+      _ITEMGETTER0
+      if len(column_list) == 1 else
+      _IDENTITY
     )
-    base_sql_suffix = ' WHERE processing_node > %i AND (%s))' % (
+    base_sql_suffix = ' WHERE processing_node > %i AND (%%s) LIMIT 1)' % (
       DEPENDENCY_IGNORED_ERROR_STATE,
-      dependency_sql,
     )
+    sql_suffix_list = [
+      base_sql_suffix % to_sql(dependency_value, quote)
+      for dependency_value in dependency_value_dict
+    ]
+    base_sql_prefix = '(SELECT %s FROM ' % (
+      ','.join(column_list),
+    )
     for row in db.query(
       ' UNION '.join(
-        base_sql_prefix + table_name + base_sql_suffix
+        base_sql_prefix + table_name + sql_suffix
         for table_name in table_name_list
+        for sql_suffix in sql_suffix_list
       ),
       max_rows=0,
     )[1]:
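
The change relies on both query shapes returning the same set of blocking dependency values. The following sketch checks that on a toy table using Python's `sqlite3` module rather than MariaDB. The table layout, the sample rows and the `IGNORED` threshold are assumptions made up for this example. SQLite does not accept a `LIMIT` directly on a member of a compound `SELECT`, so each `LIMIT 1` subquery is wrapped in `SELECT * FROM (...)`, which differs from the SQL generated by the diff.

```python
import sqlite3

IGNORED = -3  # hypothetical stand-in for DEPENDENCY_IGNORED_ERROR_STATE

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE message ('
    'uid INTEGER PRIMARY KEY, processing_node INTEGER, serialization_tag TEXT)'
)
conn.executemany(
    'INSERT INTO message (processing_node, serialization_tag) VALUES (?, ?)',
    [(-1, 'a'), (-1, 'a'), (0, 'a'), (-1, 'b'), (-5, 'c')],
)
values = ['a', 'b', 'c', 'd']

# Old shape: a single SELECT DISTINCT with an IN list.
in_sql = (
    'SELECT DISTINCT serialization_tag FROM message '
    'WHERE processing_node > ? AND serialization_tag IN (%s)'
) % ','.join('?' * len(values))
in_result = {row[0] for row in conn.execute(in_sql, [IGNORED] + values)}

# New shape: one LIMIT 1 subquery per value, combined with UNION.
member = (
    'SELECT * FROM (SELECT serialization_tag FROM message '
    'WHERE processing_node > ? AND serialization_tag = ? LIMIT 1)'
)
union_sql = ' UNION '.join([member] * len(values))
params = []
for value in values:
    params += [IGNORED, value]
union_result = {row[0] for row in conn.execute(union_sql, params)}

# Both shapes report the same set of tags that still have a blocking row.
assert in_result == union_result
```

This only compares results on one small data set. It says nothing about query plans, which is where the commit message reports the actual gain.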
......