CMFActivity.Activity.SQLBase: Reduce the number of deadlocks (!1491) · Merge requests · nexedi / erp5

CMFActivity.Activity.SQLBase: Reduce the number of deadlocks

MariaDB seems to use an inconsistent lock acquisition order when executing the activity reservation queries. As a consequence, it produces internal deadlocks, which it detects. Upon detection, it kills one of the queries involved, which causes message reservation to fail despite the presence of executable activities.

To avoid depending on MariaDB's internal lock acquisition order, acquire an explicit table-scoped lock before running the activity reservation queries.
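
As an illustration of the idea only, and not the code in this merge request, here is a minimal sketch that assumes the lock is a named advisory lock taken with MariaDB's GET_LOCK and released with RELEASE_LOCK. The helper name reservation_lock, the LockTimeout exception, the lock naming scheme and the query callable are all hypothetical:

```python
from contextlib import contextmanager


class LockTimeout(Exception):
    """Raised when the named lock could not be obtained in time."""


@contextmanager
def reservation_lock(query, table, timeout=1):
    # `query` is any callable that runs one SQL statement and returns its
    # rows as a sequence of tuples (hypothetical interface).
    # `table` must be a trusted identifier: it is interpolated as-is.
    lock_name = 'reservation.' + table  # hypothetical naming scheme
    rows = query("SELECT GET_LOCK('%s', %d)" % (lock_name, timeout))
    # GET_LOCK returns 1 on success, 0 on timeout and NULL on error.
    if rows[0][0] != 1:
        raise LockTimeout(lock_name)
    try:
        yield
    finally:
        # Always release, even if the reservation queries raised.
        query("SELECT RELEASE_LOCK('%s')" % lock_name)


# Usage with a stand-in for a real database connection.
executed = []


def fake_query(sql):
    executed.append(sql)
    return [(1,)]  # pretend the lock was granted


with reservation_lock(fake_query, 'message_queue'):
    executed.append('-- reservation queries run here')
```

With such a helper, every node serializes on one lock per table before touching any rows, so the order in which MariaDB takes its internal row and index locks no longer matters between reservation queries.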

On an otherwise idle cluster of 31 processing nodes, with the following activities spawned to stress the activity reservation queries (many ultra-short activities, executed one at a time):

active_getTitle = context.getPortalObject().portal_catalog.activate(
  activity='SQLQueue',
  priority=5,
  tag='foo',
).getTitle
for _ in xrange(40000):
  active_getTitle()

the results are:

a 26% shorter activity execution time: from 206s with the original code to 152s
a 100% reduction in reported deadlocks: from 300 with the original code to 0
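
For reference, the 26% figure and the throughput it implies can be recomputed from the numbers above. This is plain arithmetic on the reported measurements, and the variable names are only illustrative:

```python
activities = 40000                  # activities spawned by the test script
before_s, after_s = 206.0, 152.0    # reported execution times, in seconds

# Relative reduction in total execution time.
time_reduction = (before_s - after_s) / before_s

# Average number of activities executed per second across the cluster.
throughput_before = activities / before_s
throughput_after = activities / after_s

print('time reduction: %.1f%%' % (100 * time_reduction))
print('throughput: %.0f -> %.0f activities/s'
      % (throughput_before, throughput_after))
```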

There is room for further improvements at a later time:

tweaking the amount of time spent waiting for this new lock to become available, currently set to 1s.
possibly bypassing this lock altogether when too few processing nodes are enabled simultaneously, or even adaptively, in reaction to deadlock errors actually happening.
covering more write accesses to these tables with the same lock.

Data from a production environment suggests that the getReservedMessageList method alone is involved in 95% of these deadlocks, so for now this change only targets this method.

/cc @jm @georgios.dagkakis