CMFActivity.Activity.SQLBase: Reduce the number of deadlocks (!1491) · Merge Requests · nexedi / erp5

CMFActivity.Activity.SQLBase: Reduce the number of deadlocks

MariaDB seems to use an inconsistent lock acquisition order when
executing the activity reservation queries. As a consequence, it produces
internal deadlocks, which it detects. Upon detection, it kills one of the
involved queries, which causes message reservation to fail despite the
presence of executable activities.

To avoid depending on MariaDB's internal lock acquisition order, acquire an
explicit table-scoped lock before running the activity reservation queries.
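
As an illustration only, the explicit lock could look roughly like the sketch below. It assumes MariaDB's `GET_LOCK()` / `RELEASE_LOCK()` named-lock functions as the locking primitive, which may not be what this change actually uses; `reservation_lock`, the `query` callable and the `reserve.<table>` lock name are hypothetical names chosen for the example.

```python
from contextlib import contextmanager

@contextmanager
def reservation_lock(query, table, timeout=1):
    """Hold a named lock scoped to `table` while the body runs.

    `query` is a hypothetical callable: it executes one SQL statement and
    returns the rows as a list of tuples. `table` must be a trusted,
    internal table name, since it is interpolated into the SQL text.
    """
    lock_name = 'reserve.' + table  # hypothetical naming scheme
    # GET_LOCK returns 1 on success, 0 on timeout, NULL on error.
    rows = query("SELECT GET_LOCK('%s', %d)" % (lock_name, timeout))
    acquired = bool(rows) and rows[0][0] == 1
    try:
        # The caller decides what to do when the lock was not obtained.
        yield acquired
    finally:
        if acquired:
            query("SELECT RELEASE_LOCK('%s')" % lock_name)
```

A caller would wrap the reservation queries in `with reservation_lock(query, table) as acquired:` and skip this reservation round when `acquired` is false, so that a node waits at most `timeout` seconds instead of piling onto a contended table.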

On an otherwise-idle cluster of 31 processing nodes, with the following
activities spawned to stress the activity reservation queries
(many ultra-short activities being executed one at a time):

```python
active_getTitle = context.getPortalObject().portal_catalog.activate(
  activity='SQLQueue',
  priority=5,
  tag='foo',
).getTitle
for _ in xrange(40000):
  active_getTitle()
```

the results are:
- a 26% shorter activity execution time: from 206s with the original code
  to 152s
- a 100% reduction in reported deadlocks: from 300 with the original code
  to 0
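
For reference, these figures can be rechecked with a few lines of arithmetic. The throughput values are derived here from the 40000 activities and the two durations; they were not measured separately.

```python
activity_count = 40000
before_s, after_s = 206.0, 152.0  # total execution time, in seconds

# Relative reduction in execution time.
time_reduction = (before_s - after_s) / before_s

# Derived average throughput, in activities per second.
throughput_before = activity_count / before_s
throughput_after = activity_count / after_s

print('%.1f%% shorter' % (time_reduction * 100))
print('%.0f -> %.0f activities/s' % (throughput_before, throughput_after))
```

This works out to roughly 194 activities per second before the change and 263 after it.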

There is room for further improvements at a later time:
- tweaking the amount of time spent waiting for this new lock to become
  available, currently set to 1s
- possibly bypassing this lock altogether when too few processing nodes
  are simultaneously enabled, or even adaptively, in reaction to
  deadlock errors actually happening
- covering more write accesses to these tables with the same lock

Data from a production environment suggests that the `getReservedMessageList`
method alone is involved in 95% of these deadlocks, so for now this change
only targets this method.

/cc @jm @georgios.dagkakis
