Fix assertion failure when retrying to migrate a workflow history (e57dd4fb) · Commits · nexedi / erp5

mentioned in commit e0805f9a

@all Due to a bug in ZSQLCatalog, the assertion failure can turn into data corruption, so this commit should be backported to all instances using ERP5 >= 2019-10-04 (the regression was in 1ca655a3).

An attempt to fix ZSQLCatalog (partially) was tried with:

--- a/product/ZSQLCatalog/SQLCatalog.py
+++ b/product/ZSQLCatalog/SQLCatalog.py
@@ -265,6 +265,12 @@ def __getitem__(self, index):
         except ConflictError:
           raise
         except Exception:
+          # XXX: Better than nothing. It is too dangerous to ignore errors
+          #      if there are changes to the ZODB. We already had cases of
+          #      data corruption.
+          conn = getattr(document, '_p_jar', None)
+          if conn is not None and not conn._needs_to_join:
+            raise
           LOG('SQLCatalog', WARNING,
             'Failed to call method %s on %r' % (attribute, document),
             error=True,

Unfortunately, as reported by https://nexedi.erp5.net/test_result_module/20200729-24F92F7A/ ("TypeError: getAcquiredCategoryList() takes exactly 2 arguments (1 given)" everywhere), this would need more work.

how does the data corruption look like ? workflow state getters return wrong result (another state) ? do we have some kind of error raised ?

TypeError: getAcquiredCategoryList() takes exactly 2 arguments (1 given)" everywhere

Yes, def getAcquiredCategoryList(self) in Base and def getAcquiredCategoryList(self, context) in CategoryTool. We already fixed several cases like this of two methods with the same name, but with different signatures and different semantics. pylint detects these problems, so hopefully we are no longer introducing new.

how does the data corruption look like ?

And is there a known way to fix documents where it already happened ?

BTW, this fix (beyond the issue found in tests) will only work for single-ZODB instances, as documents from multiple connectors may be indexed. If this fix direction is kept, this may need to rather introspect the current transation, find registered ZODB connections and check them all. I know this is not the way we want new instances to be, but I believe we have still-maintained instances which do.

And more broadly, if we have the opportunity I would like the effort to be in the direction of getting rid of this broad try..except altogether: "silently" (because I do not believe anyone notices these logs) skipping part of the indexation of a document can be worse than having a visible failure: then catalog query results are inconsistent with ZODB content, and in turn code running these queries will take inconsistent decisions. Which means we need to locate and fix these API inconsistencies, and while they can be hard to fix at least should be very easy to detect.

For information, here is the summary Rafael gave me after the bug discovered in SlapOS master:

Rafael
basically workflow history got corrupted by a bug

Tomo
in zodb ?

Rafael
yes 
in short, there is a conflict whiile migrate the workflow history to new format:

2020-07-06 22:33:33,665 WARNING SQLDict Exception raised when invoking messages (uid, path, method_id) [(4055295444616983670L, ('', 'erp5', 'software_instance_module', '20180214-2F346902'), 'immediateReindexObject'), (4055295444616983671L, ('', 'erp5', 'software_instance_module', '20180213-24B5A97F'), 'immediateReindexObject')]
Traceback (most recent call last):
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/CMFActivity/Activity/SQLBase.py", line 666, in dequeueMessage
    transaction.commit()
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/transaction-1.7.0-py2.7.egg/transaction/_manager.py", line 123, in commit
    return self.get().commit()
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/transaction-1.7.0-py2.7.egg/transaction/_transaction.py", line 280, in commit
    reraise(t, v, tb)
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/transaction-1.7.0-py2.7.egg/transaction/_transaction.py", line 271, in commit
    self._commitResources()
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/Products.TIDStorage-5.5.0-py2.7.egg/Products/TIDStorage/transaction_transaction.py", line 251, in _commitResources
    result = original__commitResources(self, *args, **kw)
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/transaction-1.7.0-py2.7.egg/transaction/_transaction.py", line 416, in _commitResources
    reraise(t, v, tb)
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/transaction-1.7.0-py2.7.egg/transaction/_transaction.py", line 393, in _commitResources
    rm.tpc_vote(self)
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/ZODB-4.4.5-py2.7.egg/ZODB/Connection.py", line 797, in tpc_vote
    s = vote(transaction)
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/ZEO-4.3.1+slapospatched001-py2.7.egg/ZEO/ClientStorage.py", line 1072, in tpc_vote
    return self._check_serials()
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/ZEO-4.3.1+slapospatched001-py2.7.egg/ZEO/ClientStorage.py", line 902, in _check_serials
    raise s
ConflictError: database conflict error (oid 0x09cb6ceb, class Products.ERP5Type.Workflow.WorkflowHistoryList, serial this txn started with 0x03d913a98404f0bb 2020-07-06 22:33:30.942023, serial currently committed 0x03d913a98e78f833 2020-07-06 22:33:33.392001)
while cataloging the document
and it was caused because this warnnig:

2020-07-06 22:33:33,944 WARNING SQLCatalog Failed to call method getCreationDate on <Products.ERP5Catalog.CatalogTool.IndexableObjectWrapper for /erp5/software_instance_module/20180214-2F346902>
Traceback (most recent call last):
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/ZSQLCatalog/SQLCatalog.py", line 264, in __getitem__
    value = value()
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/ERP5Type/Base.py", line 3236, in getCreationDate
    return history_list['edit_workflow'][0]['time']
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/ERP5Type/ConflictFree.py", line 104, in __getitem__
    return next(islice(self, start, None))
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/ERP5Type/ConflictFree.py", line 57, in __iter__
    bucket = self._next
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/ERP5Type/patches/WorkflowTool.py", line 745, in _next
    self._migrate()
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/ERP5Type/patches/WorkflowTool.py", line 762, in _migrate
    assert bucket.__class__ is WorkflowHistoryList, bucket.__class__
AssertionError: <class 'Products.ERP5Type.Workflow.WorkflowHistoryList'>
in general this a warnning and not an error
so zodb isn't roll back on this thread, causing migration fail, corrupting the history
so, only happens when conflicts in this way
the fix is:
https://lab.nexedi.com/nexedi/erp5/commit/e57dd4fb090387ad546ee9f325cb9c9c0b841331
to the next documents
I follow up the suggestion of julien and recreated the corrupted workflow history 
first using:
"Base_detectAndFixWorkflowHistory"
(I added on production)
and considering I wouldn't know who was corrupted or not (consider that the bug only shows up when a buffer is filled)
so it is a lazy problem, some problems are from 7/07 and only raised now
so I wrote this:

root@sd-148139:/srv/slapgrid/slappart0/srv/runner/instance/slappart4/software_release/parts/erp5/product# git diff
diff --git a/product/ERP5Type/patches/WorkflowTool.py b/product/ERP5Type/patches/WorkflowTool.py
index 366c57107a..b91847b125 100644
--- a/product/ERP5Type/patches/WorkflowTool.py
+++ b/product/ERP5Type/patches/WorkflowTool.py
@@ -759,6 +759,7 @@ class WorkflowHistoryList(NewWorkflowHistoryList):
     stack = [self]
     while True:
       stack.append(bucket)
+      bucket._p_activate()
       assert bucket.__class__ is WorkflowHistoryList, bucket.__class__
       bucket.__class__ = NewWorkflowHistoryList
       bucket = bucket._prev
@@ -792,6 +793,14 @@ def WorkflowTool_setStatusOf(self, wf_id, ob, status):
             if wfh is not None and not isinstance(wfh, NewWorkflowHistoryList):
                 wfh = NewWorkflowHistoryList(wfh)
                 ob.workflow_history[wf_id] = wfh
+            # Hackish fixup
+            elif wfh is not None and isinstance(wfh, NewWorkflowHistoryList):
+              if wfh._prev is not None and wfh._next is None:
+                wfh = list(reversed(wfh))
+                wfh.reverse()
+                wfh = NewWorkflowHistoryList(wfh)
+                ob.workflow_history[wf_id] = wfh
+            #    LOG('WorkflowTool', WARNING, 'Bad workflow migrated %s %s' % (wf_id, ob.getUid())
     if wfh is None:
         wfh = NewWorkflowHistoryList()
         if not has_history:
in general, I fix when it before it gets appended
then the problem is gone ```

ConflictError: ... while cataloging the document and it was caused because this warnnig

No. The explanation is wrong.

First, there was a normal migration and the transaction got aborted for a reason that does not matter (the fact that at this moment, it was a ConflictError during reindexing is irrelevant).

Then, while reindexing, there was a second attempt to migrate the workflow history and that's where things started to go wrong, because on transaction abort, ZODB.Connection did not restore the class as it is actually stored (more generally, as written in WorkflowHistoryList.setstate, ZODB.Connection never fixes the class when loading an object). This led to an assertion failure (unexpected new class), whereas the workflow history was partially migrated. By ignoring the exception, ZSQLCatalog forced the transaction to commit with data that are neither in old format nor in new format (somewhere between and not usable by the code).

(consider that the bug only shows up when a buffer is filled)

( s/buffer/bucket/ )

Not only. This corruption also causes bugs immediately without raising exceptions (e.g. wrong __len__).

Thanks Julien to fix up the explanation.

Base_detectAndFixWorkflowHistory

This script is not available in this repository.

With the new posts, I believe the answers to the questions I and Jérôme asked are:

how does the data corruption look like ? workflow state getters return wrong result (another state) ? do we have some kind of error raised ?

AsserttionError gets raised from a type-checking statement, at least when migration is retried. But is it also the case on non-migration future calls, if there are such things ? Or will it always trigger a new migration attempt, which would always detect the issue and react with such AssertionError ?

Also, if this migration is triggered when just reading documents (ex: calling getCreationDate on a document), then it means that not all corrupted documents are visible in the catalog. For those which are visible in the catalog, it would mean they have NULL in creation_date, modification_date, and *_state columns.

It would be nice to have a clear confirmation of this, rather than being left to guess what we should be looking for.

And is there a known way to fix documents where it already happened ?

Yes, but not published yet. The closest we have so far is the "Hackish fixup" from Thomas' quote of Rafael message.

It would be nice to share the tools in case others uncover other occurrences of the same issue, rather than leaving them to duplicate debugging work.

Is this an accurate summary of the current situation ?

There's little to see in the WorkflowHistoryList class and its __setstate__ method:

if the workflow history is migrated, there isn't anything overloading DoublyLinkList to trigger automatic migration;
else a legacy class is used, with a few overloading attributes migrating automatically if there's any benefit doing so. The type-checking assertion is in the migration code.

The involved code is small. The important lines highlighted. Immediate consequences described. Further consequences can be varied and you have enough information to guess them as well as me.

I also didn't think about ways to find all corrupted workflow histories.

The issue only affects workflow histories with more than 16 entries.

Also, if this migration is triggered when just reading documents (ex: calling getCreationDate on a document), then it means that not all corrupted documents are visible in the catalog. For those which are visible in the catalog, it would mean they have NULL in creation_date, modification_date, and *_state columns.

Not sure all columns would be NULL. Maybe only the first computed one. For other columns, the workflow history is only using the new code and because the data aren't completely migrated, it can either return correct values (e.g. ...[-1]) or wrong values (e.g. ...[0]). I don't see further exceptions on read operations (only when adding items).

mentioned in commit c7fa2ba9

mentioned in commit ca50a17d

mentioned in commit 40bca2d2

@@ -173,6 +173,8 @@ class TestWorkflowHistoryList(TestCase):
     self.checkClass(whl, WorkflowHistoryList)
     self.abort()
     self.assertEqual(len(whl), COUNT)
     self.abort()
     self.checkClass(whl, LegacyWorkflowHistoryList)
     self.assertEqual(len(whl), COUNT)
     self.checkClass(whl, WorkflowHistoryList)
-...