CMFActivity: Stop removing duplicates at validation time, and stop using portal_ids to allocate activity uids
The 2 commits in this merge request are only related in that they touch CMFActivity.
The goals of this merge request are, in decreasing order:
- To publicise these changes now. We intend to use them in production, so others may be interested, or may spot issues we would not discover until later.
- To keep track of what was already discussed on their topic, and to allow continuing discussion.
- And, as any merge request, to eventually get merged into master.
Not deleting duplicate activities during validation
This commit is mainly for consistency and somewhat for performance reasons.
The condition used in the code to determine activity uniqueness is insufficient. For example, it ignores the serialization_tag value. Deleting activities which are not really identical is wrong (it violates the dependency contract). So there are 2 solutions: either fix duplicate detection, or disable deletion at validation.
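To make the problem concrete, here is an illustrative sketch (not actual CMFActivity code, and the column names are hypothetical): a simplified duplicate-detection key that compares some activity attributes but ignores serialization_tag.

```python
# Illustrative sketch only - a hypothetical, simplified duplicate key.
# The point: any key that omits a semantically relevant attribute
# (here, serialization_tag) will merge activities that are not identical.
def dedup_key(activity):
    return (activity['path'], activity['method_id'])

a = {'path': 'foo/bar', 'method_id': 'recompute', 'serialization_tag': 'tag_A'}
b = {'path': 'foo/bar', 'method_id': 'recompute', 'serialization_tag': 'tag_B'}

# Both map to the same key, so one would be deleted as a "duplicate",
# even though their serialization_tag values differ: a contract violation.
assert dedup_key(a) == dedup_key(b)
assert a['serialization_tag'] != b['serialization_tag']
```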
Fixing duplicate detection would mean loading and introspecting more of each activity than is already done. This happens during activity validation, which is by definition one of the rare pieces of code that cannot be parallelised, so increasing the amount of work done for every validated activity does not seem wise.
So what about disabling duplicate deletion?
Duplicate activities are currently deleted in 3 places:
- by the transaction creating the activities
- by the validation node when validating pending activities
- by the processing node when picking which activities it should reserve (and then execute a subset of these, based on unicity)
So removing deduplication at the validation step does not mean we will necessarily execute more duplicates: they can still be caught later. Moreover, SQLDict never guarantees duplicate work avoidance; it only tries to make duplicate work as rare as reasonably possible. It would not be reasonable to spend more work removing duplicate activities than executing the duplicates would cost in the first place. And such duplicates should normally be rather rare anyway: it is better (closer to optimal) to avoid creating them in the first place than to rely only on CMFActivity to remove them later. So removing deduplication at validation seems reasonable.
Is it enough?
No.
Similar duplicate-detection code is used in all 3 places listed above. In the 3rd one it actually does not matter, because no dependency can be violated anymore: all activities were validated, so by definition all dependencies are already satisfied.
But the code removing duplicates at insertion time still has this bug, so not-quite-duplicate activities created in a single transaction would still be subject to undue deduplication. I believe this belongs in a separate patch, as in that case we will likely want to actually fix the deduplication rather than get rid of it.
Not using portal_ids
Because portal_ids uses an SQL table to persist id generation state, it poses a significant challenge when restoring backups: the restored table will be desynchronised from the ZODB. So any id generation which can avoid relying on it should do so.
In the case of CMFActivity, all that is fundamentally required is that no new activity uses the uid of an activity already present in the table. It is not a problem to recycle the uid of a past activity. So the id generation does not fundamentally need persistent state tracking how many activities were ever generated; it only needs to consider the current content of the activity tables.
Here, the proposed solution is to produce uids using a cheap random number generator (there is no obvious reason to require cryptographic-grade randomness for such internal identifiers: we would already be in trouble if an unexpected user could act on pending activities, so uid knowledge should not change the threat model much), and to increase the uid space from 32 to 64 bits, to accommodate millions of pending activities while keeping the collision probability low. As the activity tables of a healthy site always tend towards 0 activities, random uid generation is essentially bounded in its collision probability.
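A minimal sketch of this idea (assumptions: a 63-bit space so uids fit a signed 64-bit SQL column, and an in-memory set standing in for the real check against the activity tables), together with a birthday-bound estimate of the collision probability:

```python
import random

UID_BITS = 63  # assumption: keep uids within a signed 64-bit SQL integer

def generate_activity_uid(existing_uids):
    """Pick a random uid not already used by a pending activity.

    Sketch only: the real implementation would check the SQL activity
    tables rather than an in-memory set, retrying on the rare collision.
    """
    while True:
        uid = random.getrandbits(UID_BITS)
        if uid not in existing_uids:
            return uid

# Birthday-bound estimate: with n pending activities drawn from a space
# of 2**63 uids, the collision probability is roughly n**2 / (2 * 2**63).
n = 10**6  # one million pending activities
p = n * n / (2.0 * 2**63)
print(p)  # on the order of 5e-8
```

Even with a million pending activities, collisions are so improbable that the retry loop is effectively free, which is why the state kept in portal_ids is unnecessary here.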
An alternative solution could be to rely on SQL AUTO_INCREMENT columns, which have apparently been improved (performance-wise) since the last time ERP5 relied on them; this would require benchmarking to confirm. One thing which does not seduce me about this solution is that we would then depend on the SQL server implementation and have to worry about its internal aspects (does it play nice with replication? with backups? what if the locking improvements regress later, or are absent in another implementation we might want to switch to?). Its advantage is that it produces "nice" sequential uids as we are used to, although I am personally not attached to that at all, and prefer being in control of our own generation.
/cc @nexedi