Products.CMFActivity.ActivityTool: Improve behaviour on single-node instances.
- Ignore node preference when spawning activities. Otherwise, activities which are not spawned with a preferred node get an effective priority penalty compared to same-priority activities spawned *with* a node preference, despite both being executed by the same processing node.
- Break the activity processing loop when the current processing node is also the activity validation node. This avoids pathological cases of activity accumulation, for example when reindexing an entire site: _recurseCallMethod is spawned in processing_node=0, but immediateReindexObject is spawned in processing_node=-1 because of its serialization_tag dependency, so with such a loop _recurseCallMethod is executed over and over, piling indexation activities up until _recurseCallMethod stops respawning itself. In turn, such activity accumulation leads to increased overhead and decreased activity processing efficiency.
This may also allow multi-node instances to more reliably use the validation node as a processing node.
The cost of these changes for multi-node instances should be minimal (no extra IO necessary, minimal extra code).
A possible drawback on single-node instances is that the tic period may become more important, because process_timer will return more often.
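The accumulation scenario described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name, the FIFO queue, and the batch size are hypothetical and do not come from CMFActivity, whose real scheduling also involves priorities and SQL tables.

```python
from collections import deque


def simulate(documents, break_for_validation, batch=10):
    """Toy model of one node that both validates and processes activities.

    Hypothetical names throughout; this is not CMFActivity code.
    Returns the largest number of not-yet-validated indexation
    activities that piled up at any point.
    """
    ready = deque(["recurse"])  # validated activities (like processing_node=0)
    pending = 0                 # spawned, awaiting validation (like processing_node=-1)
    remaining = documents
    peak = 0
    while ready or pending:
        # Validation step: hand every pending activity over to the node.
        ready.extend(["index"] * pending)
        pending = 0
        # Processing step.
        while ready:
            activity = ready.popleft()
            if activity == "recurse":
                handled = min(batch, remaining)
                remaining -= handled
                pending += handled  # indexation must wait for validation
                if remaining:
                    ready.append("recurse")  # self-respawn, directly processable
            peak = max(peak, pending)
            if break_for_validation:
                break  # return to validation after each activity
    return peak
```

With 100 documents, looping without a break lets all 100 indexation activities pile up unvalidated, while breaking after each activity keeps the backlog at one batch.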
-
Developer
It seems this commit breaks our tests: there are a lot of pending immediateReindexObject activities. @vpelletier, can you please check?
-
Owner
Thanks for the notification, this should be fixed by 4eb26017: I had not noticed that tests monkey-patch the tic method I modified here.
-
Owner
...and one more fix, this time for some UI tests which now require more tic calls: 208b11ad.
Hopefully this should be the last fix needed to stabilise the tests, as only testFunctionalCore and erp5_accounting_renderjs_ui_test:testFunctionalRJSAccountingReport were failing...
-
Owner
@vpelletier it seems the accounting report tests are still failing. I didn't investigate yet why this script starts failing after your change.
Maybe @jerome has an idea?
-
Owner
There are at least 3 issues behind these failures:
- since 49b9612b, the Preference owner is changed, but not which user has the Owner local role, which is inconsistent in itself. I am fixing this.
- Preference_disableOtherPreferences searches for preferences by owner but then checks that the current user has the Owner local role. It should rather search by viewable_owner, as this is based on the role and not document ownership, and if for whatever reason the owner does not have view permission for their Owner role, it would already not be listed in the owner-based catalog result. I am fixing this.
- ...that the above points have exactly nothing to do with my CMFActivity change. So there is more here than these issues. So I should keep looking before applying the above fixes to see what else I am missing.
[EDIT] I pushed the early fixes (untested for now) in vpelletier/erp5/fix_preference_ownership_checks.
-
Owner
And another bug in the bug in the bug: the ownership change does not reindex the modified document, so the tests behave differently before and after the first time that preference gets reindexed.
-
Owner
Ahah, this may be the explanation of all current errors in erp5 tests
-
Owner
I pushed the first 2 fixes (the BT dependency fix is unnecessary and I am getting some failures with it). I'm still trying to understand how my CMFActivity change caused all this to surface.
-
Owner
...yet another fix for code which relied on a limited number of tic calls to be enough: 1c3ff213 .
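A test helper that waits on a deadline rather than on a fixed number of tic calls could look roughly like the sketch below. The helper name and its `tic` / `has_pending` callables are hypothetical stand-ins supplied by the test, not the actual ERP5 test API.

```python
import time


def tic_until_idle(tic, has_pending, timeout=60.0, clock=time.monotonic):
    """Call tic() until has_pending() is false or `timeout` seconds elapse.

    `tic` and `has_pending` are hypothetical callables provided by the
    caller. Returns the number of tic calls made.
    """
    deadline = clock() + timeout
    calls = 0
    while has_pending():
        if clock() > deadline:
            raise RuntimeError(
                "activities still pending after %.1fs" % timeout)
        tic()
        calls += 1
    return calls
```

The point of the design is that the test no longer encodes how many tic calls the scheduler needs, only how long it is willing to wait.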
I am looking at testERP5Security failures, which seem to happen because stop_condition is being invoked many more times for a given activity queue (so the test discovers that there appears to be a hole when no user can log in - which is what the test was trying to verify). This also causes a feedback loop: trying to create a Person during stop_condition raises an exception without rollback, so activities get queued, so there is one more tic loop, so one more stop_condition call, etc. I do not have a fix for this ready yet.
I feel that these failures have been going on for too long, and while the issues discovered are generally real issues, it could be better to revert the CMFActivity change (and the corresponding tic API change fixup in one test), keep the other fixes in, and debug in a separate test runner. If I cannot get decent test scores this evening, I intend to do so, so that next week starts on a clean test result (even though we now know there are bugs hidden by the old single-node tic scheduling).
-
Owner
I am more and more tempted to revert this (and the related fixups, but not the test fixes nor the tic-count-to-deadline changes).
Here is the breakdown of remaining failing tests:
- erp5_workflow_test:testWorkflowAndDCWorkflow: UID change (document changed between the moment the activity was spawned and the moment it was executed)
- erp5_core_test:testRestrictedPythonSecurity: this one is surprising, the CRYPT password hashing scheme is causing the test to fail. I suspect this is unrelated to my change as this is not an ERP5 test, but a zope test.
- erp5_archive:testArchive: takes longer than before (over 3 times longer than before)
- testUpgradeInstanceWithOldDataFs: (likewise)
- erp5_configurator_standard:testStandardConfigurationWorkflow: (likewise)
- erp5_officejs_support_request_ui_test:testFunctionalSupportRequest: random UI test failure ? (timeout on non-activity-related selenium step)
- erp5_officejs_ui_test:testFunctionalOfficeJSoOoSpreadsheet: random UI test failure ? (ERP5JS traceback on non-activity-related selenium step)
So overall:
- 4 maybe-issues, 3 likely unrelated to this commit
- 3 failures just because the tests take a lot longer than before to execute activities (because instead of looping deep inside CMFActivity, it is the test tic loop which iterates more than before - so more code needs to run for each activity execution)
@jerome reported a test installing an instance taking 5x longer than before to complete. This could be improved by modifying process_timer to loop instead of returning control to timerserver, but given the much-stronger-than-expected performance hit that even test tic is getting, I am not very optimistic about the final performances.
So while I like that tests could detect a lot of actual bugs (missing activity dependencies, or accidentally-strong expectations of some tests on how the activities are processed), given the overall performance impact on high-saturation uses (like creating a new ERP5 instance and running tests) and the ease of deploying a multi-zopes development instance, I now feel this change is not the best solution. Maybe we can decide that single-zope setups are not a supported setup for anything beyond environments with only a handful of documents ?
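The idea of having process_timer loop instead of returning control after every activity could be sketched as below. This is a hypothetical illustration: `distribute` and `process_one` are stand-in callables, not ActivityTool methods, and the time budget is an assumption about how such a loop might be bounded.

```python
import time


def process_timer_sketch(distribute, process_one, budget=1.0,
                         clock=time.monotonic):
    """Alternate validation and processing until idle or out of budget.

    distribute() validates pending activities; process_one() executes a
    single validated activity and returns False when none is available.
    Both are hypothetical callables. Returns the number of activities run.
    """
    deadline = clock() + budget
    executed = 0
    while clock() < deadline:
        distribute()           # validation runs before every activity
        if not process_one():  # nothing left to execute: go idle
            break
        executed += 1
    return executed
```

Validation still happens between activities, but the cost of going back through the timer on each iteration is avoided.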
-
-
mentioned in commit 340acac6
-
Owner
Revert pushed: 340acac6 .
-
Owner
> erp5_core_test:testRestrictedPythonSecurity: this one is surprising, the CRYPT password hashing scheme is causing the test to fail. I suspect this is unrelated to my change as this is not an ERP5 test, but a zope test.
Yes, this one is surprising. It might not be related. I also saw it starting to fail recently; it might be because of something else, maybe when we updated test machines to debian 11. I don't see what recent changes in erp5 or slapos might cause this.
-
-
Owner
> maybe when we updated test machines to debian 11
This would be my guess, indeed. This, or maybe an updated egg responsible for registering this password hashing scheme to the registry (or otherwise messing with the registry).
crypt is DES-based, limited to 8-character passwords with only a 12-bit salt, ... Deprecating it would be a good thing. Having it as a fallback for any existing crypt-encoded password is nice, but AFAIK ERP5 has always used SSHA so far so it should not be necessary in any ERP5 instance.
EDIT: Just a random duckduckgo find: RIPE deprecated crypt-pw in its own password database in 2006~2007
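For contrast with crypt, the salted SHA-1 layout commonly written as {SSHA} can be sketched as follows. This is a generic illustration with hypothetical helper names and an assumed 8-byte salt; it is not the AuthEncoding or ERP5 implementation, whose salt length and base64 variant may differ.

```python
import base64
import hashlib
import hmac
import os

PREFIX = b"{SSHA}"


def ssha_encode(password, salt=None):
    """Return PREFIX + base64(sha1(password + salt) + salt).

    `password` is bytes; the 8-byte random salt is an assumption.
    """
    if salt is None:
        salt = os.urandom(8)
    digest = hashlib.sha1(password + salt).digest()
    return PREFIX + base64.b64encode(digest + salt)


def ssha_verify(password, encoded):
    """Check `password` against a value produced by ssha_encode."""
    raw = base64.b64decode(encoded[len(PREFIX):])
    digest, salt = raw[:20], raw[20:]  # a SHA-1 digest is 20 bytes long
    candidate = hashlib.sha1(password + salt).digest()
    return hmac.compare_digest(candidate, digest)
```

Because the salt is random, encoding the same password twice gives different strings, which is what makes precomputed lookup tables impractical.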
-
Owner
it seems upstream also disabled the test https://github.com/zopefoundation/AuthEncoding/pull/11#issuecomment-815020045
-
Owner
Let's remove crypt test !1530 (merged)