WIP: Fix/rapid cdn promise relax (!1605) · Merge requests · nexedi / slapos

Blocker: It is a real problem that the promises for slave instance preparation are failing, as they indicate that partitions need to be reprocessed until everything is correctly set up. Tests on this branch are failing, simply exposing the real problem.

Attention: Do not simply silence the promises, as that will lead to problems. One has to rethink how to react to the promise state, and when a failing promise should be reported as a problem. Working on silencing tickets on master is a NOGO.

Outcome: The promises promise-key-download-url-ready.py and publish-failsafe-error.py shall have a grace period, so that on a real cluster they do not react too fast. Distributing the information about a slave generally requires a lot of processing on each partition, and with a high number of slaves this can take quite some time (up to 2 hours). The idea is that such a promise shall be allowed to fail up to 5 times before an anomaly is detected, lowering the number of tickets generated on live clusters after adding a slave.
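The grace period idea can be sketched in plain Python. This is only an illustration and not the slapos.toolbox implementation: `GracePeriodTracker` and `classify_runs` are hypothetical names, and the sketch assumes that "allowed to fail up to 5 times" means five consecutive failures are tolerated and the sixth is flagged. The real promise framework may count failures differently.

```python
from dataclasses import dataclass


@dataclass
class GracePeriodTracker:
    """Hypothetical helper (not part of slapos): counts consecutive failures."""

    failure_amount: int = 5        # tolerated consecutive failures (assumed meaning)
    consecutive_failures: int = 0  # current length of the failure streak

    def record(self, check_passed: bool) -> str:
        """Record one promise run and return 'ok', 'grace' or 'anomaly'."""
        if check_passed:
            # Any successful run ends the failure streak.
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures > self.failure_amount:
            # The grace period is used up, so the failure is reported.
            return "anomaly"
        # Still inside the grace period, so no ticket would be generated.
        return "grace"


def classify_runs(results, failure_amount=5):
    """Classify a sequence of run outcomes (True means the check passed)."""
    tracker = GracePeriodTracker(failure_amount=failure_amount)
    return [tracker.record(passed) for passed in results]
```

With this reading, a slave that needs a few processing rounds before its files appear stays in the "grace" state, while a partition that keeps failing still ends up as an anomaly.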

Tasks:

slapos.toolbox!133 (merged)
slapos.toolbox!134 (closed)
configure check_file_state with proper TestLess, AnomalyResult and TestResult
configure proper grace period (failure_amount)
- assert that the grace period really works depending on the promise configuration; improve the promise code if needed

Edited Feb 13, 2025 by Łukasz Nowak