slapgrid: Start services even without connection to master (!515) · Merge requests · nexedi / slapos.core

slapgrid: Start services even without connection to master

When the master is unreachable, fall back on starting the services of all partitions that were in the started state (as of the last known information).

To this end, remove the supervisor configuration file when stopping a partition, and (re)create it when starting it, so that only the processes that should be started exist. Then the fallback is as easy as asking supervisor to start all existing processes.
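As a rough illustration of that fallback, here is a minimal Python sketch. It is not the code from this merge request: `start_services`, `fetch_partitions` and `process_partition` are hypothetical names, and using `ConnectionError` to signal an unreachable master is an assumption. The `supervisor` argument is assumed to expose `startAllProcesses()`, as the `supervisor` namespace of supervisor's XML-RPC interface does.

```python
def start_services(fetch_partitions, process_partition, supervisor):
    """Process partitions normally, or fall back when the master is unreachable.

    All three arguments are hypothetical stand-ins:
    - fetch_partitions() asks the master for the partitions to process and is
      assumed to raise ConnectionError when the master cannot be reached;
    - process_partition(p) does the usual per-partition work;
    - supervisor exposes startAllProcesses().
    Returns True if the master was reachable, False if the fallback was used.
    """
    try:
        partitions = fetch_partitions()
    except ConnectionError:
        # Only partitions last known to be started still have a configuration
        # file, so starting every existing process restores exactly the
        # last known state.
        supervisor.startAllProcesses()
        return False
    for partition in partitions:
        process_partition(partition)
    return True
```

With a real supervisor instance, the third argument could be something like `xmlrpc.client.ServerProxy(url).supervisor`, where `url` depends on how supervisor is exposed on the node.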

It turns out there is a complicating factor: managers. They also add processes under custom process groups in the partition's supervisor configuration. So simply removing all of it on stop had unintended side effects, e.g. manager processes that should be run just after node instance were not run if the instance was stopped during node instance (thankfully we had tests for this, because I wouldn't have caught it otherwise).

So to accommodate managers, I changed the organization of the partition's supervisor files so that each process group now has its own file. To ensure collisions between partitions can never happen, I force each process group to begin with the id of the partition - this was already the case in practice. So the configuration file for the partition's normal processes (~/etc/service, ~/etc/run) is called <partition-id>.conf, and the file for some manager's processes is called <partition-id>-<custom-suffix>.conf. I think this implementation is cleaner anyway.

This way, on stopping a partition only the <partition-id>.conf file is removed.
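The naming scheme and the stop behaviour can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual slapgrid code: `group_conf_name` and `stop_partition` are hypothetical helpers, `conf_dir` stands for whichever directory holds the supervisor files, and it is assumed that a manager's process group is named `<partition-id>-<custom-suffix>`, matching its file name.

```python
import os


def group_conf_name(partition_id, group_name):
    """Return the configuration file name for a process group.

    Hypothetical helper: the partition's normal processes live in
    <partition-id>.conf, and a manager's group (assumed to be named
    <partition-id>-<custom-suffix>) lives in a file of the same name.
    """
    if group_name == partition_id:
        return partition_id + ".conf"
    prefix = partition_id + "-"
    if not group_name.startswith(prefix) or group_name == prefix:
        # Forcing the partition id as a prefix rules out collisions
        # between the files of different partitions.
        raise ValueError(
            "process group %r must begin with %r" % (group_name, prefix))
    return group_name + ".conf"


def stop_partition(conf_dir, partition_id):
    """Remove only <partition-id>.conf; manager files are left in place."""
    path = os.path.join(conf_dir, group_conf_name(partition_id, partition_id))
    try:
        os.remove(path)
    except FileNotFoundError:
        pass  # Already stopped: nothing to remove.
```

For example, stopping a partition with id `slappart0` would delete `slappart0.conf` but keep a manager's `slappart0-<custom-suffix>.conf` untouched.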

The previously merged feature of starting services on boot - !431 (merged) - has been reverted, since there is no need for a special case on boot now. We still need boot to be able to format even if the master is unreachable, but instead of the --local option that !431 (merged) added to format, I let boot distinguish between format merely failing to report to master and truly failing to format, and retry accordingly. This way, when master is unreachable, local formatting still takes place, but as soon as it is reachable, up-to-date information is reported. As it turns out, this is important because instances get their information (IP addresses...) from master instead of locally (that should also change one day, but that's another topic).

Edited Apr 11, 2023 by Xavier Thompson