Draft: rapid-cdn: move to instance node management with local instance database
Introduction
This merge request is the follow-up of !1947 (closed), which was abandoned because SlapOS Master doesn't support having thousands of instances requested by the same instance inside the same instance tree. To solve the issue, it was decided to integrate the work directly in the rapid-cdn SR.
Why
This merge request introduces CDN Requester. The initial goal of this tool is to separate the technical CDN from the management of CDN requests that are related to sales. This comes from the need to validate domain name ownership and move towards allowing any customer to use the CDN without risks.
Once this SR is released, the goal is to progressively move all existing CDN instances to this new SR, with the Instance Node available in any project.
The first implementation managed the instances hosted in the Instance Node (also called slave instances) in the traditional way. But the constraints of this approach are well known when the number of instances rises, so it was decided to move to an implementation fully in Python to ensure scalability.
Requirements
- Be able to host a large number of instances
- The instance list should not be stored in buildout (it leaves traces in logs and causes large reprocessing)
- Only reprocess what needs to be reprocessed
- Garbage collection: if and only if an instance is destroyed or stopped by the user, the CDN entry should also be removed
- Persistence of DNS validation: once a domain has been validated for an instance, there is no need to revalidate it on each call
Considerations about the Instance Node
Instances in the Instance Node are independent from the instance hosting them. To understand this, note that two actors are involved when using an Instance Node:
- The Instance Operator: The one managing the host instance
- The users requesting instances on the Instance Node; those instances are also called "Shared"/"Slave" instances.
In SlapOS we need to alert only the actor that should act to solve a given issue, just like we don't inform the compute node operator when a single instance is failing.
For example, if an instance hosted on the Instance Node is failing because the user provided incorrect parameters, because the DNS validation is not done yet, or because it has not finished deploying on the CDN, only the user of that instance should be informed; no alert should be raised on the Instance Node, as the failure has no impact from the instance operator's point of view.
Conclusion
In the long term, an Instance Node should be independent in its processing of the instances. Eventually there should be no need to reprocess the host instance if a new instance is allocated on it (not yet available with the current API).
It should reproduce the good practices of slapos node instance:
- Only process instances that need to be processed
- Have a promise validate the instance deployment
- Reprocess until the promise passes.
Implementation
LocalInstanceDB
Introduce a generic Recipe (slapos/recipe/localinstancedb.py) to:
- Offer a generic class to manage an SQLite database
- Store the list of instances
- Compare list of instances to calculate what is:
- New
- Modified
- Removed
The InstanceListComparator class performs hash-based comparison: it computes SHA256 hashes of JSON-serialized parameters (sorted keys) to efficiently detect changes without full comparison. This prevents unnecessary reprocessing when only validation status changes.
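As a minimal sketch of this comparison, assuming each side is a plain dict of {instance_reference: parameters} (the function names are illustrative, not the actual InstanceListComparator API):

```python
import hashlib
import json


def parameter_hash(parameters):
    # Hash the JSON-serialized parameters with sorted keys so that two
    # semantically identical dicts always produce the same digest.
    serialized = json.dumps(parameters, sort_keys=True)
    return hashlib.sha256(serialized.encode('utf-8')).hexdigest()


def compare_instance_lists(previous, current):
    # previous / current: {instance_reference: parameter_dict}
    previous_hashes = {ref: parameter_hash(p) for ref, p in previous.items()}
    current_hashes = {ref: parameter_hash(p) for ref, p in current.items()}
    new = [ref for ref in current_hashes if ref not in previous_hashes]
    removed = [ref for ref in previous_hashes if ref not in current_hashes]
    modified = [
        ref for ref in current_hashes
        if ref in previous_hashes and previous_hashes[ref] != current_hashes[ref]
    ]
    return new, modified, removed
```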
slapconfiguration
Add a new entry point slapconfiguration.jsonschema.localdb to store the list of instances in a local database. It also stores the validation state of the parameters, as it inherits from the JSONSchema entry point, which validates the instance parameters against the instance schema.
The entry point writes validated instances to the database at instance-db-path, storing both valid and invalid instances with their validation results. This database is then read by the Instance Node to determine what needs processing.
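For illustration, a minimal sketch of what storing one validation result could look like, assuming a plain SQLite table (the table layout, column names and the store_instance helper are illustrative, not the actual schema used by the recipe):

```python
import json
import sqlite3


def store_instance(db_path, reference, parameters, is_valid, error_message=None):
    # Both valid and invalid instances are stored, together with the
    # validation result, so the Instance Node can decide what to (re)process.
    with sqlite3.connect(db_path) as connection:
        connection.execute(
            "CREATE TABLE IF NOT EXISTS instance ("
            " reference TEXT PRIMARY KEY,"
            " parameters TEXT,"
            " valid INTEGER,"
            " error TEXT)")
        connection.execute(
            "INSERT OR REPLACE INTO instance VALUES (?, ?, ?, ?)",
            (reference, json.dumps(parameters, sort_keys=True),
             1 if is_valid else 0, error_message))
```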
Instance Node
First attempt at the implementation of an Instance Node (slapos/recipe/instancenode.py). The initial implementation was done so that it can be used as a recipe, but it was later extended to also run as a script called every minute by cron.
Main loop
Here are the main steps:
- Get the list of instances to process from the master (stored in the database filled by slapconfiguration at instance-db-path)
- Compare it to the list of instances we processed (stored at requestinstance-db-path) to see:
  - New instances
  - Modified instances
  - Removed instances
- Get the list of instances that need reprocessing (instances marked as invalid in the database that haven't been modified or removed)
- For each instance in: new, modified, needs reprocessing:
  - Process the instance
- For each removed instance:
  - Destroy the instance
The comparison uses InstanceListComparator which compares hashes to detect parameter changes. Instances are only considered modified if their parameter hash changed.
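A rough sketch of this main loop, assuming the node object exposes the helpers shown (load_instances, list_invalid_instances, process_instance and destroy_instance are illustrative names, not the actual API) and reusing compare_instance_lists from the LocalInstanceDB sketch above:

```python
def run_once(node):
    # Instances known to the master, as filled by slapconfiguration.
    master_instances = node.load_instances(node.instance_db_path)
    # Instances already processed during a previous run.
    processed_instances = node.load_instances(node.requestinstance_db_path)

    new, modified, removed = compare_instance_lists(
        processed_instances, master_instances)

    # Instances stored as invalid that were neither modified nor removed
    # are retried on every run until they eventually pass.
    to_retry = [
        ref for ref in node.list_invalid_instances()
        if ref not in modified and ref not in removed]

    for reference in new + modified + to_retry:
        node.process_instance(reference, master_instances[reference])

    for reference in removed:
        node.destroy_instance(reference)
```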
Instance Processing
The processing of an instance is as follows:
- Check whether the parameters are valid against the JSON Schema (validation done by slapconfiguration and stored in the DB)
- validateInstancePreDeployment: extra validation of parameters. For CDN Requester this is where we check DNS
- Deploy instance
- Validate the instance post-deployment (promise)
- Mark the instance as okay
If at any step the validation fails, the processing stops and the error is returned to the user by publishing it.
Any step can be overridden by inheriting the InstanceNode class. This is what the CDN Requester does to perform specialized validation for the CDN request.
Connection parameters are only published if they differ from what's already stored in the database, avoiding unnecessary updates.
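A condensed sketch of this pipeline and its overridable hooks; validateInstancePreDeployment matches the name given above, while the other method names (deployInstance, validateInstancePostDeployment, processInstance, publishError, markInstanceOkay) are illustrative assumptions, not the actual InstanceNode API:

```python
class InstanceNodeSketch:
    def publishError(self, reference, error):
        # Publish the error back to the user of that instance only.
        print('instance %s failed: %s' % (reference, error))

    def markInstanceOkay(self, reference):
        print('instance %s is okay' % reference)

    def deployInstance(self, reference, parameters):
        pass  # the actual deployment (CDN request) would happen here

    def validateInstancePreDeployment(self, reference, parameters):
        # Hook: extra parameter validation before deployment.
        # CDN Requester overrides this to perform the DNS checks.
        return True, None

    def validateInstancePostDeployment(self, reference):
        # Hook: promise-like check that the deployment actually succeeded.
        return True, None

    def processInstance(self, reference, parameters, schema_valid, schema_error):
        # 1. JSON Schema validation result computed by slapconfiguration.
        if not schema_valid:
            return self.publishError(reference, schema_error)
        # 2. Pre-deployment validation.
        ok, error = self.validateInstancePreDeployment(reference, parameters)
        if not ok:
            return self.publishError(reference, error)
        # 3. Deploy the instance.
        self.deployInstance(reference, parameters)
        # 4. Post-deployment validation (promise).
        ok, error = self.validateInstancePostDeployment(reference)
        if not ok:
            return self.publishError(reference, error)
        # 5. Mark the instance as okay.
        self.markInstanceOkay(reference)
```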
Instance destruction
This method ensures clean destruction of the instance before removing it from our list of instances. By default it requests the CDN instance in destroyed state, so the instance is properly cleaned up on the master before being removed from the local database.
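A minimal sketch of that default behaviour, with illustrative helper names (request_cdn_instance and remove_from_local_db are assumptions, not the actual API):

```python
def destroy_instance(node, reference):
    # Request the underlying CDN instance in "destroyed" state first so the
    # master properly cleans it up, then drop it from the local database.
    node.request_cdn_instance(reference, state='destroyed')
    node.remove_from_local_db(reference)
```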
Usage as script
Some tooling has been added here to parse a configuration file, run with a PID file to avoid concurrent runs, and set up proper logging.
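A minimal sketch of such a wrapper, assuming a flock-based PID file lock and basic logging configuration (option names and behaviour are illustrative, not the actual script interface):

```python
import argparse
import fcntl
import logging
import os
import sys


def main():
    parser = argparse.ArgumentParser(description='Run the instance node once.')
    parser.add_argument('--config', required=True, help='path to the configuration file')
    parser.add_argument('--pidfile', required=True, help='PID file used as a lock')
    args = parser.parse_args()

    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')

    # Take an exclusive, non-blocking lock on the PID file so that two cron
    # runs can never overlap; the loser simply exits.
    pid_file = open(args.pidfile, 'w')
    try:
        fcntl.flock(pid_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        logging.info('Another run is already in progress, exiting.')
        return 0
    pid_file.write(str(os.getpid()))
    pid_file.flush()

    # ... parse args.config and run the instance node main loop here ...
    return 0


if __name__ == '__main__':
    sys.exit(main())
```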
CDN Request
This class (slapos/recipe/cdnrequest.py) inherits from InstanceNode and specializes:
- Prevalidation:
  - Domain validation: checks domain ownership via a DNS TXT record containing a validation token (see the sketch after this list)
  - Domain uniqueness: ensures a domain is not already validated for another instance
  - DNS resolution: uses a DNS resolver (dnspython) with a fresh cache at initialization (to bypass the server DNS cache)
  - Host tracking: stores validated domains and used hosts in DomainValidationDB with tables:
    - domain_validation: stores instance_reference, domain, token, validated, timestamp
    - used_hosts: stores host, instance_reference pairs to track host assignments. This ensures an alias cannot be added for an already validated domain.
- Destruction: removes domain validation entries and frees hosts when the instance is destroyed
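A minimal sketch of the domain ownership check using dnspython (dnspython >= 2.0, which provides Resolver.resolve); the function name, the token format and looking the record up on the domain itself are assumptions, not the exact implementation:

```python
import dns.exception
import dns.resolver


def domain_has_validation_token(domain, expected_token):
    # A fresh Resolver is instantiated so no answer cached by a previous run
    # is reused, matching the "fresh cache at initialization" note above.
    resolver = dns.resolver.Resolver()
    try:
        answer = resolver.resolve(domain, 'TXT')
    except dns.exception.DNSException:
        return False
    for record in answer:
        for txt_chunk in record.strings:
            if txt_chunk.decode('utf-8') == expected_token:
                return True
    return False
```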
The CDN Request recipe also validates parameters like in rapid-cdn.
A new constraint has been added on server-alias entries: they must be subdomains of the custom_domain, to avoid multiplying domain ownership verifications.
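One way to express that constraint, as a small illustrative check (not necessarily how the recipe implements it):

```python
def is_allowed_server_alias(alias, custom_domain):
    # An alias is accepted only if it is the custom domain itself or one of
    # its subdomains, so a single ownership verification covers all aliases.
    alias = alias.rstrip('.').lower()
    custom_domain = custom_domain.rstrip('.').lower()
    return alias == custom_domain or alias.endswith('.' + custom_domain)
```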
Notes from initial discussion:
So domain name validation should be done in a separate SR that requests through a remote node to the Node.
The CDN SR is purely technical, now called "technical CDN". The new SR is the business one and does a lot of validation.
Existing Premium CDN Shared instances are all ported to the new SR.
Remove everyone from the Premium CDN project aside from its operators.
The new SR can use Jinja with Buildout.
At the moment the Business CDN SR will only do the domain name validation.
XXX Maybe we want to sign the parameters provided by the Business CDN to make sure we trust the source. After cleaning up the current data.
For each shared instance on the Business CDN, it requests a Shared Instance on the technical CDN.
TODO
TODO Before release
- Garbage collect of stopped or destroyed slaves???
- Check domain name via CDN via software PY
- Do not request to CDN with default parameters
- Add clear parameters for computer uid / instance uid for SLA
- Add egg tests
- Use JSON parameters
- Add subroutine to check domain validation / non-SlapOS-dependent checks
- Request 1000 shared instances to test scalability and garbage collection
- Add promise for failing instances to help operators debug
- Clean up connection parameters and parameters for slaves sent to and received from the master
- Add SR tests
- Add bang to instance node deploy, else the master node is not reprocessed.
TODO after release (to be validated)
- slapconfiguration jsonschema slaves: the Master doesn't send the SR of the slave. How do you check the schema?
- How to create a ticket to inform a user that its instance is failing
- Resiliency of databases?? What happens if the list of validated domains is lost?