Open
Created Nov 05, 2024 by Xavier Thompson (@xavier_thompson, Owner). 31 of 40 tasks completed.

Draft: erp5: Introduce mariadb replication at SlapOS level

EDIT: I rewrote the description to focus on the key points because the previous description had gotten way too long and technical. Everything is described in the commit messages. I invite you to read them in order for a detailed understanding.

Motivation

Despite its name and its strong focus on mariadb replication, the overall concern of this MR is the wider question of ERP5 resiliency. Not resiliency with ERP5 inside Theia, but "native" resiliency of ERP5. This involves replicating ERP5's object database (ZODB) and SQL database (index catalog, activities, ...). ZODB replication is already well implemented using Neo, so this MR focuses mostly on mariadb replication; it does, however, bring some improvements to Neo.

Before this MR, some cutting-edge ERP5 projects already used Neo + mariadb replication to ensure ERP5 resiliency, but this was mostly done and maintained manually outside of SlapOS. The goal is to mainstream this technique by automating it and integrating it inside SlapOS. Ultimately, ERP5 replication inside Theia should be replaced by "native" ERP5 replication everywhere.

This MR does not aim to complete this transition in a single step. Instead, it makes a significant step in this direction.


Overview of tasks and future todos

Some of these will not be implemented in the current MR.

  • neo
    • Request a neo replica without needing to manually access the partition to set its state to BACKINGUP — Powered by neoppod!25 (merged)
    • Make check_neo_health promise assert that state is BACKINGUP when it should be — Powered by slapos.toolbox!139 (merged) (backported in slapos.toolbox==0.128.2)
    • Make check_neo_health promise not bang when it fails — Powered by slapos.core!786 (merged)
  • zope
    • Deactivate zope promises when the neo is expected to be BACKINGUP (temporary solution)
    • Adapt the zope service so that it detects when neo is in BACKINGUP state and goes on standby until neo is RUNNING (instead of crashing)
    • Support starting only select zope processes, e.g. to disable external interfaces when creating a dev or test clone of an ERP5
  • mariadb
    • Remove mariadb_update service that could break replication — instead users are only created on database creation, and updater is run on every mariadb restart
    • Create a replication_user with REPLICATION SLAVE grant and a randomly generated password (or the same password as the primary)
    • Support mariabackups in addition to sqldumps — introduce new backup parameter dict to control backups (£ — !1792)
    • Optimize mariabackups size and speed by mixing full and incremental mariabackups as introduced in !1792 (£ — !1792)
    • Serve mariadb backups statically with simplehttpserver so that another mariadb can fetch them to bootstrap replication (££ — binlogs retention)
    • Introduce replication parameters to make a mariadb replicate & bootstrap from another mariadb instance (£££ — usage)
    • Make mariadb_replication promise not bang when it fails — Powered by slapos.core!786 (merged)
    • Support disabling TCP access on mariadb replica
    • Enable TLS IPv6 https:// access to bootstrap and TLS IPv6 mysql:// access to replication_user of mariadb (££££ — example)
      • caucased
        • Introduce an embedded caucased server and autoapprove a caucase user (=admin) certificate; publish the embedded caucased url.
        • Request and renew locally the autoapproved caucase user certificate
        • Request a local caucase service certificate for mariadb & bootstrap TLS access
        • Automatically sign the local service certificate using the user certificate
        • Allow external certificate requests to this embedded caucased to be signed by passing the CSR via new csr-to-sign parameter
        • Support passing an external-caucased-url instead of launching an embedded caucased; in that case there is no user certificate and nothing is automatically approved
      • reverse-proxy: haproxy & proxysql
        • Use haproxy to serve backups for bootstrap over IPv6 https://
        • Use proxysql to give access to replication_user over IPv6 mysql://
        • Decide whether ProxySQL's lack of CRL support is an issue, and find a workaround or another solution if it is
      • replica mTLS
        • Pass the primary's caucased-url to mariadb replicas so that it can request and renew a replica caucased service certificate
        • Make the replica publish the corresponding CSR
        • Make the replica connect with mTLS to the primary's bootstrap https server (behind haproxy) and mysql:// mariadb (behind proxysql)
  • takeover
    • Provide a takeover script (mariadb-replica-become-primary) in the mariadb partition that can be called manually by logging into the partition as compute node administrator
    • Allow a replica mariadb to stop replicating and become a primary without requiring manual login to the instance and manual operations on the DB (e.g. by providing a url where the user can click to perform this action) — this will be a necessary step of an eventual automated takeover procedure
    • Streamline the takeover steps on neo in a single script (change state to RUNNING, truncate); this will be a necessary step of an eventual automated takeover procedure
    • Provide a non-manual way for a replica neo to become a primary (e.g. by providing a url where the user can click to perform this action) — this will be a necessary step of an eventual automated takeover procedure
    • Integrate the procedure to make mariadb coherent with a truncated neo using ERP5Site_resynchroniseCatalogSince (£££££)
    • Provide a comprehensive "one-click" takeover method for a whole replica ERP5: mariadb takeover + neo takeover + neo truncation + ERP5Site_resynchroniseCatalogSince + zope management

Footnotes

£: !1792 proposes a much more advanced way to generate and store mariabackups, using frequent incremental mariabackups combined with infrequent full mariabackups, and storing them with restic. This makes for faster and smaller backups. Restic stores the backups as content-defined chunks, so the backups are not available as a single file without asking restic to reconstitute it. Using restic would therefore imply serving the bootstrap backups with something like rest server, which would reconstitute and serve the backup files on demand. UPDATE: The full + incremental mariabackups feature has now been included here without restic.

££: Replication works by fetching mariadb binlogs. Binlogs are retained on the primary only for a few days (by default). So if the primary is older than the binlog retention time when a replica is created, the replica must first restore a recent backup of the primary to bootstrap replication.
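
The bootstrap condition described in this footnote can be sketched as below. This is only an illustration, not code from this MR: `needs_bootstrap` and both of its parameters are hypothetical names, and the retention period is passed explicitly because its actual value depends on the primary's configuration.

```python
from datetime import timedelta


def needs_bootstrap(primary_age: timedelta, binlog_retention: timedelta) -> bool:
    """Return True when binlogs alone cannot replay the primary's history.

    Hypothetical helper: if the primary has existed for longer than its
    binlog retention time, the oldest binlogs have already been purged,
    so a replica must first restore a recent backup.
    """
    return primary_age > binlog_retention
```

For a primary older than its retention window, `needs_bootstrap` returns True and a backup URL should be provided in the replica request; for a freshly created primary it returns False and the bootstrap can be skipped.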

£££: To request a mariadb replica — either standalone or as a sub-instance of ERP5 (§):

   'replication': {
     'upstream-mariadb-url': 'mysql://<user>:<password>@<ip>:<port>',
     'upstream-mariabackup-url': 'http(s)://<recent-mariabackup-of-primary>',
   }

or

   'replication': {
     'upstream-mariadb-url': 'mysql://<user>:<password>@<ip>:<port>',
     'upstream-bootstrap-url': 'http(s)://<recent-sqldump-backup-of-primary>',
   }

This takes effect on mariadb database creation - when no data exists yet. That way existing data cannot be deleted by setting or changing the replication parameters after the fact.

A promise checks that the state of the running mariadb matches the requested state (replica/primary, replication source). If they differ, the mariadb database will not converge automatically once the ~/srv/mariadb directory exists: human intervention is required.

The bootstrap-url or mariabackup-url may be omitted: this skips replication bootstrap and requires that all binlogs be still available on the primary. This is useful when the primary is recent and may not have a ready backup for bootstrap yet.

The primary mariadb publishes the needed parameters under replication-primary-url, replication-bootstrap-url, and replication-mariabackup-url. They can then be plugged directly into the replica request.
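
As a rough sketch of that plumbing, the published parameters of the primary could be mapped onto a replica request as follows. `build_replica_request` is a hypothetical helper, not part of this MR, and preferring the mariabackup over the sqldump when both are published is an assumption made for the example.

```python
def build_replica_request(published: dict) -> dict:
    """Map parameters published by the primary onto replica request parameters.

    Hypothetical helper: `published` is assumed to be the dict of
    connection parameters published by the primary mariadb.
    """
    replication = {"upstream-mariadb-url": published["replication-primary-url"]}
    # Both bootstrap sources are optional; omitting them skips the bootstrap.
    # Assumption: prefer the mariabackup when both are available.
    if "replication-mariabackup-url" in published:
        replication["upstream-mariabackup-url"] = published["replication-mariabackup-url"]
    elif "replication-bootstrap-url" in published:
        replication["upstream-bootstrap-url"] = published["replication-bootstrap-url"]
    return {"replication": replication}
```

The returned dict has the same shape as the request examples above and can be passed as the instance parameters of the replica.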

££££: If the replica is accessed over TLS IPv6, the caucased-url of the primary on which the replica will request a certificate must be passed as well:

   'replication': {
     'upstream-mariadb-url': 'mysql://<user>:<password>@<ipv6>:<port>',
     'upstream-mariabackup-url': 'http(s)://<recent-mariabackup-of-primary>',
     'upstream-caucased-url': 'http://[<ipv6>]:<port>',
   }

The replica will then publish a CSR under caucased-csr-to-sign; the ERP5 root instance (if there is one) will republish it (§§). To have the primary's caucased sign it, pass it back to the primary:

   'caucased': {
     'csr-to-sign': '<PEM-content>',
   }
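
The CSR round trip can be sketched as a small helper that takes what a standalone replica published and produces the parameters to send back to the primary. `build_csr_signing_request` is a hypothetical name used for illustration only; the parameter keys are the ones named in this description.

```python
def build_csr_signing_request(replica_published: dict) -> dict:
    """Turn the CSR published by a replica into parameters for the primary.

    Hypothetical helper: `replica_published` is assumed to be the dict of
    parameters published by a standalone mariadb replica.
    """
    csr_pem = replica_published["caucased-csr-to-sign"]
    return {"caucased": {"csr-to-sign": csr_pem}}
```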

£££££: For many ERP5 use cases to work correctly (accurate stock evaluation, activities, ...), the ZODB (neo) and the index catalog (mariadb) must be coherent with each other. This coherence is maintained by the zope processes and the activity queue. At the time a takeover is needed, the replica mariadb and replica neo will most likely not be coherent with each other. One way to regain coherence is to regenerate the mariadb catalog from scratch by re-indexing the whole ZODB; this is a very lengthy process that can take days or weeks, which makes it unsuitable in practice. Our practical "state-of-the-art" solution is to truncate the neo to its state a few minutes back in time; enough minutes to be certain that all the ZODB objects created and modified prior to that truncation point are correctly indexed in the non-truncated mariadb. Then it's only a matter of examining the indexations in mariadb that occurred in the interval between the truncation time and the most recent state of mariadb to determine which remain valid. This is done by ERP5Site_resynchroniseCatalogSince. Given that only a few minutes need to be examined, this process is very fast. Thus this technique trades a few minutes of data in the past for the ability to be up and running again a short time in the future.
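
The time arithmetic behind this technique can be illustrated as follows. This is a sketch under assumptions, not the actual takeover procedure: `takeover_window` is a hypothetical helper, it assumes the truncation point is counted back from the most recent state of the replica mariadb, and it does not call neo or ERP5Site_resynchroniseCatalogSince.

```python
from datetime import datetime, timedelta, timezone


def takeover_window(mariadb_latest: datetime, safety_margin: timedelta):
    """Return (truncation_point, resync_end) for a takeover.

    Hypothetical helper. Assumption: neo is truncated `safety_margin`
    before the most recent state of mariadb, and the catalog entries
    indexed between the two instants are the ones to re-examine.
    """
    truncation_point = mariadb_latest - safety_margin
    return truncation_point, mariadb_latest


# Example: with a 10 minute margin, only 10 minutes of indexations are examined.
latest = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
start, end = takeover_window(latest, timedelta(minutes=10))
```

A larger margin gives more certainty that everything before the truncation point is indexed, at the cost of discarding more recent data.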

§: To request an ERP5 with a mariadb replica sub-instance, the same parameters can be forwarded from the ERP5 root instance to mariadb by wrapping them in a 'mariadb' dict:

   'mariadb': {
      'replication': { '...' },
      'caucased': { '...' }
   }

§§: The ERP5 root instance (when mariadb is not standalone) will republish the needed parameters by prefixing them with 'mariadb-', e.g. mariadb-replication-primary-url, mariadb-caucased-url, mariadb-caucased-csr-to-sign.
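
The wrapping (§) and prefixing (§§) conventions can be sketched together. Both helpers below are hypothetical and only illustrate the naming scheme described here; they are not part of the software release.

```python
MARIADB_PREFIX = "mariadb-"


def wrap_for_erp5(mariadb_parameters: dict) -> dict:
    """Wrap standalone mariadb request parameters for an ERP5 root request."""
    return {"mariadb": dict(mariadb_parameters)}


def unprefix_published(erp5_published: dict) -> dict:
    """Recover mariadb parameter names from what the ERP5 root republishes.

    Keeps only the keys starting with 'mariadb-' and strips that prefix.
    """
    return {
        key[len(MARIADB_PREFIX):]: value
        for key, value in erp5_published.items()
        if key.startswith(MARIADB_PREFIX)
    }
```

With these, parameters republished by a primary ERP5 can be unprefixed, turned into a replica request, and wrapped again for a replica ERP5.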

Edited Jul 29, 2025 by Xavier Thompson
Source branch: feat/mariadb-replication