neo/storage/handlers/storage.py · b3dd69730cf0e4273e1be33ee3a5ee382836b3b3 · nexedi / neoppod

Optimize resumption of replication by starting from a greater TID · b3dd6973

Julien Muchembled authored May 17, 2018

Although data that are already transferred aren't transferred again, checking
that the data are there for a whole partition can still be a lot of work for
big databases. This commit is a major performance improvement: a storage
node that gets disconnected for a short time now becomes fully operational
almost instantaneously, because it only has to replicate the new data. Before,
the time to recover depended on the size of the DB.

For OUT_OF_DATE cells, the difficult part was that they are writable and
can therefore contain holes, so we can't just take the last TID in trans/obj
(we wrongly did that at the beginning, and then committed 6b1f198f as a
workaround). We solve that by storing up to where the cell was up-to-date:
this value is initialized from the last TIDs in trans/obj when the state
switches from UP_TO_DATE/FEEDING.
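The bookkeeping described above can be sketched as follows. This is only an illustration, not NEO's code: the `Cell` class, the `set_out_of_date` helper, the use of plain integers for TIDs, and the choice of taking the larger of the two last TIDs are all assumptions made for the example.

```python
ZERO_TID = 0  # stand-in: plain integers instead of NEO's packed TIDs


class Cell(object):
    """Hypothetical in-memory view of one assigned cell."""

    def __init__(self, state):
        self.state = state
        self.out_of_date_tid = None


def set_out_of_date(cell, last_trans_tid, last_obj_tid):
    """Switch a cell to OUT_OF_DATE and return its OUT_OF_DATE TID.

    The TID is only captured when leaving UP_TO_DATE/FEEDING, i.e. while
    the cell is still known to have no holes. Later writes to the
    OUT_OF_DATE cell must not move it forward.
    """
    if cell.state in ("UP_TO_DATE", "FEEDING"):
        # Assumption: the cell is complete up to the larger of the two.
        cell.out_of_date_tid = max(last_trans_tid, last_obj_tid)
    elif cell.out_of_date_tid is None:
        # A cell that was never readable has nothing to resume from.
        cell.out_of_date_tid = ZERO_TID
    cell.state = "OUT_OF_DATE"
    return cell.out_of_date_tid
```

Calling the helper a second time, after new data has been written to the cell, returns the same TID: this is what keeps holes from being skipped when replication resumes.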

There's actually one such OUT_OF_DATE TID per assigned cell (backends store
these values in the 'pt' table). Otherwise, a cell that still has a lot to
replicate would still cause all other cells to resume from a very small
TID, or even ZERO_TID; the worst case is when a new cell is assigned to a node
(as a result of tweak).
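To illustrate why one value per cell matters, here is a small sketch comparing a single shared resume point with per-cell resume points. The function names and the integer TIDs are hypothetical and only serve the comparison.

```python
ZERO_TID = 0  # stand-in: plain integers instead of NEO's packed TIDs


def shared_resume_tid(out_of_date_tids):
    """One resume point for the whole node: the smallest TID wins."""
    return min(out_of_date_tids.values(), default=ZERO_TID)


def per_cell_resume_tids(out_of_date_tids):
    """One resume point per assigned cell, keyed by partition offset."""
    return dict(out_of_date_tids)


# Partitions 0 and 1 were almost up-to-date; partition 2 was just assigned.
tids = {0: 5000, 1: 4990, 2: ZERO_TID}
shared = shared_resume_tid(tids)        # every cell restarts from ZERO_TID
per_cell = per_cell_resume_tids(tids)   # only partition 2 does
```

With the shared value, the newly assigned cell drags partitions 0 and 1 back to ZERO_TID; with per-cell values they resume from 5000 and 4990.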

For UP_TO_DATE cells of a backup cluster, replication was resumed from the
maximum TID at which all assigned cells are known to be fully replicated.
Like for OUT_OF_DATE cells, the presence of a late cell could cause a lot of
extra work for others, the worst case being when setting up a backup cluster
(it always restarted from ZERO_TID as long as at least 1 cell was still empty).
Because UP_TO_DATE cells are guaranteed to have no holes, there's no need to
store extra information: we simply look at the last TIDs in trans/obj.
We even handle trans & obj independently, to minimize the work in 1 table
(i.e. trans since it's processed first) if the other is late (obj).
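The benefit of handling trans and obj independently can be sketched like this. The helpers below are hypothetical, TIDs are plain integers, and "rows to check" is only a rough proxy for the work done when replication resumes.

```python
def combined_resume(last_trans_tid, last_obj_tid):
    """Single resume point for both tables: the late table dictates it."""
    return min(last_trans_tid, last_obj_tid)


def independent_resume(last_trans_tid, last_obj_tid):
    """One resume point per table."""
    return {"trans": last_trans_tid, "obj": last_obj_tid}


def rows_to_check(tids, resume_tid):
    """TIDs that still have to be examined after the resume point."""
    return [tid for tid in tids if tid > resume_tid]


trans_tids = [1, 2, 3, 4, 5]   # trans is processed first and is complete
last_trans, last_obj = 5, 2    # obj is late

with_combined = rows_to_check(trans_tids, combined_resume(last_trans, last_obj))
with_independent = rows_to_check(
    trans_tids, independent_resume(last_trans, last_obj)["trans"])
```

In this example the combined resume point makes trans rows 3, 4 and 5 be examined again, whereas the independent one leaves nothing to recheck in trans.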

There's a small change in the protocol so that the OUT_OF_DATE enum value equals 0.
This way, backends can store the OUT_OF_DATE TID (verbatim) in the same column
as the cell state.
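One way a single column could hold either a cell state or an OUT_OF_DATE TID is sketched below. Only OUT_OF_DATE being 0 comes from this commit message; the other enum values, the convention of storing other states as negative numbers, and the `encode`/`decode` helpers are assumptions made for illustration.

```python
OUT_OF_DATE = 0   # from the commit message
UP_TO_DATE = 1    # illustrative value
FEEDING = 2       # illustrative value


def encode(state, out_of_date_tid=0):
    """Return the integer to store in the column.

    Assumed convention: OUT_OF_DATE cells store their TID verbatim (>= 0),
    any other state is stored negated.
    """
    if state == OUT_OF_DATE:
        return out_of_date_tid
    return -state


def decode(value):
    """Return (state, out_of_date_tid or None) from a stored value."""
    if value >= 0:
        return OUT_OF_DATE, value
    return -value, None
```

Under this convention, `max(0, -value)` always yields the state, which only works because OUT_OF_DATE is 0.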

Note about MySQL changes in commit ca58ccd7:
what we did there as a workaround is no longer one. We now do so much on the
Python side that it's unlikely we could reduce the number of queries using
GROUP BY. We even stopped doing that for SQLite.
