    Optimize resumption of replication by starting from a greater TID · b3dd6973
    Julien Muchembled authored
    Although data that have already been transferred aren't transferred again,
    checking that the data are there for a whole partition can still be a lot of
    work for big databases. This commit is a major performance improvement: a
    storage node that gets disconnected for a short time now becomes fully
    operational almost immediately, because it only has to replicate the new
    data. Before, the time to recover depended on the size of the DB.
    
    For OUT_OF_DATE cells, the difficult part was that they are writable and
    can therefore contain holes, so we can't just take the last TID in trans/obj
    (we wrongly did that at the beginning, and then committed 6b1f198f as a
    workaround). We solve that by storing the TID up to which the cell was
    up-to-date: this value is initialized from the last TIDs in trans/obj when
    the state switches from UP_TO_DATE or FEEDING to OUT_OF_DATE.
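
    The following is a minimal sketch of that idea, not the actual NEO code.
    The Cell class, its attributes and mark_out_of_date() are hypothetical
    names, TIDs are modelled as plain integers, and taking the smaller of the
    two last TIDs is a conservative assumption (the message does not say how
    the two values are combined).

```python
# Illustrative sketch only: every name here is hypothetical, not NEO's API.
UP_TO_DATE, FEEDING, OUT_OF_DATE = "UP_TO_DATE", "FEEDING", "OUT_OF_DATE"


class Cell:
    def __init__(self, state, last_trans_tid, last_obj_tid):
        self.state = state
        self.last_trans_tid = last_trans_tid  # last TID found in 'trans'
        self.last_obj_tid = last_obj_tid      # last TID found in 'obj'
        self.out_of_date_tid = None           # only meaningful when OUT_OF_DATE

    def mark_out_of_date(self):
        if self.state in (UP_TO_DATE, FEEDING):
            # The cell has no holes at this point, so the last TIDs are a
            # safe bound. Assumption: keep the smaller of the two.
            self.out_of_date_tid = min(self.last_trans_tid, self.last_obj_tid)
        # If the cell was already OUT_OF_DATE, the stored value is kept:
        # later writes may have created holes below the new last TIDs.
        self.state = OUT_OF_DATE


cell = Cell(UP_TO_DATE, last_trans_tid=10, last_obj_tid=8)
cell.mark_out_of_date()
cell.last_trans_tid = 50   # a write received while out of date
cell.mark_out_of_date()    # does not move the stored TID
```

    Replication of this cell would then resume from out_of_date_tid rather
    than from the last TID in trans/obj, which could skip over a hole.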
    
    There's actually one such OUT_OF_DATE TID per assigned cell (backends store
    these values in the 'pt' table). Otherwise, a cell that still has a lot to
    replicate would still cause all other cells to resume from a very small
    TID, or even ZERO_TID; the worst case is when a new cell is assigned to a
    node (as a result of tweak).
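
    To illustrate why a single value per node is not enough, here is a small
    sketch comparing both approaches. The function names and the sample
    values are made up for the example, and ZERO_TID is modelled as the
    integer 0.

```python
# Illustrative sketch only: hypothetical helpers, TIDs as plain integers.
ZERO_TID = 0


def shared_resume_tid(out_of_date_tids):
    """One value for the whole node: the latest cell holds all others back."""
    return min(out_of_date_tids.values(), default=ZERO_TID)


def per_cell_resume_tids(out_of_date_tids):
    """One value per assigned cell: each partition resumes independently."""
    return dict(out_of_date_tids)


# Partition number -> OUT_OF_DATE TID. Partition 2 was just assigned to the
# node, so it has nothing yet.
cells = {0: 9000, 1: 9100, 2: ZERO_TID}

shared = shared_resume_tid(cells)       # every cell restarts from ZERO_TID
per_cell = per_cell_resume_tids(cells)  # only partition 2 starts from ZERO_TID
```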
    
    For UP_TO_DATE cells of a backup cluster, replication was resumed from the
    maximum TID at which all assigned cells are known to be fully replicated.
    As with OUT_OF_DATE cells, the presence of a late cell could cause a lot of
    extra work for others, the worst case being when setting up a backup cluster
    (it always restarted from ZERO_TID as long as at least 1 cell was still empty).
    Because UP_TO_DATE cells are guaranteed to have no holes, there's no need to
    store extra information: we simply look at the last TIDs in trans/obj.
    We even handle trans & obj independently, to minimize the work in 1 table
    (i.e. trans since it's processed first) if the other is late (obj).
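
    A rough sketch of handling the two tables independently, under the
    assumption that the last TID of each table is available as a plain
    integer. last_tid() and backup_resume_tids() are hypothetical helpers,
    not functions of the storage backends.

```python
# Illustrative sketch only: hypothetical helpers, TIDs as plain integers.
ZERO_TID = 0


def last_tid(tids):
    """Last TID present in one table for one partition (ZERO_TID if empty)."""
    return max(tids, default=ZERO_TID)


def backup_resume_tids(trans_tids, obj_tids):
    # UP_TO_DATE cells have no holes, so the last TID of each table is a
    # valid point to resume from, and the two tables need not agree.
    return {"trans": last_tid(trans_tids), "obj": last_tid(obj_tids)}


# 'trans' is already replicated up to 9 whereas 'obj' is late.
resume = backup_resume_tids(trans_tids=[1, 5, 9], obj_tids=[1, 5])
```

    Here only obj has to catch up from 5; trans is not rechecked from 5 just
    because obj is late.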
    
    There's a small change in the protocol so that OUT_OF_DATE enum value equals 0.
    This way, backends can store the OUT_OF_DATE TID (verbatim) in the same column
    as the cell state.
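
    One possible encoding is sketched below. Only OUT_OF_DATE == 0 comes from
    this message; the other enum values, the helper names and the choice of
    storing the other states negated are assumptions made for the example.

```python
# Illustrative sketch only. Only OUT_OF_DATE == 0 is stated in the commit
# message; the other values and the negation scheme are assumptions.
OUT_OF_DATE, UP_TO_DATE, FEEDING = 0, 1, 2


def encode(state, out_of_date_tid=0):
    """Value stored in the single column shared by state and TID."""
    if state == OUT_OF_DATE:
        return out_of_date_tid  # the TID is stored verbatim (>= 0)
    return -state               # assumption: other states are stored negated


def decode(value):
    """Return (state, out_of_date_tid); the TID is None unless OUT_OF_DATE."""
    if value >= 0:
        return OUT_OF_DATE, value
    return -value, None
```

    With this scheme, an OUT_OF_DATE cell that has nothing yet is stored
    as 0, which is both the enum value and the smallest possible TID.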
    
    Note about the MySQL changes in commit ca58ccd7: what was done there as a
    workaround is no longer merely a workaround. We now do so much on the
    Python side that it's unlikely we could reduce the number of queries by
    using GROUP BY. We even stopped using GROUP BY for SQLite.
storage.py 11.3 KB