Commit 8e3c7b01 by Julien Muchembled

Implements backup using specialised storage nodes and relying on replication

Replication is also fully reimplemented:
- It is no longer done on whole partitions.
- It runs at the lowest priority so as not to degrade performance for client
  nodes.

The schema of the MySQL tables is changed to optimize storage layout: rows are
now grouped by age, which gives good partial replication performance.
This certainly also speeds up simple loads/stores.
1 parent 75d83690
Showing 54 changed files with 1202 additions and 1434 deletions
......@@ -111,42 +111,17 @@ RC - Review output of pylint (CODE)
consider using query(request, args) instead of query(request % args)
- Make listening address and port optional, and if they are not provided
listen on all interfaces on any available port.
- Replication throttling (HIGH AVAILABILITY)
In its current implementation, replication runs at full speed, which
degrades performance for client nodes. Replication should allow
throttling, and that throttling should be configurable.
See "Replication pipelining".
- Make replication speed configurable (HIGH AVAILABILITY)
In its current implementation, replication runs at lowest priority, so as not
to degrade performance for client nodes. But when there's only 1 storage
left for a partition, it may be desirable to guarantee a minimum speed to
avoid complete data loss if another failure happens too early.
- Pack segmentation & throttling (HIGH AVAILABILITY)
In its current implementation, pack runs in one call on all storage nodes
at the same time, which locks down the whole cluster. This task should
be split in chunks and processed in "background" on storage nodes.
Packing throttling should probably be at the lowest possible priority
(below interactive use and below replication).
- Replication pipelining (SPEED)
Replication currently works with too many exchanges between replicating
storage, and network latency can become a significant limit.
This should be changed to have just one initial request from
replicating storage, and multiple packets from reference storage with
database range checksums. When receiving these checksums, replicating
storage must compare with what it has, and ask row lists (might not even
be required) and data when there are differences. Quick fetching from
network with asynchronous checking (=queueing) + congestion control
(asking reference storage to pause its packet flow) will probably be
required.
This should make it easier to throttle replication workload on reference
storage node, as it can decide to postpone replication-related packets on
its own.
- Partial replication (SPEED)
In its current implementation, replication always happens on a whole
partition. In typical use, only a few last transactions will have been
missed, so replicating only past a given TID would be much faster.
To achieve this, storage nodes must store 2 values:
- a pack identifier, which must be different each time a pack occurs
(increasing number sequence, TID-ish, etc) to trigger a
whole-partition replication when a pack happened (this could be
improved too, later)
- the latest (-ish) transaction committed locally, to use as a lower
replication boundary
- tpc_finish failures propagation to master (FUNCTIONALITY)
When asked to lock transaction data, if something goes wrong the master
node must be informed.
......
......@@ -9,7 +9,7 @@ SQL commands to migrate each storage from NEO 0.10.x::
CREATE TABLE new_data (id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, hash BINARY(20) NOT NULL UNIQUE, compression TINYINT UNSIGNED NULL, value LONGBLOB NULL) ENGINE = InnoDB SELECT DISTINCT obj.hash as hash, compression, value FROM obj, data WHERE obj.hash=data.hash ORDER BY serial;
DROP TABLE data;
RENAME TABLE new_data TO data;
CREATE TABLE new_obj (partition SMALLINT UNSIGNED NOT NULL, oid BIGINT UNSIGNED NOT NULL, serial BIGINT UNSIGNED NOT NULL, data_id BIGINT UNSIGNED NULL, value_serial BIGINT UNSIGNED NULL, PRIMARY KEY (partition, oid, serial), KEY (data_id)) ENGINE = InnoDB SELECT partition, oid, serial, data.id as data_id, value_serial FROM obj LEFT JOIN data ON (obj.hash=data.hash);
CREATE TABLE new_obj (partition SMALLINT UNSIGNED NOT NULL, oid BIGINT UNSIGNED NOT NULL, serial BIGINT UNSIGNED NOT NULL, data_id BIGINT UNSIGNED NULL, value_serial BIGINT UNSIGNED NULL, PRIMARY KEY (partition, serial, oid), KEY (partition, oid, serial), KEY (data_id)) ENGINE = InnoDB SELECT partition, oid, serial, data.id as data_id, value_serial FROM obj LEFT JOIN data ON (obj.hash=data.hash);
DROP TABLE obj;
RENAME TABLE new_obj TO obj;
ALTER TABLE tobj CHANGE hash data_id BIGINT UNSIGNED NULL;
......
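The key change in the migration above is the primary key of `obj`, which goes from `(partition, oid, serial)` to `(partition, serial, oid)`, with the old ordering kept as a secondary key. The sketch below illustrates the access pattern this favours: fetching everything in one partition from a given serial onwards, oldest first. It uses the standard `sqlite3` module as a stand-in for MySQL/InnoDB, so the physical layout differs from the real backend; the helper `fetch_since` and the index name `obj_by_oid` are hypothetical names chosen for illustration.

```python
import sqlite3

def fetch_since(conn, partition, min_serial, limit):
    """Return (serial, oid) pairs of one partition, oldest first,
    starting at min_serial (hypothetical helper, not part of NEO)."""
    return conn.execute(
        'SELECT serial, oid FROM obj'
        ' WHERE "partition" = ? AND serial >= ?'
        ' ORDER BY serial, oid LIMIT ?',
        (partition, min_serial, limit)).fetchall()

conn = sqlite3.connect(":memory:")
# Same key ordering as the new MySQL schema: rows sort by age within a partition.
conn.execute(
    'CREATE TABLE obj ("partition" INTEGER NOT NULL, oid INTEGER NOT NULL,'
    ' serial INTEGER NOT NULL, data_id INTEGER,'
    ' PRIMARY KEY ("partition", serial, oid))')
# The previous ordering survives as a secondary index for per-object lookups.
conn.execute('CREATE INDEX obj_by_oid ON obj ("partition", oid, serial)')

rows = [(0, 1, 10), (0, 2, 10), (0, 1, 20), (1, 3, 15), (0, 2, 30)]
conn.executemany(
    'INSERT INTO obj ("partition", oid, serial) VALUES (?, ?, ?)', rows)

# Everything written to partition 0 at serial 20 or later.
recent = fetch_since(conn, 0, 20, 100)
```

With the serial placed before the oid in the key, a replicating node that only missed the last few transactions reads one contiguous key range instead of visiting every object.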
......@@ -959,7 +959,7 @@ class Application(object):
tid_list = []
# request a tid list for each partition
for offset in xrange(self.pt.getPartitions()):
p = Packets.AskTIDsFrom(start, stop, limit, [offset])
p = Packets.AskTIDsFrom(start, stop, limit, offset)
for node, conn in self.cp.iterateForObject(offset, readable=True):
try:
r = self._askStorage(conn, p)
......
......@@ -90,3 +90,8 @@ class ConfigurationManager(object):
# only from command line
return util.bin(self.argument_list.get('uuid', None))
def getUpstreamCluster(self):
return self.__get('upstream_cluster', True)
def getUpstreamMasters(self):
return util.parseMasterList(self.__get('upstream_masters'))
......@@ -79,6 +79,9 @@ class EpollEventManager(object):
self.epoll.unregister(fd)
del self.connection_dict[fd]
def isIdle(self):
return not (self._pending_processing or self.writer_set)
def _addPendingConnection(self, conn):
pending_processing = self._pending_processing
if conn not in pending_processing:
......
......@@ -48,6 +48,7 @@ class ErrorCodes(Enum):
PROTOCOL_ERROR = Enum.Item(4)
BROKEN_NODE = Enum.Item(5)
ALREADY_PENDING = Enum.Item(7)
REPLICATION_ERROR = Enum.Item(8)
ErrorCodes = ErrorCodes()
class ClusterStates(Enum):
......@@ -55,6 +56,9 @@ class ClusterStates(Enum):
VERIFYING = Enum.Item(2)
RUNNING = Enum.Item(3)
STOPPING = Enum.Item(4)
STARTING_BACKUP = Enum.Item(5)
BACKINGUP = Enum.Item(6)
STOPPING_BACKUP = Enum.Item(7)
ClusterStates = ClusterStates()
class NodeTypes(Enum):
......@@ -117,6 +121,7 @@ ZERO_TID = '\0' * 8
ZERO_OID = '\0' * 8
OID_LEN = len(INVALID_OID)
TID_LEN = len(INVALID_TID)
MAX_TID = '\x7f' + '\xff' * 7 # SQLite does not accept numbers above 2^63-1
UUID_NAMESPACES = {
NodeTypes.STORAGE: 'S',
......@@ -723,6 +728,7 @@ class LastIDs(Packet):
POID('last_oid'),
PTID('last_tid'),
PPTID('last_ptid'),
PTID('backup_tid'),
)
class PartitionTable(Packet):
......@@ -760,16 +766,6 @@ class PartitionChanges(Packet):
),
)
class ReplicationDone(Packet):
"""
Notify the master node that a partition has been successfully replicated from
a storage to another.
S -> M
"""
_fmt = PStruct('notify_replication_done',
PNumber('offset'),
)
class StartOperation(Packet):
"""
Tell a storage node to start an operation. Until a storage node receives
......@@ -965,7 +961,7 @@ class GetObject(Packet):
"""
Ask a stored object by its OID and a serial or a TID if given. If a serial
is specified, the specified revision of an object will be returned. If
a TID is specified, an object right before the TID will be returned. S,C -> S.
a TID is specified, an object right before the TID will be returned. C -> S.
Answer the requested object. S -> C.
"""
_fmt = PStruct('ask_object',
......@@ -1003,16 +999,14 @@ class TIDList(Packet):
class TIDListFrom(Packet):
"""
Ask for length TIDs starting at min_tid. The order of TIDs is ascending.
S -> S.
Answer the requested TIDs. S -> S
C -> S.
Answer the requested TIDs. S -> C
"""
_fmt = PStruct('tid_list_from',
PTID('min_tid'),
PTID('max_tid'),
PNumber('length'),
PList('partition_list',
PNumber('partition'),
),
PNumber('partition'),
)
_answer = PStruct('answer_tids',
......@@ -1054,27 +1048,6 @@ class ObjectHistory(Packet):
PFHistoryList,
)
class ObjectHistoryFrom(Packet):
"""
Ask history information for a given object. The order of serials is
ascending, and starts at (or above) min_serial for min_oid. S -> S.
Answer the requested serials. S -> S.
"""
_fmt = PStruct('ask_object_history',
POID('min_oid'),
PTID('min_serial'),
PTID('max_serial'),
PNumber('length'),
PNumber('partition'),
)
_answer = PStruct('ask_finish_transaction',
PDict('object_dict',
POID('oid'),
PFTidList,
),
)
class PartitionList(Packet):
"""
All the following messages are for neoctl to admin node
......@@ -1341,6 +1314,110 @@ class NotifyReady(Packet):
"""
pass
# replication
class FetchTransactions(Packet):
"""
S -> S
"""
_fmt = PStruct('ask_transaction_list',
PNumber('partition'),
PNumber('length'),
PTID('min_tid'),
PTID('max_tid'),
PFTidList, # already known transactions
)
_answer = PStruct('answer_transaction_list',
PTID('pack_tid'),
PTID('next_tid'),
PFTidList, # transactions to delete
)
class AddTransaction(Packet):
"""
S -> S
"""
_fmt = PStruct('add_transaction',
PTID('tid'),
PString('user'),
PString('description'),
PString('extension'),
PBoolean('packed'),
PFOidList,
)
class FetchObjects(Packet):
"""
S -> S
"""
_fmt = PStruct('ask_object_list',
PNumber('partition'),
PNumber('length'),
PTID('min_tid'),
PTID('max_tid'),
POID('min_oid'),
PDict('object_dict', # already known objects
PTID('serial'),
PFOidList,
),
)
_answer = PStruct('answer_object_list',
PTID('pack_tid'),
PTID('next_tid'),
POID('next_oid'),
PDict('object_dict', # objects to delete
PTID('serial'),
PFOidList,
),
)
class AddObject(Packet):
"""
S -> S
"""
_fmt = PStruct('add_object',
POID('oid'),
PTID('serial'),
PBoolean('compression'),
PChecksum('checksum'),
PString('data'),
PTID('data_serial'),
)
class Replicate(Packet):
"""
M -> S
"""
_fmt = PStruct('replicate',
PTID('tid'),
PString('upstream_name'),
PDict('source_dict',
PNumber('partition'),
PAddress('address'),
)
)
class ReplicationDone(Packet):
"""
Notify the master node that a partition has been successfully replicated from
a storage to another.
S -> M
"""
_fmt = PStruct('notify_replication_done',
PNumber('offset'),
PTID('tid'),
)
class Truncate(Packet):
"""
M -> S
"""
_fmt = PStruct('ask_truncate',
PTID('tid'),
)
_answer = PFEmpty
StaticRegistry = {}
def register(request, ignore_when_closed=None):
""" Register a packet in the packet registry """
......@@ -1516,16 +1593,12 @@ class Packets(dict):
ClusterState)
NotifyLastOID = register(
NotifyLastOID)
NotifyReplicationDone = register(
ReplicationDone)
AskObjectUndoSerial, AnswerObjectUndoSerial = register(
ObjectUndoSerial)
AskHasLock, AnswerHasLock = register(
HasLock)
AskTIDsFrom, AnswerTIDsFrom = register(
TIDListFrom)
AskObjectHistoryFrom, AnswerObjectHistoryFrom = register(
ObjectHistoryFrom)
AskPack, AnswerPack = register(
Pack, ignore_when_closed=False)
AskCheckTIDRange, AnswerCheckTIDRange = register(
......@@ -1540,6 +1613,20 @@ class Packets(dict):
CheckCurrentSerial)
NotifyTransactionFinished = register(
NotifyTransactionFinished)
Replicate = register(
Replicate)
NotifyReplicationDone = register(
ReplicationDone)
AskFetchTransactions, AnswerFetchTransactions = register(
FetchTransactions)
AskFetchObjects, AnswerFetchObjects = register(
FetchObjects)
AddTransaction = register(
AddTransaction)
AddObject = register(
AddObject)
AskTruncate, AnswerTruncate = register(
Truncate)
def Errors():
registry_dict = {}
......
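The `FetchTransactions` packet defined above carries a partition, a length, a TID range and the list of transactions the replicating node already knows; its answer carries a `next_tid` and a list of transactions to delete, while missing transactions are presumably pushed with `AddTransaction`. The handler code is not part of this excerpt, so the following is only a guess at the reconciliation such an exchange implies, with TIDs modelled as plain integers and `reconcile` a hypothetical function name.

```python
def reconcile(source_tids, known_tids, min_tid, max_tid, length):
    """Compare what the source has with what the replicating node knows.

    Returns (to_add, to_delete, next_tid). This is an illustrative sketch,
    not the logic of the actual storage handlers.
    """
    in_range = sorted(t for t in source_tids if min_tid <= t <= max_tid)
    chunk, rest = in_range[:length], in_range[length:]
    # If the range holds more than `length` entries, tell the peer where to resume.
    next_tid = rest[0] if rest else None
    # Only judge the part of the range that this chunk actually covers.
    upper = max_tid if next_tid is None else next_tid - 1
    known = set(t for t in known_tids if min_tid <= t <= upper)
    to_add = [t for t in chunk if t not in known]
    to_delete = sorted(known.difference(chunk))
    return to_add, to_delete, next_tid

result = reconcile(
    source_tids=[1, 2, 4, 5, 7], known_tids=[1, 3, 4],
    min_tid=1, max_tid=10, length=3)
```

Here the replicating node is missing transaction 2, holds a transaction 3 that the source does not have, and should resume from 5 on its next request.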
......@@ -150,6 +150,11 @@ class PartitionTable(object):
return True
return False
def getCell(self, offset, uuid):
for cell in self.partition_list[offset]:
if cell.getUUID() == uuid:
return cell
def setCell(self, offset, node, state):
if state == CellStates.DISCARDED:
return self.removeCell(offset, node)
......@@ -157,28 +162,19 @@ class PartitionTable(object):
raise PartitionTableException('Invalid node state')
self.count_dict.setdefault(node, 0)
row = self.partition_list[offset]
if len(row) == 0:
# Create a new row.
row = [Cell(node, state), ]
if state != CellStates.FEEDING:
self.count_dict[node] += 1
self.partition_list[offset] = row
self.num_filled_rows += 1
for cell in self.partition_list[offset]:
if cell.getNode() is node:
if not cell.isFeeding():
self.count_dict[node] -= 1
cell.setState(state)
break
else:
# XXX this can be slow, but it is necessary to remove a duplicate,
# if any.
for cell in row:
if cell.getNode() == node:
row.remove(cell)
if not cell.isFeeding():
self.count_dict[node] -= 1
break
row = self.partition_list[offset]
self.num_filled_rows += not row
row.append(Cell(node, state))
if state != CellStates.FEEDING:
self.count_dict[node] += 1
return (offset, node.getUUID(), state)
if state != CellStates.FEEDING:
self.count_dict[node] += 1
return offset, node.getUUID(), state
def removeCell(self, offset, node):
row = self.partition_list[offset]
......
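The rewritten `setCell` in the hunk above appears to rely on Python's `for ... else`: the `else` branch runs only when the loop finishes without `break`, that is, when no existing cell matched the node. A minimal sketch of that pattern follows, with cells modelled as dictionaries and state names as plain strings; `set_cell` and these stand-ins are hypothetical and much simpler than the real `PartitionTable`.

```python
FEEDING = "FEEDING"
UP_TO_DATE = "UP_TO_DATE"

def set_cell(row, count_dict, node, state):
    """Update the cell of `node` in `row`, or append one if none exists."""
    count_dict.setdefault(node, 0)
    for cell in row:
        if cell["node"] == node:
            # The cell stops counting under its old state...
            if cell["state"] != FEEDING:
                count_dict[node] -= 1
            cell["state"] = state
            break
    else:
        # No break happened: the node had no cell in this row yet.
        row.append({"node": node, "state": state})
    # ...and counts again under the new one, unless it is feeding.
    if state != FEEDING:
        count_dict[node] += 1

row, counts = [], {}
set_cell(row, counts, "S1", UP_TO_DATE)
set_cell(row, counts, "S1", FEEDING)
set_cell(row, counts, "S2", UP_TO_DATE)
```

Updating in place avoids the remove-then-append step of the previous version, and the usage counter is adjusted in a single place after the loop.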
......@@ -28,6 +28,10 @@ from neo.lib.event import EventManager
from neo.lib.connection import ListeningConnection, ClientConnection
from neo.lib.exception import ElectionFailure, PrimaryFailure, OperationFailure
from neo.lib.util import dump
class StateChangedException(Exception): pass
from .backup_app import BackupApplication
from .handlers import election, identification, secondary
from .handlers import administration, client, storage, shutdown
from .pt import PartitionTable
......@@ -41,6 +45,8 @@ class Application(object):
packing = None
# Latest completely committed TID
last_transaction = ZERO_TID
backup_tid = None
backup_app = None
def __init__(self, config):
# Internal attributes.
......@@ -90,16 +96,29 @@ class Application(object):
self._current_manager = None
# backup
upstream_cluster = config.getUpstreamCluster()
if upstream_cluster:
if upstream_cluster == self.name:
raise ValueError("upstream cluster name must be"
" different from cluster name")
self.backup_app = BackupApplication(self, upstream_cluster,
*config.getUpstreamMasters())
registerLiveDebugger(on_log=self.log)
def close(self):
self.listening_conn = None
if self.backup_app is not None:
self.backup_app.close()
self.nm.close()
self.em.close()
del self.__dict__
def log(self):
self.em.log()
if self.backup_app is not None:
self.backup_app.log()
self.nm.log()
self.tm.log()
if self.pt is not None:
......@@ -257,27 +276,29 @@ class Application(object):
a shutdown.
"""
neo.lib.logging.info('provide service')
em = self.em
poll = self.em.poll
self.tm.reset()
self.changeClusterState(ClusterStates.RUNNING)
# Now everything is passive.
while True:
try:
em.poll(1)
except OperationFailure:
# If not operational, send Stop Operation packets to storage
# nodes and client nodes. Abort connections to client nodes.
neo.lib.logging.critical('No longer operational')
for node in self.nm.getIdentifiedList():
if node.isStorage() or node.isClient():
node.notify(Packets.StopOperation())
if node.isClient():
node.getConnection().abort()
# Then, go back, and restart.
return
try:
while True:
poll(1)
except OperationFailure:
# If not operational, send Stop Operation packets to storage
# nodes and client nodes. Abort connections to client nodes.
neo.lib.logging.critical('No longer operational')
except StateChangedException, e:
assert e.args[0] == ClusterStates.STARTING_BACKUP
self.backup_tid = tid = self.getLastTransaction()
self.pt.setBackupTidDict(dict((node.getUUID(), tid)
for node in self.nm.getStorageList(only_identified=True)))
for node in self.nm.getIdentifiedList():
if node.isStorage() or node.isClient():
node.notify(Packets.StopOperation())
if node.isClient():
node.getConnection().abort()
def playPrimaryRole(self):
neo.lib.logging.info(
......@@ -314,7 +335,13 @@ class Application(object):
self.runManager(RecoveryManager)
while True:
self.runManager(VerificationManager)
self.provideService()
if self.backup_tid:
if self.backup_app is None:
raise RuntimeError("No upstream cluster to backup"
" defined in configuration")
self.backup_app.provideService()
else:
self.provideService()
def playSecondaryRole(self):
"""
......@@ -364,7 +391,8 @@ class Application(object):
# select the storage handler
client_handler = client.ClientServiceHandler(self)
if state == ClusterStates.RUNNING:
if state in (ClusterStates.RUNNING, ClusterStates.STARTING_BACKUP,
ClusterStates.BACKINGUP, ClusterStates.STOPPING_BACKUP):
storage_handler = storage.StorageServiceHandler(self)
elif self._current_manager is not None:
storage_handler = self._current_manager.getHandler()
......@@ -389,8 +417,9 @@ class Application(object):
handler = storage_handler
else:
continue # keep handler
conn.setHandler(handler)
handler.connectionCompleted(conn)
if type(handler) is not type(conn.getLastHandler()):
conn.setHandler(handler)
handler.connectionCompleted(conn)
self.cluster_state = state
def getNewUUID(self, node_type):
......@@ -437,19 +466,13 @@ class Application(object):
sys.exit()
def identifyStorageNode(self, uuid, node):
state = NodeStates.RUNNING
handler = None
if self.cluster_state == ClusterStates.RUNNING:
if uuid is None or node is None:
# same as for verification
state = NodeStates.PENDING
handler = storage.StorageServiceHandler(self)
elif self.cluster_state == ClusterStates.STOPPING:
if self.cluster_state == ClusterStates.STOPPING:
raise NotReadyError
else:
raise RuntimeError('unhandled cluster state: %s' %
(self.cluster_state, ))
return (uuid, state, handler)
state = NodeStates.RUNNING
if uuid is None or node is None:
# same as for verification
state = NodeStates.PENDING
return uuid, state, storage.StorageServiceHandler(self)
def identifyNode(self, node_type, uuid, node):
......
......@@ -18,15 +18,18 @@
import neo
from . import MasterHandler
from ..app import StateChangedException
from neo.lib.protocol import ClusterStates, NodeStates, Packets, ProtocolError
from neo.lib.protocol import Errors
from neo.lib.util import dump
CLUSTER_STATE_WORKFLOW = {
# destination: sources
ClusterStates.VERIFYING: set([ClusterStates.RECOVERING]),
ClusterStates.STOPPING: set([ClusterStates.RECOVERING,
ClusterStates.VERIFYING, ClusterStates.RUNNING]),
ClusterStates.VERIFYING: (ClusterStates.RECOVERING,),
ClusterStates.STARTING_BACKUP: (ClusterStates.RUNNING,
ClusterStates.STOPPING_BACKUP),
ClusterStates.STOPPING_BACKUP: (ClusterStates.BACKINGUP,
ClusterStates.STARTING_BACKUP),
}
class AdministrationHandler(MasterHandler):
......@@ -42,16 +45,17 @@ class AdministrationHandler(MasterHandler):
conn.answer(Packets.AnswerPrimary(app.uuid, []))
def setClusterState(self, conn, state):
app = self.app
# check request
if state not in CLUSTER_STATE_WORKFLOW:
try:
if app.cluster_state not in CLUSTER_STATE_WORKFLOW[state]:
raise ProtocolError('Can not switch to this state')
except KeyError:
raise ProtocolError('Invalid state requested')
valid_current_states = CLUSTER_STATE_WORKFLOW[state]
if self.app.cluster_state not in valid_current_states:
raise ProtocolError('Cannot switch to this state')
# change state
if state == ClusterStates.VERIFYING:
storage_list = self.app.nm.getStorageList(only_identified=True)
storage_list = app.nm.getStorageList(only_identified=True)
if not storage_list:
raise ProtocolError('Cannot exit recovery without any '
'storage node')
......@@ -60,15 +64,18 @@ class AdministrationHandler(MasterHandler):
if node.getConnection().isPending():
raise ProtocolError('Cannot exit recovery now: node %r is '
'entering cluster' % (node, ))
self.app._startup_allowed = True
else:
self.app.changeClusterState(state)
app._startup_allowed = True
state = app.cluster_state
elif state == ClusterStates.STARTING_BACKUP:
if app.tm.hasPending() or app.nm.getClientList(True):
raise ProtocolError("Can not switch to %s state with pending"
" transactions or connected clients" % state)
elif state != ClusterStates.STOPPING_BACKUP:
app.changeClusterState(state)
# answer
conn.answer(Errors.Ack('Cluster state changed'))
if state == ClusterStates.STOPPING:
self.app.cluster_state = state
self.app.shutdown()
if state != app.cluster_state:
raise StateChangedException(state)
def setNodeState(self, conn, uuid, state, modify_partition_table):
neo.lib.logging.info("set node state for %s-%s : %s" %
......
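`CLUSTER_STATE_WORKFLOW` above maps each destination state to the states it may be entered from, and `setClusterState` rejects any request that does not fit. The sketch below reproduces that check in isolation, using plain strings for the states and a local `ProtocolError` class; `check_transition` and `is_allowed` are hypothetical helpers written for this illustration.

```python
class ProtocolError(Exception):
    """Local stand-in for neo.lib.protocol.ProtocolError."""

# destination: sources (same shape as CLUSTER_STATE_WORKFLOW in the diff)
WORKFLOW = {
    "VERIFYING": ("RECOVERING",),
    "STARTING_BACKUP": ("RUNNING", "STOPPING_BACKUP"),
    "STOPPING_BACKUP": ("BACKINGUP", "STARTING_BACKUP"),
}

def check_transition(current, requested):
    """Raise ProtocolError unless `requested` can be entered from `current`."""
    try:
        allowed = WORKFLOW[requested]
    except KeyError:
        # States absent from the table can never be requested by an admin.
        raise ProtocolError('Invalid state requested')
    if current not in allowed:
        raise ProtocolError('Can not switch to this state')

def is_allowed(current, requested):
    try:
        check_transition(current, requested)
    except ProtocolError:
        return False
    return True

allowed_start = is_allowed("RUNNING", "STARTING_BACKUP")
allowed_stop = is_allowed("RUNNING", "STOPPING_BACKUP")
allowed_running = is_allowed("VERIFYING", "RUNNING")
```

Keying the table by destination makes the lookup a single dictionary access, and a missing key doubles as the "invalid state" check.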
##############################################################################
#
# Copyright (c) 2011 Nexedi SARL and Contributors. All Rights Reserved.
# Julien Muchembled <jm@nexedi.com>
#
# WARNING: This program as such is intended to be used by professional
# programmers who take the whole responsibility of assessing all potential
# consequences resulting from its eventual inadequacies and bugs
# End users who are looking for a ready-to-use solution with commercial
# guarantees and support are strongly advised to contract a Free Software
# Service Company
#
# This program is Free Software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#
##############################################################################
from neo.lib.exception import PrimaryFailure
from neo.lib.handler import EventHandler
from neo.lib.protocol import CellStates
class BackupHandler(EventHandler):
"""Handler dedicated to upstream master during BACKINGUP state"""
def connectionLost(self, conn, new_state):
if self.app.app.listening_conn: # if running
raise PrimaryFailure('connection lost')
def answerPartitionTable(self, conn, ptid, row_list):
self.app.pt.load(ptid, row_list, self.app.nm)
def notifyPartitionChanges(self, conn, ptid, cell_list):
self.app.pt.update(ptid, cell_list, self.app.nm)
def answerNodeInformation(self, conn):
pass
def notifyNodeInformation(self, conn, node_list):
self.app.nm.update(node_list)
def answerLastTransaction(self, conn, tid):
app = self.app
app.invalidatePartitions(tid, set(xrange(app.pt.getPartitions())))
def invalidateObjects(self, conn, tid, oid_list):
app = self.app
getPartition = app.app.pt.getPartition
partition_set = set(map(getPartition, oid_list))
partition_set.add(getPartition(tid))
app.invalidatePartitions(tid, partition_set)
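`invalidateObjects` above collects the partitions of every modified object plus the partition of the transaction itself, since both tables need replicating. The sketch below shows that computation on its own. It assumes that a partition is the 8-byte identifier read as a big-endian integer modulo the number of partitions; that mapping is not shown in this diff, and `get_partition`, `touched_partitions` and `p64` are local stand-ins defined here.

```python
import struct

def p64(number):
    """Pack an integer as an 8-byte big-endian identifier."""
    return struct.pack('>Q', number)

def get_partition(packed_id, num_partitions):
    """Assumed mapping from an OID or TID to a partition offset."""
    return struct.unpack('>Q', packed_id)[0] % num_partitions

def touched_partitions(tid, oid_list, num_partitions):
    """Partitions that a committed transaction makes out of date."""
    partition_set = set(
        get_partition(oid, num_partitions) for oid in oid_list)
    # The transaction metadata lives in the partition of the TID.
    partition_set.add(get_partition(tid, num_partitions))
    return partition_set

touched = touched_partitions(p64(10), [p64(1), p64(5), p64(9)], 4)
```

In this example the three objects all fall in partition 1 and the transaction in partition 2, so only two of the four partitions need to be replicated.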
......@@ -16,7 +16,7 @@
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
import neo.lib
from neo.lib.protocol import Packets, ProtocolError
from neo.lib.protocol import ClusterStates, Packets, ProtocolError
from neo.lib.exception import OperationFailure
from neo.lib.util import dump
from neo.lib.connector import ConnectorConnectionClosedException
......@@ -45,14 +45,18 @@ class StorageServiceHandler(BaseServiceHandler):
if not app.pt.operational():
raise OperationFailure, 'cannot continue operation'
app.tm.forget(conn.getUUID())
if app.getClusterState() == ClusterStates.BACKINGUP:
app.backup_app.nodeLost(node)
if app.packing is not None:
self.answerPack(conn, False)
def askLastIDs(self, conn):
app = self.app
loid = app.tm.getLastOID()
ltid = app.tm.getLastTID()
conn.answer(Packets.AnswerLastIDs(loid, ltid, app.pt.getID()))
conn.answer(Packets.AnswerLastIDs(
app.tm.getLastOID(),
app.tm.getLastTID(),
app.pt.getID(),
app.backup_tid))
def askUnfinishedTransactions(self, conn):
tm = self.app.tm
......@@ -68,15 +72,26 @@ class StorageServiceHandler(BaseServiceHandler):
# transaction locked on this storage node
self.app.tm.lock(ttid, conn.getUUID())
def notifyReplicationDone(self, conn, offset):
node = self.app.nm.getByUUID(conn.getUUID())
neo.lib.logging.debug("%s is up for offset %s" % (node, offset))
try:
cell_list = self.app.pt.setUpToDate(node, offset)
except PartitionTableException, e:
raise ProtocolError(str(e))
def notifyReplicationDone(self, conn, offset, tid):
app = self.app
node = app.nm.getByUUID(conn.getUUID())
if app.backup_tid:
cell_list = app.backup_app.notifyReplicationDone(node, offset, tid)
if not cell_list:
return
else:
try:
cell_list = self.app.pt.setUpToDate(node, offset)
if not cell_list:
raise ProtocolError('Non-oudated partition')
except PartitionTableException, e:
raise ProtocolError(str(e))
neo.lib.logging.debug("%s is up for offset %s", node, offset)
self.app.broadcastPartitionChanges(cell_list)
def answerTruncate(self, conn):
pass
def answerPack(self, conn, status):
app = self.app
if app.packing is not None:
......
......@@ -17,11 +17,25 @@
import neo.lib.pt
from struct import pack, unpack
from neo.lib.protocol import CellStates
from neo.lib.pt import PartitionTableException
from neo.lib.pt import PartitionTable
from neo.lib.protocol import CellStates, ZERO_TID
class PartitionTable(PartitionTable):
class Cell(neo.lib.pt.Cell):
replicating = ZERO_TID
def setState(self, state):
try:
if CellStates.OUT_OF_DATE == state != self.state:
del self.backup_tid, self.replicating
except AttributeError:
pass
return super(Cell, self).setState(state)
neo.lib.pt.Cell = Cell
class PartitionTable(neo.lib.pt.PartitionTable):
"""This class manages a partition table for the primary master node"""
def setID(self, id):
......@@ -54,7 +68,7 @@ class PartitionTable(PartitionTable):
row = []
for _ in xrange(repeats):
node = node_list[index]
row.append(neo.lib.pt.Cell(node))
row.append(Cell(node))
self.count_dict[node] = self.count_dict.get(node, 0) + 1
index += 1
if index == len(node_list):
......@@ -88,7 +102,7 @@ class PartitionTable(PartitionTable):
node_list = [c.getNode() for c in row]
n = self.findLeastUsedNode(node_list)
if n is not None:
row.append(neo.lib.pt.Cell(n,
row.append(Cell(n,
CellStates.OUT_OF_DATE))
self.count_dict[n] += 1
cell_list.append((offset, n.getUUID(),
......@@ -132,11 +146,11 @@ class PartitionTable(PartitionTable):
# check the partition is assigned and known as outdated
for cell in self.getCellList(offset):
if cell.getUUID() == uuid:
if not cell.isOutOfDate():
raise PartitionTableException('Non-oudated partition')
break
if cell.isOutOfDate():
break
return
else:
raise PartitionTableException('Non-assigned partition')
raise neo.lib.pt.PartitionTableException('Non-assigned partition')
# update the partition table
cell_list = [self.setCell(offset, node, CellStates.UP_TO_DATE)]
......@@ -177,7 +191,7 @@ class PartitionTable(PartitionTable):
else:
if num_cells <= self.nr:
row.append(neo.lib.pt.Cell(node, CellStates.OUT_OF_DATE))
row.append(Cell(node, CellStates.OUT_OF_DATE))
cell_list.append((offset, node.getUUID(),
CellStates.OUT_OF_DATE))
node_count += 1
......@@ -196,7 +210,7 @@ class PartitionTable(PartitionTable):
CellStates.FEEDING))
# Don't count a feeding cell.
self.count_dict[max_cell.getNode()] -= 1
row.append(neo.lib.pt.Cell(node, CellStates.OUT_OF_DATE))
row.append(Cell(node, CellStates.OUT_OF_DATE))
cell_list.append((offset, node.getUUID(),
CellStates.OUT_OF_DATE))
node_count += 1
......@@ -277,7 +291,7 @@ class PartitionTable(PartitionTable):
node = self.findLeastUsedNode([cell.getNode() for cell in row])
if node is None:
break
row.append(neo.lib.pt.Cell(node, CellStates.OUT_OF_DATE))
row.append(Cell(node, CellStates.OUT_OF_DATE))
changed_cell_list.append((offset, node.getUUID(),
CellStates.OUT_OF_DATE))
self.count_dict[node] += 1
......@@ -309,6 +323,13 @@ class PartitionTable(PartitionTable):
CellStates.OUT_OF_DATE))
return change_list
def iterNodeCell(self, node):
for offset, row in enumerate(self.partition_list):
for cell in row:
if cell.getNode() is node:
yield offset, cell
break
def getUpToDateCellNodeSet(self):
"""
Return a set of all nodes which are part of at least one UP TO DATE
......@@ -329,3 +350,16 @@ class PartitionTable(PartitionTable):
for cell in row
if cell.isOutOfDate())
def setBackupTidDict(self, backup_tid_dict):
for row in self.partition_list:
for cell in row:
cell.backup_tid = backup_tid_dict.get(cell.getUUID(),
ZERO_TID)
def getBackupTid(self):
try:
return min(max(cell.backup_tid for cell in row
if not cell.isOutOfDate())
for row in self.partition_list)
except ValueError:
return ZERO_TID
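`getBackupTid` above takes, for each partition, the highest `backup_tid` among cells that are not out of date, then the lowest of those values across partitions: the cluster as a whole is only backed up as far as its least advanced partition. A small worked example, with each cell modelled as a `(backup_tid, out_of_date)` tuple and TIDs as integers (both simplifications made for this sketch):

```python
ZERO_TID = 0

def get_backup_tid(partition_list):
    """Lowest, over all partitions, of the best readable cell's backup TID."""
    try:
        return min(max(tid for tid, out_of_date in row if not out_of_date)
                   for row in partition_list)
    except ValueError:
        # max() or min() got an empty sequence: some partition has no
        # readable cell, or the table is empty.
        return ZERO_TID

partition_list = [
    [(30, False), (10, False)],   # best readable cell is at 30
    [(20, False), (99, True)],    # the cell at 99 is out of date, so 20
]
backup_tid = get_backup_tid(partition_list)
no_readable_cell = get_backup_tid([[(5, True)]])
```

The `ValueError` branch covers both an empty table and a partition whose cells are all out of date, in which case nothing can be considered backed up.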
......@@ -33,6 +33,7 @@ class RecoveryManager(MasterHandler):
super(RecoveryManager, self).__init__(