- 27 Apr, 2019 2 commits
-
-
Julien Muchembled authored
neoctl gets a new command to change the number of replicas. The number of replicas becomes a new partition table attribute and like the PT id, it is stored in the config table. On the other side, the configuration value for the number of partitions is dropped, since it can be computed from the partition table, which is always stored in full. The -p/-r master options now only apply at database creation. Some implementation notes: - The protocol is slightly optimized in that the master now sends automatically the whole partition tables to the admin & client nodes upon connection, like for storage nodes. This makes the protocol more consistent, and the master is the only remaining node requesting partition tables, during recovery. - Some parts become tricky because app.pt can be None in more cases. For example, the extra condition in NodeManager.update (before app.pt.dropNode) was added for this is the reason. Or the 'loadPartitionTable' method (storage) that is not inlined because of unit tests. Overall, this commit simplifies more than it complicates. - In the master handlers, we stop hijacking the 'connectionCompleted' method for tasks to be performed (often send the full partition table) on handler switches. - The admin's 'bootstrapped' flag could have been removed earlier: race conditions can't happen since the AskNodeInformation packet was removed (commit d048a52d).
-
Julien Muchembled authored
It is often faster to set up replicas by stopping a node (and any underlying database server like MariaDB) and do a raw copy of the database (e.g. with rsync). So far, it required to stop the whole cluster and use tools like 'mysql' or sqlite3' to edit: - the 'pt' table in databases, - the 'config.nid' values of the new nodes. With this new option, if you already have 1 replica, you can set up new replicas with such fast raw copy, and without interruption of service. Obviously, this implies less redundancy during the operation.
-
- 11 Mar, 2019 1 commit
-
-
Julien Muchembled authored
-
- 07 Nov, 2018 1 commit
-
-
Julien Muchembled authored
Without this new mechanism to detect oids that aren't write-locked, a transaction could be committed successfully without detecting conflicts. In the added test, the resulting value was 2, whereas it should be 5 if there was no node failure.
-
- 05 Nov, 2018 1 commit
-
-
Julien Muchembled authored
-
- 06 Sep, 2018 1 commit
-
-
Julien Muchembled authored
Here is what happened after simulating a network failure between a client and a storage: C8 DEBUG recv failed for <SSLSocketConnectorIPv6 at 0x7f8198027f90 fileno 17 ('xxxx:xxxx:120:cd8::90a1', 53970), opened to ('xxxx:xxxx:60:4c2c::25c3', 39085)>: ECONNRESET (Connection reset by peer) DEBUG connection closed for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, closed, client) at 7f81939a0950> DEBUG connection started for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, fd=17, on_close=onConnectionClosed, connecting, client) at 7f8192eb17d0> PACKET #0x0000 RequestIdentification > S2 ([xxxx:xxxx:60:4c2c::25c3]:39085) | (<EnumItem CLIENT (2)>, -536870904, None, '...', [], 1535555463.455761) DEBUG SSL handshake done for <SSLSocketConnectorIPv6 at 0x7f8192eb1850 fileno 17 ('xxxx:xxxx:120:cd8::90a1', 54014), opened to ('xxxx:xxxx:60:4c2c::25c3', 39085)>: ECDHE-RSA-AES256-GCM-SHA384 256 DEBUG connection completed for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, fd=17, on_close=onConnectionClosed, client) at 7f8192eb17d0> (from xxxx:xxxx:120:cd8::90a1:54014) DEBUG <SSLSocketConnectorIPv6 at 0x7f8192eb1850 fileno 17 ('xxxx:xxxx:120:cd8::90a1', 54014), opened to ('xxxx:xxxx:60:4c2c::25c3', 39085)> closed in recv DEBUG connection closed for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, closed, client) at 7f8192eb17d0> ERROR Connection to <StorageNode(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, state=RUNNING, connection=None, not identified) at 7f81a8874690> failed S2 DEBUG accepted a connection from xxxx:xxxx:120:cd8::90a1:54014 DEBUG SSL handshake done for <SSLSocketConnectorIPv6 at 0x7f657144a910 fileno 22 ('xxxx:xxxx:60:4c2c::25c3', 39085), opened from ('xxxx:xxxx:120:cd8::90a1', 54014)>: ECDHE-RSA-AES256-GCM-SHA384 256 DEBUG connection completed for <ServerConnection(uuid=None, address=[xxxx:xxxx:120:cd8::90a1]:54014, handler=IdentificationHandler, fd=22, server) at 7f657144a090> (from xxxx:xxxx:60:4c2c::25c3:39085) PACKET #0x0000 RequestIdentification < None ([xxxx:xxxx:120:cd8::90a1]:54014) | (<EnumItem CLIENT (2)>, -536870904, None, '...', [], 1535555463.455761) DEBUG connection closed for <ServerConnection(uuid=None, address=[xxxx:xxxx:120:cd8::90a1]:54014, handler=IdentificationHandler, closed, server) at 7f657144a090> WARNING A connection was lost during identification ERROR Pre-mortem data: ERROR Traceback (most recent call last): ERROR File "neo/storage/app.py", line 194, in run ERROR self._run() ERROR File "neo/storage/app.py", line 225, in _run ERROR self.doOperation() ERROR File "neo/storage/app.py", line 310, in doOperation ERROR poll() ERROR File "neo/storage/app.py", line 134, in _poll ERROR self.em.poll(1) ERROR File "neo/lib/event.py", line 160, in poll ERROR to_process.process() ERROR File "neo/lib/connection.py", line 499, in process ERROR self._handlers.handle(self, self._queue.pop(0)) ERROR File "neo/lib/connection.py", line 85, in handle ERROR self._handle(connection, packet) ERROR File "neo/lib/connection.py", line 100, in _handle ERROR pending[0][1].packetReceived(connection, packet) ERROR File "neo/lib/handler.py", line 123, in packetReceived ERROR self.dispatch(*args) ERROR File "neo/lib/handler.py", line 72, in dispatch ERROR method(conn, *args, **kw) ERROR File "neo/storage/handlers/identification.py", line 56, in requestIdentification ERROR assert not 
node.isConnected(), node ERROR AssertionError: <ClientNode(uuid=C8, state=RUNNING, connection=<ServerConnection(uuid=C8, address=[xxxx:xxxx:120:cd8::90a1]:53970, handler=ClientOperationHandler, fd=18, on_close=onConnectionClosed, server) at 7f657147d7d0>) at 7f65714d6cd0>
-
- 22 Jun, 2018 2 commits
-
-
Julien Muchembled authored
This commit adds a contraint when tweaking the partition table with replicas, so that cells of each partition are assigned as far as possible from each other, e.g. not on the same machine even if each one has several disks, and in any case not on the same storage device. Currently, the topology path of each node is automatically calculated by the storage backend. Both MySQL and SQLite return a 2-tuple (host, st_dev). To be improved: - Add a storage option to override the path: the 'tweak' algorithm can already handle topology paths of any length, so something like (room, machine, disk) could be done easily. - Write OS-specific code to determine the real hardware behind st_dev (e.g. 2 different 'st_dev' values may actually refer to the same disk, because of layers like partitioning, device-mapper, loop, btrfs subvolumes, and so on). - Make 'neoctl' report in some way if the PT is optimal. Meanwhile, if it isn't, the master only logs a WARNING during tweak.
-
Julien Muchembled authored
This is a follow-up of commit b3dd6973 ("Optimize resumption of replication by starting from a greater TID"). I missed the case where a storage node is restarted while it is replicating: it lost the TID where it was interrupted. Although we commit after each replicated chunk, to avoid transferring again all the data from the beginning, it could still waste time to check that the data are already replicated.
-
- 21 Jun, 2018 1 commit
-
-
Julien Muchembled authored
-
- 19 Jun, 2018 1 commit
-
-
Julien Muchembled authored
-
- 30 May, 2018 2 commits
-
-
Julien Muchembled authored
/reviewed-on !9
-
Julien Muchembled authored
Although data that are already transferred aren't transferred again, checking that the data are there for a whole partition can still be a lot of work for big databases. This commit is a major performance improvement in that a storage node that gets disconnected for a short time now gets fully operational quite instantaneously because it only has to replicate the new data. Before, the time to recover depended on the size of the DB. For OUT_OF_DATE cells, the difficult part was that they are writable and can then contain holes, so we can't just take the last TID in trans/obj (we wrongly did that at the beginning, and then committed 6b1f198f as a workaround). We solve that by storing up to where it was up-to-date: this value is initialized from the last TIDs in trans/obj when the state switches from UP_TO_DATE/FEEDING. There's actually one such OUT_OF_DATE TID per assigned cell (backends store these values in the 'pt' table). Otherwise, a cell that still has a lot to replicate would still cause all other cells to resume from the a very small TID, or even ZERO_TID; the worse case is when a new cell is assigned to a node (as a result of tweak). For UP_TO_DATE cells of a backup cluster, replication was resumed from the maximum TID at which all assigned cells are known to be fully replicated. Like for OUT_OF_DATE cells, the presence of a late cell could cause a lot of extra work for others, the worst case being when setting up a backup cluster (it always restarted from ZERO_TID as long as at least 1 cell was still empty). Because UP_TO_DATE cells are guaranteed to have no holes, there's no need to store extra information: we simply look at the last TIDs in trans/obj. We even handle trans & obj independently, to minimize the work in 1 table (i.e. trans since it's processed first) if the other is late (obj). There's a small change in the protocol so that OUT_OF_DATE enum value equals 0. This way, backends can store the OUT_OF_DATE TID (verbatim) in the same column as the cell state. Note about MySQL changes in commit ca58ccd7: what we did as a workaround is not one any more. Now, we do so much on Python side that it's unlikely we could reduce the number of queries using GROUP BY. We even stopped doing that for SQLite.
-
- 14 Mar, 2018 1 commit
-
-
Julien Muchembled authored
For records that undo object creation, None values are used at the backend level whereas the protocol is not designed to serialize None for any field. Therefore, a dance done in many places around packet serialization, using the specific 0/ZERO_HASH/'' triplet to represent a deleted oid. For replication, it was missing at the sender side, leading to the following crash: Traceback (most recent call last): File "neo/storage/app.py", line 147, in run self._run() File "neo/storage/app.py", line 178, in _run self.doOperation() File "neo/storage/app.py", line 257, in doOperation next(task_queue[-1]) or task_queue.rotate() File "neo/storage/handlers/storage.py", line 271, in push conn.send(Packets.AddObject(oid, *object), msg_id) File "neo/lib/protocol.py", line 234, in __init__ self._fmt.encode(buf.write, args) File "neo/lib/protocol.py", line 345, in encode return self._trace(self._encode, writer, items) File "neo/lib/protocol.py", line 334, in _trace return method(*args) File "neo/lib/protocol.py", line 367, in _encode item.encode(writer, value) File "neo/lib/protocol.py", line 345, in encode return self._trace(self._encode, writer, items) File "neo/lib/protocol.py", line 342, in _trace raise ParseError(self, trace) ParseError: at add_object/checksum: File "neo/lib/protocol.py", line 553, in _encode assert len(checksum) == 20, (len(checksum), checksum) TypeError: object of type 'NoneType' has no len()
-
- 08 Jan, 2018 1 commit
-
-
Julien Muchembled authored
# Previous status The issue was that we had extreme storage fragmentation from the point of view of the replication algorithm, which processes one partition at a time. By using an autoincrement for the 'data' table, rows were ordered by the time at which they were added: - parts may be the result of replication -> ordered by partition, tid, oid - other rows are globally sorted by tid Which means that when scanning a given partition, many rows were skipped all the time: - if readahead is bigger enough, the efficiency is 1/N for a node with N partitions assigned - else, it is worse because it seeks all the time For huge databases, the replication was horribly slow, in particular from HDD. # Chosen solution This commit changes how ids are generated to somehow split 'data' per partition. The backend tracks 1 last id per assigned partition, where the 16 higher bits contains the partition. Keep in mind that the value of id has no meaning and it's only chosen for performance reasons. IOW, a row can be referred by an oid of a partition different than the 16 higher bits of id: - there's no migration needed and the 16 higher bits of all existing rows are 0 - in case of deduplication, a row can still be shared by different partitions Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement on existing databases. ## Downsides On insertion, increasing the number of partitions now slows down significantly: for 2 nodes using TokuDB, 4% for 180 partitions, 40% for 2000. For 12 partitions, the difference remains negligible. The solution for this issue will be to enable to increase the number of partitions efficiently, so that nodes can keep a small number of them, even for DB that are expected to grow so much that many nodes are added over time: such feature was already considered so that users don't have to worry anymore about this obscure setting at database creation. Read performance is only slowed down for applications that read a lot of data that were written contiguously, but split in small blocks. A solution is to extend ZODB so that the application tells it to chose new oids that will end up in the same partition. Like for insertion, there should not be too many partitions. With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to collect all last ids at startup when there are many partitions. ## Other advantages - The storage layout of data is now always the same and does not depend on whether rows came from replication or commits. - Efficient deletion of partition to free space in-place will be possible. # Considered alternative The only serious alternative was to replicate as many partitions as possible at the same time, ideally all assigned partitions, but it's not always possible. For best performance, it would often require to synchronize new nodes, or even all of them, so that thesource nodes don't have to scan 'data' several times. If existing nodes are kept, all data that aren't copied to the newly added nodes have to be skipped. If the number of nodes is multiplied by N, the efficiency is 1-1/N at best (synchronized nodes), else it's even worse because partitions are somehow shuffled. Checking/replacing a single node would remain slow when there are several source nodes. At last, such an algorithm would be much more complex and we would not have the other advantages listed above.
-
- 13 Dec, 2017 1 commit
-
-
Julien Muchembled authored
-
- 05 Dec, 2017 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 12 Jun, 2017 1 commit
-
-
Julien Muchembled authored
-
- 12 May, 2017 1 commit
-
-
Julien Muchembled authored
Since it's not worth anymore to keep track of the last connection activity (which, btw, ignored TCP ACKs, i.e. timeouts could theorically be triggered before all the data were actually sent), the semantics of closeClient has also changed. Before this commit, the 1-minute timeout was reset whenever there was activity (connection still used as server). Now, it happens exactly 100 seconds after the connection is not used anymore as client.
-
- 25 Apr, 2017 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 24 Apr, 2017 3 commits
-
-
Julien Muchembled authored
The election is not a separate process anymore. It happens during the RECOVERING phase, and there's no use of timeouts anymore. Each master node keeps a timestamp of when it started to play the primary role, and the node with the smallest timestamp is elected. The election stops when the cluster is started: as long as it is operational, the primary master can't be deposed. An election must happen whenever the cluster is not operational anymore, to handle the case of a network cut between a primary master and all other nodes: then another master node (secondary) takes over and when the initial primary master is back, it loses against the new primary master if the cluster is already started.
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 31 Mar, 2017 2 commits
-
-
Julien Muchembled authored
This is a follow up of commit 64afd7d2, which focused on read accesses when there is no transaction activity. This commit also includes a test to check a simpler scenario that the one described in the previous commit.
-
Julien Muchembled authored
This revert commit bddc1802, to fix the following storage crash: Traceback (most recent call last): ... File "neo/lib/handler.py", line 72, in dispatch method(conn, *args, **kw) File "neo/storage/handlers/master.py", line 44, in notifyPartitionChanges app.pt.update(ptid, cell_list, app.nm) File "neo/lib/pt.py", line 231, in update assert node is not None, 'No node found for uuid ' + uuid_str(uuid) AssertionError: No node found for uuid S3 Partitition table updates must also be processed with InitializationHandler when nodes remain in PENDING state because they're not added to the cluster.
-
- 23 Mar, 2017 4 commits
-
-
Julien Muchembled authored
In the worst case, with many clients trying to lock the same oids, the cluster could enter in an infinite cascade of deadlocks. Here is an overview with 3 storage nodes and 3 transactions: S1 S2 S3 order of locking tids # abbreviations: l1 l1 l2 123 # l: lock q23 q23 d1q3 231 # d: deadlock triggered r1:l3 r1:l2 (r1) # for S3, we still have l2 # q: queued d2q1 q13 q13 312 # r: rebase Above, we show what happens when a random transaction gets a lock just after that another is rebased. Here, the result is that the last 2 lines are a permutation of the first 2, and this can repeat indefinitely with bad luck. This commit reduces the probability of deadlock by processing delayed stores/checks in the order of their locking tid. In the above example, S1 would give the lock to 2 when 1 is rebased, and 2 would vote successfully.
-
Julien Muchembled authored
This fixes a bug that could to data corruption or crashes.
-
Julien Muchembled authored
It becomes possible to answer with several packets: - the last is the usual associated answer packet - all other (previously sent) packets are notifications Connection.send does not return the packet id anymore. This is not useful enough, and the caller can inspect the sent packet (getId).
-
Julien Muchembled authored
-
- 18 Mar, 2017 1 commit
-
-
Julien Muchembled authored
Traceback (most recent call last): ... File "neo/lib/handler.py", line 72, in dispatch method(conn, *args, **kw) File "neo/master/handlers/client.py", line 70, in askFinishTransaction conn.getPeerId(), File "neo/master/transactions.py", line 387, in prepare assert node_list, (ready, failed) AssertionError: (set([]), frozenset([])) Master log leading to the crash: PACKET #0x0009 StartOperation > S1 PACKET #0x0004 BeginTransaction < C1 DEBUG Begin <...> PACKET #0x0004 AnswerBeginTransaction > C1 PACKET #0x0001 NotifyReady < S1 It was wrong to process BeginTransaction before receiving NotifyReady. The changes in the storage are cosmetics: the 'ready' attribute has become redundant with 'operational'.
-
- 02 Mar, 2017 1 commit
-
-
Julien Muchembled authored
This is done by moving self.replicator.populate() after the switch to MasterOperationHandler, so that the latter is not delayed. This change comes with some refactoring of the main loop, to clean up app.checker and app.replicator properly (like app.tm). Another option could have been to process notifications with the last handler, instead of the first one. But if possible, cleaning up the whole code to not delay handlers anymore looks the best option.
-
- 27 Feb, 2017 1 commit
-
-
Julien Muchembled authored
This happened in 2 cases: - Commit a4c06242 ("Review aborting of transactions") introduced a race condition causing oids to remain write-locked forever after that the transaction modifying them is aborted. - An unfinished transaction is not locked/unlocked during tpc_finish: oids must be unlocked when being notified that the transaction is finished.
-
- 21 Feb, 2017 3 commits
-
-
Julien Muchembled authored
This is a first version with several optimizations possible: - improve EventQueue (or implement a specific queue) to minimize deadlocks - turn the RebaseObject packet into a notification Sorting oids could also be useful to reduce the probability of deadlocks, but that would never be enough to avoid them completely, even if there's a single storage. For example: 1. C1 does a first store (x or y) 2. C2 stores x and y; one is delayed 3. C1 stores the other -> deadlock When solving the deadlock, the data of the first store may only exist on the storage. 2 functional tests are removed because they're redundant, either with ZODB tests or with the new threaded tests.
-
Julien Muchembled authored
- Make sure that errors while processing a delayed packet are reported to the connection that sent this packet. - Provide a mechanism to process events for the same connection in chronological order.
-
Julien Muchembled authored
-
- 14 Feb, 2017 3 commits
-
-
Julien Muchembled authored
Fix conflict handling after a successful store to a node being disconnected for having missed a transaction
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 02 Feb, 2017 1 commit
-
-
Julien Muchembled authored
Now that we do inequality comparisons between timestamps, the master must use a monotonic clock, to avoid issues when the clock is turned back. Before, the probability that time.time() returned again the same value was probably negligible.
-