- 23 Mar, 2017 9 commits
-
-
Julien Muchembled authored
In the worst case, with many clients trying to lock the same oids, the cluster could enter an infinite cascade of deadlocks. Here is an overview with 3 storage nodes and 3 transactions:

    S1     S2     S3     order of locking tids   # abbreviations:
    l1     l1     l2     123                     # l: lock
    q23    q23    d1q3   231                     # d: deadlock triggered
    r1:l3  r1:l2  (r1)                           # for S3, we still have l2
                                                 # q: queued
    d2q1   q13    q13    312                     # r: rebase

Above, we show what happens when a random transaction gets a lock just after another is rebased. Here, the result is that the last 2 lines are a permutation of the first 2, and this can repeat indefinitely with bad luck.

This commit reduces the probability of deadlock by processing delayed stores/checks in the order of their locking tid. In the above example, S1 would give the lock to 2 when 1 is rebased, and 2 would vote successfully.
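As an illustration of this mitigation, here is a minimal, hypothetical sketch (the class and method names are invented, not NEO's actual code) of replaying delayed stores/checks in ascending locking-tid order when an oid is unlocked:

    class DelayedRequests(object):
        """Per-oid queues of delayed store/check requests, replayed in
        ascending locking-tid order so that the oldest transaction is
        served first when the write-lock is released."""

        def __init__(self):
            self._delayed = {}  # oid -> list of (locking_tid, request)

        def delay(self, oid, locking_tid, request):
            self._delayed.setdefault(oid, []).append((locking_tid, request))

        def replay(self, oid, process):
            # Called when the write-lock on `oid` is released: requests are
            # handed back lowest locking tid first, instead of in the random
            # order in which they happened to be delayed.
            for locking_tid, request in sorted(
                    self._delayed.pop(oid, []), key=lambda x: x[0]):
                process(locking_tid, request)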
-
Julien Muchembled authored
-
Julien Muchembled authored
This fixes a bug that could lead to data corruption or crashes.
-
Julien Muchembled authored
It becomes possible to answer with several packets:
- the last is the usual associated answer packet
- all other (previously sent) packets are notifications

Connection.send does not return the packet id anymore. This is not useful enough, and the caller can inspect the sent packet (getId).
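A hedged sketch of the resulting caller-side pattern (Packet and Connection below are simplified stand-ins, not the real NEO classes): since send() no longer returns the id, the caller keeps a reference to the packet and asks it afterwards.

    class Packet(object):
        def setId(self, packet_id):
            self._id = packet_id

        def getId(self):
            return self._id

    class Connection(object):
        def __init__(self):
            self._next_id = 0

        def send(self, packet):
            # allocate an id, serialize, write to the socket... return nothing
            packet.setId(self._next_id)
            self._next_id += 1

    conn = Connection()
    notification = Packet()
    answer = Packet()
    conn.send(notification)  # previously sent packets act as notifications
    conn.send(answer)        # the last packet is the usual answer
    print(answer.getId())    # the caller inspects the sent packet instead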
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 22 Mar, 2017 1 commit
-
-
Julien Muchembled authored
In reality, this was tested with taskset 1 neotestrunner ...
-
- 21 Mar, 2017 1 commit
-
-
Julien Muchembled authored
-
- 20 Mar, 2017 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 18 Mar, 2017 1 commit
-
-
Julien Muchembled authored
Traceback (most recent call last):
  ...
  File "neo/lib/handler.py", line 72, in dispatch
    method(conn, *args, **kw)
  File "neo/master/handlers/client.py", line 70, in askFinishTransaction
    conn.getPeerId(),
  File "neo/master/transactions.py", line 387, in prepare
    assert node_list, (ready, failed)
AssertionError: (set([]), frozenset([]))

Master log leading to the crash:

  PACKET #0x0009 StartOperation          > S1
  PACKET #0x0004 BeginTransaction        < C1
  DEBUG  Begin <...>
  PACKET #0x0004 AnswerBeginTransaction  > C1
  PACKET #0x0001 NotifyReady             < S1

It was wrong to process BeginTransaction before receiving NotifyReady. The changes in the storage are cosmetic: the 'ready' attribute has become redundant with 'operational'.
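A simplified, hypothetical sketch of the ordering constraint described above (the names below are invented and not the real master code): requests to begin a transaction are delayed until at least one storage node has sent NotifyReady, rather than being answered as soon as StartOperation went out.

    class MasterSketch(object):
        def __init__(self):
            self.ready_storages = set()
            self._delayed_begin = []

        def notifyReady(self, storage):
            self.ready_storages.add(storage)
            delayed, self._delayed_begin = self._delayed_begin, []
            for client in delayed:
                self.askBeginTransaction(client)

        def askBeginTransaction(self, client):
            if not self.ready_storages:
                # Answering now would later make tpc_finish fail with an
                # empty node list, as in the traceback above.
                self._delayed_begin.append(client)
                return
            client.answer('AnswerBeginTransaction')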
-
- 17 Mar, 2017 3 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
Due to a bug in MariaDB Connector/C 2.3.2, some tests like testBasicStore and test_max_allowed_packet were retrying the same failing query indefinitely.
-
Julien Muchembled authored
-
- 14 Mar, 2017 4 commits
-
-
Julien Muchembled authored
On clusters with many deadlock avoidances, the logs were flooded. This commit should reduce the size of the logs without losing information.
-
Julien Muchembled authored
An issue that happened for the first time on a storage node didn't always cause other nodes to flush their logs, which made debugging difficult.
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 07 Mar, 2017 1 commit
-
-
Julien Muchembled authored
-
- 03 Mar, 2017 1 commit
-
-
Julien Muchembled authored
Generators are not thread-safe:

Exception in thread T2:
Traceback (most recent call last):
  ...
  File "ZODB/tests/StorageTestBase.py", line 157, in _dostore
    r2 = self._storage.tpc_vote(t)
  File "neo/client/Storage.py", line 95, in tpc_vote
    return self.app.tpc_vote(transaction)
  File "neo/client/app.py", line 507, in tpc_vote
    self.waitStoreResponses(txn_context)
  File "neo/client/app.py", line 500, in waitStoreResponses
    _waitAnyTransactionMessage(txn_context)
  File "neo/client/app.py", line 145, in _waitAnyTransactionMessage
    self._waitAnyMessage(queue, block=block)
  File "neo/client/app.py", line 128, in _waitAnyMessage
    conn, packet, kw = get(block)
  File "neo/lib/locking.py", line 203, in get
    self._lock()
  File "neo/tests/threaded/__init__.py", line 590, in _lock
    for i in TIC_LOOP:
ValueError: generator already executing

======================================================================
FAIL: check_checkCurrentSerialInTransaction (neo.tests.zodb.testBasic.BasicTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "neo/tests/zodb/testBasic.py", line 33, in check_checkCurrentSerialInTransaction
    super(BasicTests, self).check_checkCurrentSerialInTransaction()
  File "ZODB/tests/BasicStorage.py", line 294, in check_checkCurrentSerialInTransaction
    utils.load_current(self._storage, b'\0\0\0\0\0\0\0\xf4')[1])
failureException: False is not true
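For context, the underlying CPython behaviour can be reproduced outside NEO with a few lines: two threads driving the same generator object race on its frame, and the loser gets the same ValueError as above (how often this fires depends on thread scheduling).

    import threading

    def ticker():
        while True:
            sum(range(100))  # do a little work inside the generator frame
            yield

    shared = ticker()  # a single generator object shared by both threads

    def spin():
        try:
            for _ in range(100000):
                next(shared)
        except ValueError as e:
            print(e)  # ValueError: generator already executing

    threads = [threading.Thread(target=spin) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()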
-
- 02 Mar, 2017 2 commits
-
-
Julien Muchembled authored
This is done by moving self.replicator.populate() after the switch to MasterOperationHandler, so that the latter is not delayed. This change comes with some refactoring of the main loop, to clean up app.checker and app.replicator properly (like app.tm).

Another option could have been to process notifications with the last handler, instead of the first one. But if possible, cleaning up the whole code so that handlers are not delayed anymore looks like the best option.
-
Julien Muchembled authored
-
- 27 Feb, 2017 3 commits
-
-
Julien Muchembled authored
This happened in 2 cases:
- Commit a4c06242 ("Review aborting of transactions") introduced a race condition causing oids to remain write-locked forever after the transaction modifying them is aborted.
- An unfinished transaction is not locked/unlocked during tpc_finish: oids must be unlocked when being notified that the transaction is finished.
-
Julien Muchembled authored
This was found by the first assertion of answerRebaseObject (client) because a storage node missed a few transactions and reported a conflict with an older serial than the one being stored: this must never happen, and this commit adds a more generic assertion on the storage side.

The above case is when the "first phase" of replication of a partition (all history up to the tid before unfinished transactions) ended after the unfinished transactions were finished: this was a corruption bug, where UP_TO_DATE cells could miss data.

Otherwise, if the "first phase" ended before, the partition remained stuck in the OUT_OF_DATE state. Restarting the storage node was enough to recover.
-
Julien Muchembled authored
Traceback (most recent call last):
  ...
  File "neo/client/app.py", line 507, in tpc_vote
    self.waitStoreResponses(txn_context)
  File "neo/client/app.py", line 500, in waitStoreResponses
    _waitAnyTransactionMessage(txn_context)
  File "neo/client/app.py", line 150, in _waitAnyTransactionMessage
    self._handleConflicts(txn_context)
  File "neo/client/app.py", line 474, in _handleConflicts
    self._store(txn_context, oid, conflict_serial, data)
  File "neo/client/app.py", line 410, in _store
    self._waitAnyTransactionMessage(txn_context, False)
  File "neo/client/app.py", line 145, in _waitAnyTransactionMessage
    self._waitAnyMessage(queue, block=block)
  File "neo/client/app.py", line 133, in _waitAnyMessage
    _handlePacket(conn, packet, kw)
  File "neo/lib/threaded_app.py", line 133, in _handlePacket
    handler.dispatch(conn, packet, kw)
  File "neo/lib/handler.py", line 72, in dispatch
    method(conn, *args, **kw)
  File "neo/client/handlers/storage.py", line 122, in answerRebaseObject
    assert txn_context.conflict_dict[oid] == (serial, conflict)
AssertionError

Scenario:
0. unanswered rebase from S2
1. conflict resolved between t1 and t2 -> S1 & S2
2. S1 reports a new conflict
3. S2 answers to the rebase: returned serial (t1) is smaller than in conflict_dict (t2)
4. S2 reports the same conflict as in 2
-
- 24 Feb, 2017 2 commits
-
-
Julien Muchembled authored
Traceback (most recent call last):
  ...
  File "neo/storage/handlers/storage.py", line 111, in answerFetchObjects
    self.app.replicator.finish()
  File "neo/storage/replicator.py", line 370, in finish
    self._nextPartition()
  File "neo/storage/replicator.py", line 279, in _nextPartition
    assert app.pt.getCell(offset, app.uuid).isOutOfDate()
AssertionError

The scenario is:
1. partition A: start of replication, with unfinished transactions
2. partition A: all unfinished transactions are finished
3. partition A: end of replication with ReplicationDone notification
4. replication of partition B
5. partition A: AssertionError when starting replication

The bug is that in step 3, partition A is only partially replicated, so the storage node must not notify the master.
-
Julien Muchembled authored
-
- 23 Feb, 2017 1 commit
-
-
Julien Muchembled authored
This fixes testBasicStore when run with the MySQL backend; it started to fail with commit 9eb06ff1 when the -L runner option is not used.
-
- 21 Feb, 2017 6 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
This is a first version with several optimizations possible:
- improve EventQueue (or implement a specific queue) to minimize deadlocks
- turn the RebaseObject packet into a notification

Sorting oids could also be useful to reduce the probability of deadlocks, but that would never be enough to avoid them completely, even if there's a single storage. For example:
1. C1 does a first store (x or y)
2. C2 stores x and y; one is delayed
3. C1 stores the other -> deadlock

When solving the deadlock, the data of the first store may only exist on the storage.

2 functional tests are removed because they're redundant, either with ZODB tests or with the new threaded tests.
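To make the sorting-oids idea above concrete, here is a hedged illustration with plain threading locks instead of NEO's write-locks: acquiring the locks for all oids in one global order removes the cyclic wait between two clients, but only when each client knows its full set of oids up front, which is exactly the assumption that does not hold in the scenario above.

    import threading

    locks = {'x': threading.Lock(), 'y': threading.Lock()}

    def store_all(oids):
        # Take the write-locks in sorted order so that two concurrent
        # clients cannot each hold one lock while waiting for the other.
        held = sorted(oids)
        for oid in held:
            locks[oid].acquire()
        try:
            pass  # ... write the data ...
        finally:
            for oid in held:
                locks[oid].release()

    c1 = threading.Thread(target=store_all, args=(['y', 'x'],))
    c2 = threading.Thread(target=store_all, args=(['x', 'y'],))
    c1.start(); c2.start()
    c1.join(); c2.join()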
-
Julien Muchembled authored
- Make sure that errors while processing a delayed packet are reported to the connection that sent this packet.
- Provide a mechanism to process events for the same connection in chronological order.
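A minimal sketch of these two properties, with invented names (the real EventQueue is more involved): delayed packets are replayed in the order they arrived, and a failure while replaying one is reported to the connection that sent that packet, not to whoever triggered the replay.

    from collections import deque

    class EventQueueSketch(object):
        def __init__(self):
            self._queue = deque()  # (conn, callback, packet) in arrival order

        def delay(self, conn, callback, packet):
            self._queue.append((conn, callback, packet))

        def executeQueuedEvents(self):
            while self._queue:
                conn, callback, packet = self._queue.popleft()
                try:
                    callback(conn, packet)
                except Exception as e:
                    # The error goes back to the connection that sent the
                    # delayed packet, keeping request/answer pairing intact.
                    conn.sendError(packet, e)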
-
Julien Muchembled authored
-
- 14 Feb, 2017 3 commits
-
-
Julien Muchembled authored
Fix conflict handling after a successful store to a node being disconnected for having missed a transaction
-
Julien Muchembled authored
-
Julien Muchembled authored
-