Commits · e1299714bcf64161e1cb08786f6b39e9496ed38b · nexedi / neoppod

31 Dec, 2018 5 commits
- wip · e1299714
  Julien Muchembled authored May 07, 2018
  
  e1299714
- neolog: do not die when a table is corrupted · 49e7d17f
  Julien Muchembled authored Dec 20, 2018
  
  49e7d17f
- neolog: add support for zstd-compressed logs · ad379295
  Julien Muchembled authored Dec 23, 2018
  
  ad379295
- neolog: do not hardcode default value of -L option in help message · 4a96c8b6
  Julien Muchembled authored Dec 07, 2018
  
  4a96c8b6
- fixup! New log format to show node id (and optionally cluster name) in node column · af53946c
  Julien Muchembled authored Dec 23, 2018
```
Commit aa4d621d broke log rotation,
and neolog sometimes failed to read the new format.
```
  af53946c
05 Dec, 2018 1 commit
- New log format to show node id (and optionally cluster name) in node column · aa4d621d
  Julien Muchembled authored Nov 25, 2018
```
neolog has new options: -N for old behaviour, and -C to show the cluster name.
```
  aa4d621d
21 Nov, 2018 4 commits

fixup! client: discard late answers to lockless writes · 8ef1ddba
Julien Muchembled authored Nov 09, 2018
```
Since commit 50e7fe52,
some code can be simplified.
```
8ef1ddba

client: fix race condition between Storage.load() and invalidations · a2e278d5

Julien Muchembled authored Nov 19, 2018

This fixes a bug that could manifest as follows:

  Traceback (most recent call last):
    File "neo/client/app.py", line 432, in load
      self._cache.store(oid, data, tid, next_tid)
    File "neo/client/cache.py", line 223, in store
      assert item.tid == tid, (item, tid)
  AssertionError: (<CacheItem oid='\x00\x00\x00\x00\x00\x00\x00\x01' tid='\x03\xcb\xc6\xca\xfd\xc7\xda\xee' next_tid='\x03\xcb\xc6\xca\xfd\xd8\t\x88' data='...' counter=1 level=1 expire=10000 prev=<...> next=<...>>, '\x03\xcb\xc6\xca\xfd\xd8\t\x88')

The big changes in the threaded test framework are required because we need to
reproduce a race condition between client threads, and this conflicts with the
serialization of epoll events (deadlock).

a2e278d5

client: fix race condition in refcounting dispatched answer packets · 743026d5

Julien Muchembled authored Nov 16, 2018

This was found when stress-testing a big cluster. 1 client node was stuck:

  (Pdb) pp app.dispatcher.__dict__
  {'lock_acquire': <built-in method acquire of thread.lock object at 0x7f788c6e4250>,
  'lock_release': <built-in method release of thread.lock object at 0x7f788c6e4250>,
  'message_table': {140155667614608: {},
                    140155668875280: {},
                    140155671145872: {},
                    140155672381008: {},
                    140155672381136: {},
                    140155672381456: {},
                    140155673002448: {},
                    140155673449680: {},
                    140155676093648: {170: <neo.lib.locking.SimpleQueue object at 0x7f788a109c58>},
                    140155677536464: {},
                    140155679224336: {},
                    140155679876496: {},
                    140155680702992: {},
                    140155681851920: {},
                    140155681852624: {},
                    140155682773584: {},
                    140155685988880: {},
                    140155693061328: {},
                    140155693062224: {},
                    140155693074960: {},
                    140155696334736: {278: <neo.lib.locking.SimpleQueue object at 0x7f788a109c58>},
                    140155696411408: {},
                    140155696414160: {},
                    140155696576208: {},
                    140155722373904: {}},
  'queue_dict': {140155673622936: 1, 140155689147480: 2}}

140155673622936 should not be in queue_dict

743026d5
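
The dump above suggests an invariant: every key of `queue_dict` should count the answers still expected for one queue in `message_table`. The queue at 0x7f788a109c58 (140155689147480 in decimal) is referenced by two pending messages and has a count of 2, whereas 140155673622936 matches no pending message. The following is a toy sketch of that bookkeeping, not NEO's actual dispatcher: the class `ToyDispatcher` and its methods are hypothetical names, and the only point illustrated is that both tables are updated under the same lock.

```python
import threading
from collections import Counter, deque


class ToyDispatcher:
    """Toy model (hypothetical, not NEO's code) of answer refcounting."""

    def __init__(self):
        self._lock = threading.Lock()
        self.message_table = {}  # id(conn) -> {msg_id: queue}
        self.queue_dict = {}     # id(queue) -> number of expected answers

    def register(self, conn, msg_id, queue):
        # Both tables change inside one critical section, so no other
        # thread can observe a registered message without its refcount.
        with self._lock:
            self.message_table.setdefault(id(conn), {})[msg_id] = queue
            key = id(queue)
            self.queue_dict[key] = self.queue_dict.get(key, 0) + 1

    def dispatch(self, conn, msg_id, packet):
        with self._lock:
            queue = self.message_table.get(id(conn), {}).pop(msg_id, None)
            if queue is None:
                return False  # late or unknown answer: nothing to update
            key = id(queue)
            remaining = self.queue_dict[key] - 1
            if remaining:
                self.queue_dict[key] = remaining
            else:
                del self.queue_dict[key]
        queue.append((conn, packet))
        return True

    def is_consistent(self):
        # The invariant: refcounts equal the number of pending messages
        # that reference each queue.
        with self._lock:
            expected = Counter(
                id(queue)
                for table in self.message_table.values()
                for queue in table.values())
            return dict(expected) == self.queue_dict
```

With this model, a state like the one in the dump (a `queue_dict` entry with no matching pending message) makes `is_consistent()` return `False`.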

More RTMIN+2 (log) information for clients and connections · 7e456329
Julien Muchembled authored Nov 14, 2018

7e456329

15 Nov, 2018 3 commits
- storage: check for conflicts when notifying that a partition is replicated · d66b4f24
  Julien Muchembled authored Nov 06, 2018
  
  d66b4f24
- storage: clarify several assertions · f25b8ee3
  Julien Muchembled authored Nov 07, 2018
  
  f25b8ee3
- qa: new expectedFailure testcase method · 4150ffb1
  Julien Muchembled authored Nov 07, 2018
```
The idea is to write:

  with self.expectedFailure(...):

just before the statement that is expected to fail. Contrary to the existing
decorator, we want to:
- be sure that the test fails at the expected line;
- be able to remove an expectedFailure without touching the code around.
```
  4150ffb1
08 Nov, 2018 15 commits

client: merge ConnectionPool inside Application · 7494de84
Julien Muchembled authored Oct 17, 2018

7494de84
client: prepare merge of ConnectionPool inside Application · 693aaf79
Julien Muchembled authored Nov 08, 2018

693aaf79

client: fix AssertionError when trying to reconnect too quickly after an error · 305dda86

Julien Muchembled authored Oct 17, 2018

When ConnectionPool._initNodeConnection fails a first time with:

  StorageError: protocol error: already connected

the following assertion failure happens when trying to reconnect before the
previous connection is actually closed (currently, only the node sending an
error message closes the connection, as commented in EventHandler):

  Traceback (most recent call last):
    File "neo/client/Storage.py", line 82, in load
      return self.app.load(oid)[:2]
    File "neo/client/app.py", line 367, in load
      data, tid, next_tid, _ = self._loadFromStorage(oid, tid, before_tid)
    File "neo/client/app.py", line 399, in _loadFromStorage
      askStorage)
    File "neo/client/app.py", line 293, in _askStorageForRead
      conn = cp.getConnForNode(node)
    File "neo/client/pool.py", line 98, in getConnForNode
      conn = self._initNodeConnection(node)
    File "neo/client/pool.py", line 48, in _initNodeConnection
      dispatcher=app.dispatcher)
    File "neo/lib/connection.py", line 704, in __init__
      super(MTClientConnection, self).__init__(*args, **kwargs)
    File "neo/lib/connection.py", line 602, in __init__
      node.setConnection(self)
    File "neo/lib/node.py", line 122, in setConnection
      attributeTracker.whoSet(self, '_connection'))
  AssertionError

305dda86

qa: fix attributeTracker · 163858ed
Julien Muchembled authored Oct 17, 2018

163858ed
storage: fix storage leak when an oid is stored several times within a transaction · fa14157b
Julien Muchembled authored Oct 15, 2018

fa14157b

client: discard late answers to lockless writes · 50e7fe52

Julien Muchembled authored Oct 09, 2018

This fixes:

  Traceback (most recent call last):
    File "neo/client/Storage.py", line 108, in tpc_vote
      return self.app.tpc_vote(transaction)
    File "neo/client/app.py", line 546, in tpc_vote
      self.waitStoreResponses(txn_context)
    File "neo/client/app.py", line 539, in waitStoreResponses
      _waitAnyTransactionMessage(txn_context)
    File "neo/client/app.py", line 160, in _waitAnyTransactionMessage
      self._handleConflicts(txn_context)
    File "neo/client/app.py", line 514, in _handleConflicts
      self._store(txn_context, oid, serial, data)
    File "neo/client/app.py", line 452, in _store
      self._waitAnyTransactionMessage(txn_context, False)
    File "neo/client/app.py", line 155, in _waitAnyTransactionMessage
      self._waitAnyMessage(queue, block=block)
    File "neo/client/app.py", line 142, in _waitAnyMessage
      _handlePacket(conn, packet, kw)
    File "neo/lib/threaded_app.py", line 133, in _handlePacket
      handler.dispatch(conn, packet, kw)
    File "neo/lib/handler.py", line 72, in dispatch
      method(conn, *args, **kw)
    File "neo/client/handlers/storage.py", line 143, in answerRebaseObject
      assert cached == data
  AssertionError

50e7fe52

qa: in threaded tests, log queued packets when "tic is looping forever" · 82672031
Julien Muchembled authored Oct 15, 2018

82672031
In logs, dump the partition table in a more compact and readable way · 323fd636
Julien Muchembled authored Oct 05, 2018

323fd636

storage: fix write-locking bug when a deadlock happens at the end of a replication · 7fff11f6

Julien Muchembled authored Oct 05, 2018

During rebase, writes could stay lockless although the partition was
replicated. Another transaction could then take locks prematurely, leading to
the following crash:

  Traceback (most recent call last):
    File "neo/lib/handler.py", line 72, in dispatch
      method(conn, *args, **kw)
    File "neo/storage/handlers/master.py", line 36, in notifyUnlockInformation
      self.app.tm.unlock(ttid)
    File "neo/storage/transactions.py", line 329, in unlock
      self.abort(ttid, even_if_locked=True)
    File "neo/storage/transactions.py", line 573, in abort
      not self._replicated.get(self.getPartition(oid))), x
  AssertionError: ('\x00\x00\x00\x00\x00\x03\x03v', '\x03\xca\xb44J\x13\x99\x88', '\x03\xca\xb44J\xe0\xdcU', {}, set(['\x00\x00\x00\x00\x00\x03\x03v']))

7fff11f6

client: log_flush most exceptions raised from Application to ZODB · efaae043
Julien Muchembled authored Oct 03, 2018
```
Flushing logs will help fix NEO bugs (e.g. failed assertions).
```
efaae043

client: fix assertion failure in case of conflict + storage disconnection · a746f812

Julien Muchembled authored Oct 02, 2018

This fixes:

  Traceback (innermost last):
    ...
    Module transaction._transaction, line 393, in _commitResources
      rm.tpc_vote(self)
    Module ZODB.Connection, line 797, in tpc_vote
      s = vote(transaction)
    Module neo.client.Storage, line 95, in tpc_vote
      return self.app.tpc_vote(transaction)
    Module neo.client.app, line 546, in tpc_vote
      self.waitStoreResponses(txn_context)
    Module neo.client.app, line 539, in waitStoreResponses
      _waitAnyTransactionMessage(txn_context)
    Module neo.client.app, line 160, in _waitAnyTransactionMessage
      self._handleConflicts(txn_context)
    Module neo.client.app, line 471, in _handleConflicts
      assert oid is None, (oid, serial)
  AssertionError: ('\x00\x00\x00\x00\x00\x02\n\xe3', '\x03\xca\xad\xcb!\x92\xb6\x9c')

a746f812

client: simplify connection management in transaction contexts · 2851a274
Julien Muchembled authored Oct 01, 2018
```
With the previous commit, there is no longer any point in distinguishing
storage nodes for which we only check serials.
```
2851a274

client: also vote to nodes that only check serials · ab435b28

Julien Muchembled authored Oct 01, 2018

Not doing so was an incorrect optimization. Checking serials does take
write-locks, and they must not be released when a client-storage connection
breaks between vote and lock; otherwise, a concurrent transaction modifying
such serials may finish first.

ab435b28

qa: deindent code · d7245ee9
Julien Muchembled authored Sep 28, 2018

d7245ee9
Bump protocol version · 9a5b46dd
Julien Muchembled authored Oct 02, 2018

9a5b46dd

07 Nov, 2018 8 commits

client: fix undetected disconnections to storage nodes during commit · d68e9053

Julien Muchembled authored Sep 18, 2018

When a client-storage connection breaks, the storage node discards data of all
ongoing transactions by the client. Therefore, a reconnection within the
context of the transaction is wrong, as it could lead to partially-written
transactions.

This fixes cases where such reconnection happened. The biggest issue was that
the mechanism to dispatch disconnection events only works when waiting for an
answer.

The client can still reconnect for other purposes but the new connection won't
be reused by transactions that already involved the storage node.

d68e9053
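
The rule stated above (a transaction must not silently switch to a new connection to a storage node it already involved) can be modelled as follows. This is only a toy sketch: `ToyConnection`, `ToyPool`, `ToyTxnContext` and `StorageDisconnected` are hypothetical names and do not reflect NEO's actual classes or API.

```python
class StorageDisconnected(Exception):
    """The transaction lost a storage node it had already written to."""


class ToyConnection:
    def __init__(self, node):
        self.node = node
        self.closed = False


class ToyPool:
    """Client-wide pool: free to reconnect for other purposes."""

    def __init__(self):
        self._conns = {}

    def get(self, node):
        conn = self._conns.get(node)
        if conn is None or conn.closed:
            conn = self._conns[node] = ToyConnection(node)
        return conn


class ToyTxnContext:
    """Per-transaction view: remembers the first connection to each node."""

    def __init__(self, pool):
        self._pool = pool
        self._involved = {}  # node -> connection first used by this txn

    def get_connection(self, node):
        try:
            conn = self._involved[node]
        except KeyError:
            # First use of this node by the transaction.
            conn = self._involved[node] = self._pool.get(node)
            return conn
        if conn.closed:
            # The storage discarded our data when the connection broke,
            # so a newer connection must not be reused here.
            raise StorageDisconnected(node)
        return conn
```

In this model, the disconnection is detected the next time the transaction touches the node, even if no answer was being awaited when the connection broke, while a new transaction simply gets a fresh connection from the pool.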

Fix data corruption due to undetected conflicts after storage failures · 854a4920

Julien Muchembled authored Sep 24, 2018

Without this new mechanism to detect oids that aren't write-locked,
a transaction could be committed successfully without detecting conflicts.
In the added test, the resulting value was 2, whereas it would have been 5
had there been no node failure.

854a4920

master: notify replicating nodes of aborted watched transactions · 59698faa
Julien Muchembled authored Sep 18, 2018
```
This fixes stuck replication when a client loses connection to the master
during a commit.
```
59698faa
New neoctl command to flush the logs of all nodes in the cluster · 64826794
Julien Muchembled authored Sep 19, 2018

64826794
storage: fix premature write-locking during rebase when replication ends · edf58ece
Julien Muchembled authored Sep 20, 2018

edf58ece

client: fix race condition when a storage connection is closed just after identification · bf6569d6

Julien Muchembled authored Sep 20, 2018

The consequence was that the client never reconnected to that storage node.
On commits, writes to that node always failed, causing the master to
disconnect it.

bf6569d6

storage: relax assertion · 21a61977

Julien Muchembled authored Sep 19, 2018

Nothing wrong actually happens.

Traceback (most recent call last):
  File "neo/scripts/neostorage.py", line 32, in main
    app.run()
  File "neo/storage/app.py", line 194, in run
    self._run()
  File "neo/storage/app.py", line 225, in _run
    self.doOperation()
  File "neo/storage/app.py", line 310, in doOperation
    poll()
  File "neo/storage/app.py", line 134, in _poll
    self.em.poll(1)
  File "neo/lib/event.py", line 168, in poll
    self._poll(0)
  File "neo/lib/event.py", line 220, in _poll
    if conn.readable():
  File "neo/lib/connection.py", line 483, in readable
    self._closure()
  File "neo/lib/connection.py", line 541, in _closure
    self.close()
  File "neo/lib/connection.py", line 533, in close
    handler.connectionClosed(self)
  File "neo/storage/handlers/client.py", line 46, in connectionClosed
    app.tm.abortFor(conn.getUUID())
  File "neo/storage/transactions.py", line 594, in abortFor
    self.abort(ttid)
  File "neo/storage/transactions.py", line 570, in abort
    self._replicated.get(self.getPartition(oid))), x
AssertionError: ('\x00\x00\x00\x00\x00\x01a\xe5', '\x03\xcaZ\x04\x14o\x8e\xbb', '\x03\xcaZ\x04\x0eX{\xbb', {1: None, 21: '\x03\xcaZ\x04\x11\xc6\x94\xf6'}, set([]))

21a61977

comments, unused import · 1551c4a9
Julien Muchembled authored Sep 11, 2018

1551c4a9

05 Nov, 2018 2 commits

storage: fix write-lock leak · ce42103a
Julien Muchembled authored Sep 11, 2018

ce42103a

client: fix possible corruption in case of network failure with a storage · 04ae2fc0

Julien Muchembled authored Sep 07, 2018

In case of storage disconnection, one packet (VoteTransaction) was not handled
the same way as other writes, and the failure was not reported to the master,
which therefore could not arbitrate the vote. As a result, the transaction was
partially committed.

04ae2fc0

06 Sep, 2018 2 commits

qa: comment about potential freeze when a functional test ends · 61e4ffa1
Julien Muchembled authored Sep 06, 2018

61e4ffa1

storage: fix assertion failure in case of connection reset with a client node · 652f1f0d

Julien Muchembled authored Sep 05, 2018

Here is what happened after simulating a network failure between a client and
a storage:

DEBUG recv failed for <SSLSocketConnectorIPv6 at 0x7f8198027f90 fileno 17 ('xxxx:xxxx:120:cd8::90a1', 53970), opened to ('xxxx:xxxx:60:4c2c::25c3', 39085)>: ECONNRESET (Connection reset by peer)
DEBUG connection closed for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, closed, client) at 7f81939a0950>
DEBUG connection started for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, fd=17, on_close=onConnectionClosed, connecting, client) at 7f8192eb17d0>
PACKET #0x0000 RequestIdentification > S2 ([xxxx:xxxx:60:4c2c::25c3]:39085) | (<EnumItem CLIENT (2)>, -536870904, None, '...', [], 1535555463.455761)
DEBUG SSL handshake done for <SSLSocketConnectorIPv6 at 0x7f8192eb1850 fileno 17 ('xxxx:xxxx:120:cd8::90a1', 54014), opened to ('xxxx:xxxx:60:4c2c::25c3', 39085)>: ECDHE-RSA-AES256-GCM-SHA384 256
DEBUG connection completed for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, fd=17, on_close=onConnectionClosed, client) at 7f8192eb17d0> (from xxxx:xxxx:120:cd8::90a1:54014)
DEBUG <SSLSocketConnectorIPv6 at 0x7f8192eb1850 fileno 17 ('xxxx:xxxx:120:cd8::90a1', 54014), opened to ('xxxx:xxxx:60:4c2c::25c3', 39085)> closed in recv
DEBUG connection closed for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, closed, client) at 7f8192eb17d0>
ERROR Connection to <StorageNode(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, state=RUNNING, connection=None, not identified) at 7f81a8874690> failed

DEBUG accepted a connection from xxxx:xxxx:120:cd8::90a1:54014
DEBUG SSL handshake done for <SSLSocketConnectorIPv6 at 0x7f657144a910 fileno 22 ('xxxx:xxxx:60:4c2c::25c3', 39085), opened from ('xxxx:xxxx:120:cd8::90a1', 54014)>: ECDHE-RSA-AES256-GCM-SHA384 256
DEBUG connection completed for <ServerConnection(uuid=None, address=[xxxx:xxxx:120:cd8::90a1]:54014, handler=IdentificationHandler, fd=22, server) at 7f657144a090> (from xxxx:xxxx:60:4c2c::25c3:39085)
PACKET #0x0000 RequestIdentification < None ([xxxx:xxxx:120:cd8::90a1]:54014) | (<EnumItem CLIENT (2)>, -536870904, None, '...', [], 1535555463.455761)
DEBUG connection closed for <ServerConnection(uuid=None, address=[xxxx:xxxx:120:cd8::90a1]:54014, handler=IdentificationHandler, closed, server) at 7f657144a090>
WARNING A connection was lost during identification
ERROR Pre-mortem data:
ERROR Traceback (most recent call last):
ERROR File "neo/storage/app.py", line 194, in run
ERROR self._run()
ERROR File "neo/storage/app.py", line 225, in _run
ERROR self.doOperation()
ERROR File "neo/storage/app.py", line 310, in doOperation
ERROR poll()
ERROR File "neo/storage/app.py", line 134, in _poll
ERROR self.em.poll(1)
ERROR File "neo/lib/event.py", line 160, in poll
ERROR to_process.process()
ERROR File "neo/lib/connection.py", line 499, in process
ERROR self._handlers.handle(self, self._queue.pop(0))
ERROR File "neo/lib/connection.py", line 85, in handle
ERROR self._handle(connection, packet)
ERROR File "neo/lib/connection.py", line 100, in _handle
ERROR pending[0][1].packetReceived(connection, packet)
ERROR File "neo/lib/handler.py", line 123, in packetReceived
ERROR self.dispatch(*args)
ERROR File "neo/lib/handler.py", line 72, in dispatch
ERROR method(conn, *args, **kw)
ERROR File "neo/storage/handlers/identification.py", line 56, in requestIdentification
ERROR assert not node.isConnected(), node
ERROR AssertionError: <ClientNode(uuid=C8, state=RUNNING, connection=<ServerConnection(uuid=C8, address=[xxxx:xxxx:120:cd8::90a1]:53970, handler=ClientOperationHandler, fd=18, on_close=onConnectionClosed, server) at 7f657147d7d0>) at 7f65714d6cd0>

652f1f0d