  1. 31 Dec, 2018 5 commits
  2. 05 Dec, 2018 1 commit
  3. 21 Nov, 2018 4 commits
    • fixup! client: discard late answers to lockless writes · 8ef1ddba
      Julien Muchembled authored
      Since commit 50e7fe52, some code can be simplified.
    • client: fix race condition between Storage.load() and invalidations · a2e278d5
      Julien Muchembled authored
      This fixes a bug that could manifest as follows:
      
        Traceback (most recent call last):
          File "neo/client/app.py", line 432, in load
            self._cache.store(oid, data, tid, next_tid)
          File "neo/client/cache.py", line 223, in store
            assert item.tid == tid, (item, tid)
        AssertionError: (<CacheItem oid='\x00\x00\x00\x00\x00\x00\x00\x01' tid='\x03\xcb\xc6\xca\xfd\xc7\xda\xee' next_tid='\x03\xcb\xc6\xca\xfd\xd8\t\x88' data='...' counter=1 level=1 expire=10000 prev=<...> next=<...>>, '\x03\xcb\xc6\xca\xfd\xd8\t\x88')
      
      The big changes in the threaded test framework are required because we need to
      reproduce a race condition between client threads, which conflicts with the
      serialization of epoll events (deadlock).
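      The kind of race described above can be sketched as follows. This is a
      minimal illustration, not the code of the commit: LoadTracker and its
      methods (begin_load, invalidate, end_load) are hypothetical names, and
      it assumes a single load per oid at a time. Invalidations received
      while a load is in flight are recorded, so that the loaded revision is
      cached with the right next_tid instead of being stored as current.

```python
import threading


class LoadTracker(object):
    """Toy cache front-end; an illustration, not NEO's client code."""

    def __init__(self):
        self._lock = threading.Lock()
        self._loading = {}  # oid -> tids invalidated while a load is running
        self.cache = {}     # (oid, tid) -> (data, next_tid)

    def begin_load(self, oid):
        # Called before asking the storage node for `oid`.
        with self._lock:
            self._loading[oid] = []

    def invalidate(self, oid, tid):
        # Called by the thread that receives invalidations.
        with self._lock:
            if oid in self._loading:
                self._loading[oid].append(tid)
            # Close the validity range of an already cached current revision.
            for (o, t), (data, next_tid) in list(self.cache.items()):
                if o == oid and next_tid is None and t < tid:
                    self.cache[o, t] = (data, tid)

    def end_load(self, oid, data, tid, next_tid):
        # Called with the storage answer; next_tid is None if the storage
        # node believed `tid` was still the latest revision.
        with self._lock:
            newer = [t for t in self._loading.pop(oid) if t > tid]
            if next_tid is None and newer:
                next_tid = min(newer)
            self.cache[oid, tid] = (data, next_tid)
            return next_tid
```

      Both entry points take the same lock, so an invalidation is either
      seen by the pending load or applied to the item it has just cached.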
    • client: fix race condition in refcounting dispatched answer packets · 743026d5
      Julien Muchembled authored
      This was found when stress-testing a big cluster. One client node was stuck:
      
        (Pdb) pp app.dispatcher.__dict__
        {'lock_acquire': <built-in method acquire of thread.lock object at 0x7f788c6e4250>,
        'lock_release': <built-in method release of thread.lock object at 0x7f788c6e4250>,
        'message_table': {140155667614608: {},
                          140155668875280: {},
                          140155671145872: {},
                          140155672381008: {},
                          140155672381136: {},
                          140155672381456: {},
                          140155673002448: {},
                          140155673449680: {},
                          140155676093648: {170: <neo.lib.locking.SimpleQueue object at 0x7f788a109c58>},
                          140155677536464: {},
                          140155679224336: {},
                          140155679876496: {},
                          140155680702992: {},
                          140155681851920: {},
                          140155681852624: {},
                          140155682773584: {},
                          140155685988880: {},
                          140155693061328: {},
                          140155693062224: {},
                          140155693074960: {},
                          140155696334736: {278: <neo.lib.locking.SimpleQueue object at 0x7f788a109c58>},
                          140155696411408: {},
                          140155696414160: {},
                          140155696576208: {},
                          140155722373904: {}},
        'queue_dict': {140155673622936: 1, 140155689147480: 2}}
      
      140155673622936 should not be in queue_dict.
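      The invariant at stake can be sketched as follows. This is a toy model
      under stated assumptions, not NEO's dispatcher: the attribute names
      message_table and queue_dict are taken from the dump above, while the
      class name and the register/dispatch methods are hypothetical. The
      table of expected answers and the per-queue reference count are always
      updated together under one lock, so queue_dict never keeps a queue for
      which no answer is expected anymore.

```python
import threading
from collections import defaultdict


class RefcountingDispatcher(object):
    """Toy dispatcher; an illustration, not NEO's implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        # conn_id -> {msg_id: queue waiting for the answer}
        self.message_table = defaultdict(dict)
        # id(queue) -> number of answers still expected on that queue
        self.queue_dict = {}

    def register(self, conn_id, msg_id, queue):
        # Record an expected answer and take a reference on its queue.
        with self._lock:
            self.message_table[conn_id][msg_id] = queue
            key = id(queue)
            self.queue_dict[key] = self.queue_dict.get(key, 0) + 1

    def dispatch(self, conn_id, msg_id, answer):
        # Deliver an answer; return False if nobody expects it (anymore).
        with self._lock:
            queue = self.message_table[conn_id].pop(msg_id, None)
            if queue is None:
                return False
            key = id(queue)
            count = self.queue_dict[key] - 1
            if count:
                self.queue_dict[key] = count
            else:
                del self.queue_dict[key]
        queue.append(answer)
        return True
```

      With this discipline, a dump like the one above could only list in
      queue_dict the queues that still appear in message_table.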
  4. 15 Nov, 2018 3 commits
  5. 08 Nov, 2018 15 commits
    • Julien Muchembled's avatar
      7494de84
    • client: fix AssertionError when trying to reconnect too quickly after an error · 305dda86
      Julien Muchembled authored
      When ConnectionPool._initNodeConnection fails a first time with:
      
        StorageError: protocol error: already connected
      
      the following assertion failure happens when trying to reconnect before the
      previous connection is actually closed (currently, only the node sending an
      error message closes the connection, as commented in EventHandler):
      
        Traceback (most recent call last):
          File "neo/client/Storage.py", line 82, in load
            return self.app.load(oid)[:2]
          File "neo/client/app.py", line 367, in load
            data, tid, next_tid, _ = self._loadFromStorage(oid, tid, before_tid)
          File "neo/client/app.py", line 399, in _loadFromStorage
            askStorage)
          File "neo/client/app.py", line 293, in _askStorageForRead
            conn = cp.getConnForNode(node)
          File "neo/client/pool.py", line 98, in getConnForNode
            conn = self._initNodeConnection(node)
          File "neo/client/pool.py", line 48, in _initNodeConnection
            dispatcher=app.dispatcher)
          File "neo/lib/connection.py", line 704, in __init__
            super(MTClientConnection, self).__init__(*args, **kwargs)
          File "neo/lib/connection.py", line 602, in __init__
            node.setConnection(self)
          File "neo/lib/node.py", line 122, in setConnection
            attributeTracker.whoSet(self, '_connection'))
        AssertionError
    • qa: fix attributeTracker · 163858ed
      Julien Muchembled authored
    • client: discard late answers to lockless writes · 50e7fe52
      Julien Muchembled authored
      This fixes:
      
        Traceback (most recent call last):
          File "neo/client/Storage.py", line 108, in tpc_vote
            return self.app.tpc_vote(transaction)
          File "neo/client/app.py", line 546, in tpc_vote
            self.waitStoreResponses(txn_context)
          File "neo/client/app.py", line 539, in waitStoreResponses
            _waitAnyTransactionMessage(txn_context)
          File "neo/client/app.py", line 160, in _waitAnyTransactionMessage
            self._handleConflicts(txn_context)
          File "neo/client/app.py", line 514, in _handleConflicts
            self._store(txn_context, oid, serial, data)
          File "neo/client/app.py", line 452, in _store
            self._waitAnyTransactionMessage(txn_context, False)
          File "neo/client/app.py", line 155, in _waitAnyTransactionMessage
            self._waitAnyMessage(queue, block=block)
          File "neo/client/app.py", line 142, in _waitAnyMessage
            _handlePacket(conn, packet, kw)
          File "neo/lib/threaded_app.py", line 133, in _handlePacket
            handler.dispatch(conn, packet, kw)
          File "neo/lib/handler.py", line 72, in dispatch
            method(conn, *args, **kw)
          File "neo/client/handlers/storage.py", line 143, in answerRebaseObject
            assert cached == data
        AssertionError
    • storage: fix write-locking bug when a deadlock happens at the end of a replication · 7fff11f6
      Julien Muchembled authored
      During rebase, writes could stay lockless although the partition was
      replicated. Another transaction could then take locks prematurely, leading to
      the following crash:
      
        Traceback (most recent call last):
          File "neo/lib/handler.py", line 72, in dispatch
            method(conn, *args, **kw)
          File "neo/storage/handlers/master.py", line 36, in notifyUnlockInformation
            self.app.tm.unlock(ttid)
          File "neo/storage/transactions.py", line 329, in unlock
            self.abort(ttid, even_if_locked=True)
          File "neo/storage/transactions.py", line 573, in abort
            not self._replicated.get(self.getPartition(oid))), x
        AssertionError: ('\x00\x00\x00\x00\x00\x03\x03v', '\x03\xca\xb44J\x13\x99\x88', '\x03\xca\xb44J\xe0\xdcU', {}, set(['\x00\x00\x00\x00\x00\x03\x03v']))
    • client: log_flush most exceptions raised from Application to ZODB · efaae043
      Julien Muchembled authored
      Flushing logs will help fix NEO bugs (e.g. failed assertions).
    • client: fix assertion failure in case of conflict + storage disconnection · a746f812
      Julien Muchembled authored
      This fixes:
      
        Traceback (innermost last):
          ...
          Module transaction._transaction, line 393, in _commitResources
            rm.tpc_vote(self)
          Module ZODB.Connection, line 797, in tpc_vote
            s = vote(transaction)
          Module neo.client.Storage, line 95, in tpc_vote
            return self.app.tpc_vote(transaction)
          Module neo.client.app, line 546, in tpc_vote
            self.waitStoreResponses(txn_context)
          Module neo.client.app, line 539, in waitStoreResponses
            _waitAnyTransactionMessage(txn_context)
          Module neo.client.app, line 160, in _waitAnyTransactionMessage
            self._handleConflicts(txn_context)
          Module neo.client.app, line 471, in _handleConflicts
            assert oid is None, (oid, serial)
        AssertionError: ('\x00\x00\x00\x00\x00\x02\n\xe3', '\x03\xca\xad\xcb!\x92\xb6\x9c')
    • client: simplify connection management in transaction contexts · 2851a274
      Julien Muchembled authored
      With the previous commit, there is no longer any point in distinguishing
      storage nodes for which we only check serials.
    • client: also vote to nodes that only check serials · ab435b28
      Julien Muchembled authored
      Not doing so was an incorrect optimization. Checking serials does take
      write-locks and they must not be released when a client-storage connection
      breaks between vote and lock, otherwise a concurrent transaction modifying such
      serials may finish first.
    • qa: deindent code · d7245ee9
      Julien Muchembled authored
    • Bump protocol version · 9a5b46dd
      Julien Muchembled authored
  6. 07 Nov, 2018 8 commits
    • client: fix undetected disconnections to storage nodes during commit · d68e9053
      Julien Muchembled authored
      When a client-storage connection breaks, the storage node discards data of all
      ongoing transactions by the client. Therefore, a reconnection within the
      context of the transaction is wrong, as it could lead to partially-written
      transactions.
      
      This fixes cases where such reconnection happened. The biggest issue was that
      the mechanism to dispatch disconnection events only works when waiting for an
      answer.
      
      The client can still reconnect for other purposes but the new connection won't
      be reused by transactions that already involved the storage node.
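      The rule stated in the last paragraph can be sketched as follows. This
      is a minimal model, not NEO's connection pool: Conn, Pool, TxnContext,
      ConnectionLost and conn_for are hypothetical names. A transaction
      remembers the connection it first used for each storage node and
      refuses to continue on a newer one, while the pool stays free to
      reconnect for other purposes.

```python
class ConnectionLost(Exception):
    """The connection first used by the transaction is gone."""


class Conn(object):
    def __init__(self, node_id):
        self.node_id = node_id
        self.closed = False


class Pool(object):
    """Hands out one live connection per node, reconnecting when needed."""

    def __init__(self):
        self._conns = {}

    def get(self, node_id):
        conn = self._conns.get(node_id)
        if conn is None or conn.closed:
            conn = self._conns[node_id] = Conn(node_id)
        return conn


class TxnContext(object):
    def __init__(self, pool):
        self._pool = pool
        self._involved = {}  # node_id -> connection first used by this txn

    def conn_for(self, node_id):
        try:
            conn = self._involved[node_id]
        except KeyError:
            # First use of this node: any live connection will do.
            conn = self._involved[node_id] = self._pool.get(node_id)
        if conn.closed:
            # The storage node discarded our data: never switch to a
            # fresh connection within the same transaction.
            raise ConnectionLost(node_id)
        return conn
```

      What the transaction does after ConnectionLost (abort, or carry on
      with other storage nodes) is outside the scope of this sketch.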
    • Fix data corruption due to undetected conflicts after storage failures · 854a4920
      Julien Muchembled authored
      Without this new mechanism to detect oids that aren't write-locked,
      a transaction could be committed successfully without detecting conflicts.
      In the added test, the resulting value was 2, whereas it should be 5, as it
      is when no node fails.
    • master: notify replicating nodes of aborted watched transactions · 59698faa
      Julien Muchembled authored
      This fixes stuck replication when a client loses connection to the master
      during a commit.
    • client: fix race condition when a storage connection is closed just after identification · bf6569d6
      Julien Muchembled authored
      The consequence was that the client never reconnected to that storage node.
      On commits, writes to that node always failed, causing the master to
      disconnect it.
    • storage: relax assertion · 21a61977
      Julien Muchembled authored
      Nothing wrong actually happens.
      
      Traceback (most recent call last):
        File "neo/scripts/neostorage.py", line 32, in main
          app.run()
        File "neo/storage/app.py", line 194, in run
          self._run()
        File "neo/storage/app.py", line 225, in _run
          self.doOperation()
        File "neo/storage/app.py", line 310, in doOperation
          poll()
        File "neo/storage/app.py", line 134, in _poll
          self.em.poll(1)
        File "neo/lib/event.py", line 168, in poll
          self._poll(0)
        File "neo/lib/event.py", line 220, in _poll
          if conn.readable():
        File "neo/lib/connection.py", line 483, in readable
          self._closure()
        File "neo/lib/connection.py", line 541, in _closure
          self.close()
        File "neo/lib/connection.py", line 533, in close
          handler.connectionClosed(self)
        File "neo/storage/handlers/client.py", line 46, in connectionClosed
          app.tm.abortFor(conn.getUUID())
        File "neo/storage/transactions.py", line 594, in abortFor
          self.abort(ttid)
        File "neo/storage/transactions.py", line 570, in abort
          self._replicated.get(self.getPartition(oid))), x
      AssertionError: ('\x00\x00\x00\x00\x00\x01a\xe5', '\x03\xcaZ\x04\x14o\x8e\xbb', '\x03\xcaZ\x04\x0eX{\xbb', {1: None, 21: '\x03\xcaZ\x04\x11\xc6\x94\xf6'}, set([]))
    • comments, unused import · 1551c4a9
      Julien Muchembled authored
  7. 05 Nov, 2018 2 commits
  8. 06 Sep, 2018 2 commits
    • storage: fix assertion failure in case of connection reset with a client node · 652f1f0d
      Julien Muchembled authored
      Here is what happened after simulating a network failure between a client and
      a storage:
      
      C8
      
      DEBUG   recv failed for <SSLSocketConnectorIPv6 at 0x7f8198027f90 fileno 17 ('xxxx:xxxx:120:cd8::90a1', 53970), opened to ('xxxx:xxxx:60:4c2c::25c3', 39085)>: ECONNRESET (Connection reset by peer)
      DEBUG   connection closed for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, closed, client) at 7f81939a0950>
      DEBUG   connection started for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, fd=17, on_close=onConnectionClosed, connecting, client) at 7f8192eb17d0>
      PACKET  #0x0000 RequestIdentification          > S2 ([xxxx:xxxx:60:4c2c::25c3]:39085)        | (<EnumItem CLIENT (2)>, -536870904, None, '...', [], 1535555463.455761)
      DEBUG   SSL handshake done for <SSLSocketConnectorIPv6 at 0x7f8192eb1850 fileno 17 ('xxxx:xxxx:120:cd8::90a1', 54014), opened to ('xxxx:xxxx:60:4c2c::25c3', 39085)>: ECDHE-RSA-AES256-GCM-SHA384 256
      DEBUG   connection completed for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, fd=17, on_close=onConnectionClosed, client) at 7f8192eb17d0> (from xxxx:xxxx:120:cd8::90a1:54014)
      DEBUG   <SSLSocketConnectorIPv6 at 0x7f8192eb1850 fileno 17 ('xxxx:xxxx:120:cd8::90a1', 54014), opened to ('xxxx:xxxx:60:4c2c::25c3', 39085)> closed in recv
      DEBUG   connection closed for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, closed, client) at 7f8192eb17d0>
      ERROR   Connection to <StorageNode(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, state=RUNNING, connection=None, not identified) at 7f81a8874690> failed
      
      S2
      
      DEBUG   accepted a connection from xxxx:xxxx:120:cd8::90a1:54014
      DEBUG   SSL handshake done for <SSLSocketConnectorIPv6 at 0x7f657144a910 fileno 22 ('xxxx:xxxx:60:4c2c::25c3', 39085), opened from ('xxxx:xxxx:120:cd8::90a1', 54014)>: ECDHE-RSA-AES256-GCM-SHA384 256
      DEBUG   connection completed for <ServerConnection(uuid=None, address=[xxxx:xxxx:120:cd8::90a1]:54014, handler=IdentificationHandler, fd=22, server) at 7f657144a090> (from xxxx:xxxx:60:4c2c::25c3:39085)
      PACKET  #0x0000 RequestIdentification          < None ([xxxx:xxxx:120:cd8::90a1]:54014)         | (<EnumItem CLIENT (2)>, -536870904, None, '...', [], 1535555463.455761)
      DEBUG   connection closed for <ServerConnection(uuid=None, address=[xxxx:xxxx:120:cd8::90a1]:54014, handler=IdentificationHandler, closed, server) at 7f657144a090>
      WARNING A connection was lost during identification
      ERROR   Pre-mortem data:
      ERROR   Traceback (most recent call last):
      ERROR     File "neo/storage/app.py", line 194, in run
      ERROR       self._run()
      ERROR     File "neo/storage/app.py", line 225, in _run
      ERROR       self.doOperation()
      ERROR     File "neo/storage/app.py", line 310, in doOperation
      ERROR       poll()
      ERROR     File "neo/storage/app.py", line 134, in _poll
      ERROR       self.em.poll(1)
      ERROR     File "neo/lib/event.py", line 160, in poll
      ERROR       to_process.process()
      ERROR     File "neo/lib/connection.py", line 499, in process
      ERROR       self._handlers.handle(self, self._queue.pop(0))
      ERROR     File "neo/lib/connection.py", line 85, in handle
      ERROR       self._handle(connection, packet)
      ERROR     File "neo/lib/connection.py", line 100, in _handle
      ERROR       pending[0][1].packetReceived(connection, packet)
      ERROR     File "neo/lib/handler.py", line 123, in packetReceived
      ERROR       self.dispatch(*args)
      ERROR     File "neo/lib/handler.py", line 72, in dispatch
      ERROR       method(conn, *args, **kw)
      ERROR     File "neo/storage/handlers/identification.py", line 56, in requestIdentification
      ERROR       assert not node.isConnected(), node
      ERROR   AssertionError: <ClientNode(uuid=C8, state=RUNNING, connection=<ServerConnection(uuid=C8, address=[xxxx:xxxx:120:cd8::90a1]:53970, handler=ClientOperationHandler, fd=18, on_close=onConnectionClosed, server) at 7f657147d7d0>) at 7f65714d6cd0>