Commits · fa77a5795f15de7c58a9d4304b2419ab0301d918 · nexedi / neoppod

13 Mar, 2019 2 commits

qa: add testrunner options to dump/check the format of network packets · fa77a579

Julien Muchembled authored Jan 02, 2019

With the switch to msgpack, there was no schema anymore, although it had
sometimes been used for both automatic conversion (e.g. the last argument of
AskStoreTransaction must now be explicitly cast to list) and type checking.

This somewhat reintroduces a kind of schema that:
- is used by the test suite for type checking
- can be generated automatically from the test suite
  when one changes the protocol

fa77a579
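
As an illustration of a schema that can be both generated from sample packets and used for type checking, here is a minimal sketch. The schema format and the names `infer` and `check` are hypothetical and are not taken from NEO; packet arguments are modelled as already-decoded Python values.

```python
def infer(value):
    """Derive a schema from a sample value (hypothetical schema format)."""
    if isinstance(value, (list, tuple)):
        # Lists and tuples are kept distinct, so passing a tuple where a
        # list was recorded is reported as a mismatch.
        return (type(value).__name__, [infer(item) for item in value])
    return type(value).__name__


def check(schema, value):
    """Return True if `value` has exactly the shape recorded in `schema`."""
    return infer(value) == schema


# Record the schema once from a sample, then check later packets against it.
schema = infer((b'\x00' * 8, b'user', [1, 2]))
```

With this sketch, `check(schema, (b'\x01' * 8, b'other', [3, 4]))` succeeds, whereas the same arguments with `(3, 4)` as the last item are rejected, which is the kind of missing cast to list mentioned above.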

protocol: switch to msgpack for packet serialization · 50fb7793

Julien Muchembled authored May 07, 2018

Not only for performance reasons (the gain is significant in the case of
replication; tools/matrix is ~3% faster) but also because of several ugly
things in the way packets were defined:
- packet field names, which are only documentary; for root fields,
  they even just duplicate the packet names
- a lot of repetitions for packet names, and even confusion between the name
  of the packet definition and the name of the actual notify/request packet
- the need to implement field types for anything, like PByte to support new
  compression formats, since PBoolean is not enough

neo/lib/protocol.py is now much smaller.

50fb7793

11 Mar, 2019 3 commits
- Release version 1.11 · 48d936cb
  Julien Muchembled authored Mar 11, 2019
  
  48d936cb
- Fix short descriptions of neoctl & neomigrate in their headers · af2e209b
  Julien Muchembled authored Mar 11, 2019
  
  af2e209b
- Update copyright year · 342168cd
  Julien Muchembled authored Mar 11, 2019
  
  342168cd
26 Feb, 2019 2 commits

qa: new tool to stress-test NEO · 38e98a12

Julien Muchembled authored Oct 18, 2018

Example output:

    stress: yes (toggle with F1)
    cluster state: RUNNING
    last oid: 0x44c0
    last tid: 0x3cdee272ef19355 (2019-02-26 15:35:11.002419)
    clients: 2308, 2311, 2302, 2173, 2226, 2215, 2306, 2255, 2314, 2356 (+48)
            8m53.988s (42.633861/s)
    pt id: 4107
        RRRDDRRR
     0: OU......
     1: ..UO....
     2: ....OU..
     3: ......UU
     4: OU......
     5: ..UO....
     6: ....OU..
     7: ......UU
     8: OU......
     9: ..UO....
    10: ....OU..
    11: ......UU
    12: OU......
    13: ..UO....
    14: ....OU..
    15: ......UU
    16: OU......
    17: ..UO....
    18: ....OU..
    19: ......UU
    20: OU......
    21: ..UO....
    22: ....OU..
    23: ......UU

38e98a12
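
The `last tid` line of the example output pairs a 64-bit transaction id with a date. Assuming the TID uses the ZODB TimeStamp layout (upper 32 bits: minutes counted with 12-month years and 31-day months since 1900; lower 32 bits: fraction of a minute), the date can be recovered as in the following sketch. The function name `tid_to_datetime` is made up for illustration.

```python
from datetime import datetime, timedelta


def tid_to_datetime(tid):
    """Decode a 64-bit TID, assuming the ZODB TimeStamp layout."""
    minutes, fraction = divmod(tid, 1 << 32)
    # The upper half packs year/month/day/hour/minute in mixed radix.
    minutes, minute = divmod(minutes, 60)
    days, hour = divmod(minutes, 24)
    months, day = divmod(days, 31)
    year, month = divmod(months, 12)
    # The lower half is the position within the minute, scaled to 2**32.
    seconds = fraction * 60.0 / (1 << 32)
    return (datetime(year + 1900, month + 1, day + 1, hour, minute)
            + timedelta(seconds=seconds))


print(tid_to_datetime(0x3cdee272ef19355))
```

Applied to the TID shown in the example output, this yields the date printed next to it.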

master: fix typo in comment · ce25e429
Julien Muchembled authored Oct 18, 2018

ce25e429

25 Feb, 2019 1 commit
- Fix error handling when setting up a listening connector · ce608653
  Julien Muchembled authored Feb 25, 2019
```
getAddress (via __repr__) raised EBADF on closed connectors.
```
  ce608653
31 Dec, 2018 7 commits
- Fix incomplete/incorrect mapping of node ids in logs · 1a070186
  Julien Muchembled authored Oct 18, 2018
```
In functional tests (or anything reusing this framework),
the mapping could be incorrect at the beginning of logs.
```
  1a070186
- Fix log corruption on rotation in multi-threaded applications (e.g. client) · 16fdb24d
  Julien Muchembled authored Dec 31, 2018
```
Corrupted logs cause neolog to fail with the following error:

  AttributeError: 'Log' object has no attribute 'uuid_str'
```
  16fdb24d
- sqlite: optimize storage of metadata · 243c1a0f
  Julien Muchembled authored Dec 31, 2018
```
This makes commit 3c7a3160
(storage: speed up reads by indexing 'obj' primarily by 'oid')
effective for SQLite.

The fake changes in test data are because we don't force upgrade
for this optimization.
```
  243c1a0f
- neolog: do not die when a table is corrupted · 49e7d17f
  Julien Muchembled authored Dec 20, 2018
  
  49e7d17f
- neolog: add support for zstd-compressed logs · ad379295
  Julien Muchembled authored Dec 23, 2018
  
  ad379295
- neolog: do not hardcode default value of -L option in help message · 4a96c8b6
  Julien Muchembled authored Dec 07, 2018
  
  4a96c8b6
- fixup! New log format to show node id (and optionally cluster name) in node column · af53946c
  Julien Muchembled authored Dec 23, 2018
```
Commit aa4d621d broke log rotation
and neolog sometimes failed to read logs in the new format.
```
  af53946c
05 Dec, 2018 1 commit
- New log format to show node id (and optionally cluster name) in node column · aa4d621d
  Julien Muchembled authored Nov 25, 2018
```
neolog has new options: -N for old behaviour, and -C to show the cluster name.
```
  aa4d621d
21 Nov, 2018 4 commits

fixup! client: discard late answers to lockless writes · 8ef1ddba
Julien Muchembled authored Nov 09, 2018
```
Since commit 50e7fe52,
some code can be simplified.
```
8ef1ddba

client: fix race condition between Storage.load() and invalidations · a2e278d5

Julien Muchembled authored Nov 19, 2018

This fixes a bug that could manifest as follows:

  Traceback (most recent call last):
    File "neo/client/app.py", line 432, in load
      self._cache.store(oid, data, tid, next_tid)
    File "neo/client/cache.py", line 223, in store
      assert item.tid == tid, (item, tid)
  AssertionError: (<CacheItem oid='\x00\x00\x00\x00\x00\x00\x00\x01' tid='\x03\xcb\xc6\xca\xfd\xc7\xda\xee' next_tid='\x03\xcb\xc6\xca\xfd\xd8\t\x88' data='...' counter=1 level=1 expire=10000 prev=<...> next=<...>>, '\x03\xcb\xc6\xca\xfd\xd8\t\x88')

The big changes in the threaded test framework are required because we need to
reproduce a race condition between client threads and this conflicts with the
serialization of epoll events (deadlock).

a2e278d5

client: fix race condition in refcounting dispatched answer packets · 743026d5

Julien Muchembled authored Nov 16, 2018

This was found when stress-testing a big cluster. 1 client node was stuck:

  (Pdb) pp app.dispatcher.__dict__
  {'lock_acquire': <built-in method acquire of thread.lock object at 0x7f788c6e4250>,
  'lock_release': <built-in method release of thread.lock object at 0x7f788c6e4250>,
  'message_table': {140155667614608: {},
                    140155668875280: {},
                    140155671145872: {},
                    140155672381008: {},
                    140155672381136: {},
                    140155672381456: {},
                    140155673002448: {},
                    140155673449680: {},
                    140155676093648: {170: <neo.lib.locking.SimpleQueue object at 0x7f788a109c58>},
                    140155677536464: {},
                    140155679224336: {},
                    140155679876496: {},
                    140155680702992: {},
                    140155681851920: {},
                    140155681852624: {},
                    140155682773584: {},
                    140155685988880: {},
                    140155693061328: {},
                    140155693062224: {},
                    140155693074960: {},
                    140155696334736: {278: <neo.lib.locking.SimpleQueue object at 0x7f788a109c58>},
                    140155696411408: {},
                    140155696414160: {},
                    140155696576208: {},
                    140155722373904: {}},
  'queue_dict': {140155673622936: 1, 140155689147480: 2}}

140155673622936 should not be in queue_dict

743026d5
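
The dump suggests an invariant: each key of `queue_dict` should count how many entries of `message_table` point to that queue. The following is a simplified, hypothetical model of such refcounting, not NEO's actual dispatcher and not the actual fix; it only shows both tables being updated under one lock, plus a helper (`is_consistent`, an invented name) that detects a stale entry like the one above.

```python
import threading
from collections import deque


class Dispatcher(object):
    """Simplified model of answer dispatching with per-queue refcounts."""

    def __init__(self):
        self._lock = threading.Lock()
        self.message_table = {}  # id(conn) -> {msg_id: queue}
        self.queue_dict = {}     # id(queue) -> number of expected answers

    def register(self, conn, msg_id, queue):
        """Expect an answer to msg_id on conn, to be delivered to queue."""
        with self._lock:
            self.message_table.setdefault(id(conn), {})[msg_id] = queue
            key = id(queue)
            self.queue_dict[key] = self.queue_dict.get(key, 0) + 1

    def dispatch(self, conn, msg_id, packet):
        """Deliver an answer; return False if nobody is waiting for it."""
        with self._lock:
            queue = self.message_table.get(id(conn), {}).pop(msg_id, None)
            if queue is None:
                return False
            # Decrement in the same critical section as the pop, so the
            # two tables never disagree from another thread's viewpoint.
            key = id(queue)
            count = self.queue_dict[key] - 1
            if count:
                self.queue_dict[key] = count
            else:
                del self.queue_dict[key]
        queue.append((conn, packet))
        return True

    def is_consistent(self):
        """Recompute refcounts from message_table and compare."""
        with self._lock:
            expected = {}
            for table in self.message_table.values():
                for queue in table.values():
                    key = id(queue)
                    expected[key] = expected.get(key, 0) + 1
            return expected == self.queue_dict
```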

More RTMIN+2 (log) information for clients and connections · 7e456329
Julien Muchembled authored Nov 14, 2018

7e456329

15 Nov, 2018 3 commits
- storage: check for conflicts when notifying that a partition is replicated · d66b4f24
  Julien Muchembled authored Nov 06, 2018
  
  d66b4f24
- storage: clarify several assertions · f25b8ee3
  Julien Muchembled authored Nov 07, 2018
  
  f25b8ee3
- qa: new expectedFailure testcase method · 4150ffb1
  Julien Muchembled authored Nov 07, 2018
```
The idea is to write:

  with self.expectedFailure(...): \

just before the statement that is expected to fail. Contrary to the existing
decorator, we want to:
- be sure that the test fails at the expected line;
- be able to remove an expectedFailure without touching the code around.
```
  4150ffb1
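
A context manager of this kind could be sketched as follows. This is not NEO's implementation: the mixin name is invented, and because `unittest` has no public way to report an expected failure from inside a test body, the sketch reports it through `skipTest` as a stand-in.

```python
import unittest
from contextlib import contextmanager


class ExpectedFailureMixin(object):
    """Hypothetical mixin providing a statement-level expectedFailure."""

    @contextmanager
    def expectedFailure(self, exception=AssertionError):
        try:
            yield
        except exception as e:
            # The wrapped statement failed as expected; stop the test here.
            # (Stand-in: reported as a skip, not as an expected failure.)
            self.skipTest("expected failure: %r" % (e,))
        else:
            # The wrapped statement passed: the marker should be removed.
            self.fail("unexpected success")


class Example(ExpectedFailureMixin, unittest.TestCase):

    def test_known_bug(self):
        value = 1 + 1  # anything outside the block must still succeed
        with self.expectedFailure():
            self.assertEqual(value, 3)  # the only statement allowed to fail
```

A failure before the `with` block is reported as a normal failure, and removing the marker only means deleting the `with` line and dedenting one statement.
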
08 Nov, 2018 15 commits

client: merge ConnectionPool inside Application · 7494de84

Julien Muchembled authored Oct 17, 2018

7494de84

client: prepare merge of ConnectionPool inside Application · 693aaf79

Julien Muchembled authored Nov 08, 2018

693aaf79

client: fix AssertionError when trying to reconnect too quickly after an error · 305dda86

Julien Muchembled authored Oct 17, 2018

When ConnectionPool._initNodeConnection fails a first time with:

  StorageError: protocol error: already connected

the following assertion failure happens when trying to reconnect before the
previous connection is actually closed (currently, only the node sending an
error message closes the connection, as commented in EventHandler):

  Traceback (most recent call last):
    File "neo/client/Storage.py", line 82, in load
      return self.app.load(oid)[:2]
    File "neo/client/app.py", line 367, in load
      data, tid, next_tid, _ = self._loadFromStorage(oid, tid, before_tid)
    File "neo/client/app.py", line 399, in _loadFromStorage
      askStorage)
    File "neo/client/app.py", line 293, in _askStorageForRead
      conn = cp.getConnForNode(node)
    File "neo/client/pool.py", line 98, in getConnForNode
      conn = self._initNodeConnection(node)
    File "neo/client/pool.py", line 48, in _initNodeConnection
      dispatcher=app.dispatcher)
    File "neo/lib/connection.py", line 704, in __init__
      super(MTClientConnection, self).__init__(*args, **kwargs)
    File "neo/lib/connection.py", line 602, in __init__
      node.setConnection(self)
    File "neo/lib/node.py", line 122, in setConnection
      attributeTracker.whoSet(self, '_connection'))
  AssertionError

305dda86

qa: fix attributeTracker · 163858ed

Julien Muchembled authored Oct 17, 2018

163858ed

storage: fix storage leak when an oid is stored several times within a transaction · fa14157b

Julien Muchembled authored Oct 15, 2018

fa14157b

client: discard late answers to lockless writes · 50e7fe52

Julien Muchembled authored Oct 09, 2018

This fixes:

  Traceback (most recent call last):
    File "neo/client/Storage.py", line 108, in tpc_vote
      return self.app.tpc_vote(transaction)
    File "neo/client/app.py", line 546, in tpc_vote
      self.waitStoreResponses(txn_context)
    File "neo/client/app.py", line 539, in waitStoreResponses
      _waitAnyTransactionMessage(txn_context)
    File "neo/client/app.py", line 160, in _waitAnyTransactionMessage
      self._handleConflicts(txn_context)
    File "neo/client/app.py", line 514, in _handleConflicts
      self._store(txn_context, oid, serial, data)
    File "neo/client/app.py", line 452, in _store
      self._waitAnyTransactionMessage(txn_context, False)
    File "neo/client/app.py", line 155, in _waitAnyTransactionMessage
      self._waitAnyMessage(queue, block=block)
    File "neo/client/app.py", line 142, in _waitAnyMessage
      _handlePacket(conn, packet, kw)
    File "neo/lib/threaded_app.py", line 133, in _handlePacket
      handler.dispatch(conn, packet, kw)
    File "neo/lib/handler.py", line 72, in dispatch
      method(conn, *args, **kw)
    File "neo/client/handlers/storage.py", line 143, in answerRebaseObject
      assert cached == data
  AssertionError

50e7fe52

qa: in threaded tests, log queued packets when "tic is looping forever" · 82672031

Julien Muchembled authored Oct 15, 2018

82672031

In logs, dump the partition table in a more compact and readable way · 323fd636

Julien Muchembled authored Oct 05, 2018

323fd636

storage: fix write-locking bug when a deadlock happens at the end of a replication · 7fff11f6

Julien Muchembled authored Oct 05, 2018

During rebase, writes could stay lockless although the partition was
replicated. Another transaction could then take locks prematurely, leading to
the following crash:

  Traceback (most recent call last):
    File "neo/lib/handler.py", line 72, in dispatch
      method(conn, *args, **kw)
    File "neo/storage/handlers/master.py", line 36, in notifyUnlockInformation
      self.app.tm.unlock(ttid)
    File "neo/storage/transactions.py", line 329, in unlock
      self.abort(ttid, even_if_locked=True)
    File "neo/storage/transactions.py", line 573, in abort
      not self._replicated.get(self.getPartition(oid))), x
  AssertionError: ('\x00\x00\x00\x00\x00\x03\x03v', '\x03\xca\xb44J\x13\x99\x88', '\x03\xca\xb44J\xe0\xdcU', {}, set(['\x00\x00\x00\x00\x00\x03\x03v']))

7fff11f6

client: log_flush most exceptions raised from Application to ZODB · efaae043
Julien Muchembled authored Oct 03, 2018
```
Flushing logs will help fix NEO bugs (e.g. failed assertions).
```
efaae043

client: fix assertion failure in case of conflict + storage disconnection · a746f812

Julien Muchembled authored Oct 02, 2018

This fixes:

  Traceback (innermost last):
    ...
    Module transaction._transaction, line 393, in _commitResources
      rm.tpc_vote(self)
    Module ZODB.Connection, line 797, in tpc_vote
      s = vote(transaction)
    Module neo.client.Storage, line 95, in tpc_vote
      return self.app.tpc_vote(transaction)
    Module neo.client.app, line 546, in tpc_vote
      self.waitStoreResponses(txn_context)
    Module neo.client.app, line 539, in waitStoreResponses
      _waitAnyTransactionMessage(txn_context)
    Module neo.client.app, line 160, in _waitAnyTransactionMessage
      self._handleConflicts(txn_context)
    Module neo.client.app, line 471, in _handleConflicts
      assert oid is None, (oid, serial)
  AssertionError: ('\x00\x00\x00\x00\x00\x02\n\xe3', '\x03\xca\xad\xcb!\x92\xb6\x9c')

a746f812

client: simplify connection management in transaction contexts · 2851a274
Julien Muchembled authored Oct 01, 2018
```
With the previous commit, there is no longer any point in distinguishing storage
nodes for which we only check serials.
```
2851a274

client: also vote to nodes that only check serials · ab435b28

Julien Muchembled authored Oct 01, 2018

Not doing so was an incorrect optimization. Checking serials does take
write-locks and they must not be released when a client-storage connection
breaks between vote and lock, otherwise a concurrent transaction modifying such
serials may finish first.

ab435b28

qa: deindent code · d7245ee9

Julien Muchembled authored Sep 28, 2018

d7245ee9

Bump protocol version · 9a5b46dd

Julien Muchembled authored Oct 02, 2018

9a5b46dd

07 Nov, 2018 2 commits

client: fix undetected disconnections to storage nodes during commit · d68e9053

Julien Muchembled authored Sep 18, 2018

When a client-storage connection breaks, the storage node discards data of all
ongoing transactions by the client. Therefore, a reconnection within the
context of the transaction is wrong, as it could lead to partially-written
transactions.

This fixes cases where such reconnection happened. The biggest issue was that
the mechanism to dispatch disconnection events only works when waiting for an
answer.

The client can still reconnect for other purposes but the new connection won't
be reused by transactions that already involved the storage node.

d68e9053
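
The rule described above (a transaction must never silently switch to a new connection to a storage node it has already used) can be sketched as follows. All class and method names here are hypothetical and much simpler than NEO's client code.

```python
class ConnectionLost(Exception):
    """Raised when a transaction's connection to a storage node broke."""


class Connection(object):
    """Stand-in for a client-storage connection."""

    def __init__(self, node):
        self.node = node
        self.closed = False


class TransactionContext(object):
    """Remembers which connection a transaction used for each node."""

    def __init__(self, connect):
        self._connect = connect   # callable: node -> new Connection
        self._conn_by_node = {}   # node -> connection first used

    def get_connection(self, node):
        conn = self._conn_by_node.get(node)
        if conn is None:
            # First use of this node in the transaction: connecting is fine.
            conn = self._conn_by_node[node] = self._connect(node)
        elif conn.closed:
            # The storage node discarded our data when the link broke, so
            # reconnecting here could commit a partially-written transaction.
            raise ConnectionLost(node)
        return conn
```

A separate `TransactionContext` (or a non-transactional read) can still open a fresh connection to the same node, which matches the last paragraph of the commit message.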

Fix data corruption due to undetected conflicts after storage failures · 854a4920

Julien Muchembled authored Sep 24, 2018

Without this new mechanism to detect oids that aren't write-locked,
a transaction could be committed successfully without detecting conflicts.
In the added test, the resulting value was 2, whereas it would have been 5 if
there had been no node failure.

854a4920