1. 09 May, 2019 1 commit
  2. 28 Apr, 2019 1 commit
    • protocol: switch to msgpack for packet serialization · 9d0bf97a
      Julien Muchembled authored
      Not only for performance reasons (at least 3% faster) but also because of
      several ugly things in the way packets were defined:
      - packet field names, which are only documentary; for root fields,
        they even just duplicate the packet names
      - a lot of repetitions for packet names, and even confusion between the name
        of the packet definition and the name of the actual notify/request packet
      - the need to implement a field type for everything, like PByte to support
        new compression formats, since PBoolean is not enough
      
      neo/lib/protocol.py is now much smaller.
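
      Below is a minimal sketch, not NEO's actual code, of why msgpack helps
      here: fields travel as a positional tuple, so no per-field type classes
      (PBoolean, PByte, ...) are needed to serialize them.

        import msgpack

        def encode_packet(msg_id, code, *args):
            # one call packs the whole packet; msgpack natively handles
            # ints, bytes, strings, lists and None
            return msgpack.packb((msg_id, code, args), use_bin_type=True)

        def decode_packet(data):
            # arrays come back as lists
            msg_id, code, args = msgpack.unpackb(data, raw=False)
            return msg_id, code, args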
  3. 27 Apr, 2019 2 commits
    • Make the number of replicas modifiable when the cluster is running · ef5fc508
      Julien Muchembled authored
      neoctl gets a new command to change the number of replicas.
      
      The number of replicas becomes a new partition table attribute and,
      like the PT id, it is stored in the config table. On the other hand,
      the configuration value for the number of partitions is dropped,
      since it can be computed from the partition table, which is
      always stored in full.
      
      The -p/-r master options now only apply at database creation.
      
      Some implementation notes:
      
      - The protocol is slightly optimized in that the master now
        automatically sends the whole partition table to the admin & client
        nodes upon connection, as it already does for storage nodes.
        This makes the protocol more consistent, and the master is the
        only remaining node that requests partition tables, during recovery.
      
      - Some parts become tricky because app.pt can be None in more cases.
        For example, this is the reason for the extra condition in
        NodeManager.update (before app.pt.dropNode), and for the
        'loadPartitionTable' method (storage), which is not inlined
        because of unit tests.
        Overall, this commit simplifies more than it complicates.
      
      - In the master handlers, we stop hijacking the 'connectionCompleted'
        method for tasks to be performed (often sending the full partition
        table) on handler switches.
      
      - The admin's 'bootstrapped' flag could have been removed earlier:
        race conditions can't happen since the AskNodeInformation packet
        was removed (commit d048a52d).
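
      A minimal sketch, with illustrative names only, of the resulting
      layout: the replica count is an attribute of the partition table
      (stored in the config table like the PT id), and the partition count
      is derived from the table instead of being configured.

        class PartitionTable(object):
            def __init__(self, ptid, num_replicas, rows):
                self.ptid = ptid
                self.num_replicas = num_replicas
                self.rows = rows  # one list of cells per partition

            @property
            def num_partitions(self):
                # the table is always stored in full, so the number of
                # partitions needs no separate configuration value
                return len(self.rows)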
    • New --new-nid storage option for fast cloning · 27e3f620
      Julien Muchembled authored
      It is often faster to set up replicas by stopping a node (and any
      underlying database server like MariaDB) and doing a raw copy of the
      database (e.g. with rsync). So far, this required stopping the whole
      cluster and using tools like 'mysql' or 'sqlite3' to edit:
      - the 'pt' table in databases,
      - the 'config.nid' values of the new nodes.
      
      With this new option, if you already have 1 replica, you can set up
      new replicas with such a fast raw copy, without any interruption of
      service. Obviously, this implies less redundancy during the operation.
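
      For illustration, here is the kind of manual edit this option makes
      unnecessary, assuming a SQLite backend and a config table of
      (name, value) rows as suggested by 'config.nid' above (the actual
      schema may differ):

        import sqlite3

        def set_new_nid(db_path, new_nid):
            # give the cloned database its own node id so that it can
            # join the cluster as a distinct storage node
            con = sqlite3.connect(db_path)
            with con:
                con.execute("UPDATE config SET value=? WHERE name='nid'",
                            (str(new_nid),))
            con.close()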
  4. 21 Mar, 2019 1 commit
  5. 11 Mar, 2019 1 commit
  6. 21 Nov, 2018 2 commits
    • client: fix race condition between Storage.load() and invalidations · a2e278d5
      Julien Muchembled authored
      This fixes a bug that could manifest as follows:
      
        Traceback (most recent call last):
          File "neo/client/app.py", line 432, in load
            self._cache.store(oid, data, tid, next_tid)
          File "neo/client/cache.py", line 223, in store
            assert item.tid == tid, (item, tid)
        AssertionError: (<CacheItem oid='\x00\x00\x00\x00\x00\x00\x00\x01' tid='\x03\xcb\xc6\xca\xfd\xc7\xda\xee' next_tid='\x03\xcb\xc6\xca\xfd\xd8\t\x88' data='...' counter=1 level=1 expire=10000 prev=<...> next=<...>>, '\x03\xcb\xc6\xca\xfd\xd8\t\x88')
      
      The big changes in the threaded test framework are required because we
      need to reproduce a race condition between client threads, and this
      conflicts with the serialization of epoll events (deadlock).
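
      A minimal sketch, not NEO's cache, of the invariant behind the failing
      assertion: an oid already cached under one tid must not be stored again
      under another, which is what happens when a load computed before an
      invalidation is cached after it.

        class Cache(object):
            def __init__(self):
                self._items = {}  # oid -> (data, tid, next_tid)

            def store(self, oid, data, tid, next_tid):
                item = self._items.get(oid)
                if item is not None:
                    # a different tid here means a stale load raced with
                    # an invalidation
                    assert item[1] == tid, (item, tid)
                self._items[oid] = (data, tid, next_tid)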
  7. 08 Nov, 2018 4 commits
    • client: discard late answers to lockless writes · 50e7fe52
      Julien Muchembled authored
      This fixes:
      
        Traceback (most recent call last):
          File "neo/client/Storage.py", line 108, in tpc_vote
            return self.app.tpc_vote(transaction)
          File "neo/client/app.py", line 546, in tpc_vote
            self.waitStoreResponses(txn_context)
          File "neo/client/app.py", line 539, in waitStoreResponses
            _waitAnyTransactionMessage(txn_context)
          File "neo/client/app.py", line 160, in _waitAnyTransactionMessage
            self._handleConflicts(txn_context)
          File "neo/client/app.py", line 514, in _handleConflicts
            self._store(txn_context, oid, serial, data)
          File "neo/client/app.py", line 452, in _store
            self._waitAnyTransactionMessage(txn_context, False)
          File "neo/client/app.py", line 155, in _waitAnyTransactionMessage
            self._waitAnyMessage(queue, block=block)
          File "neo/client/app.py", line 142, in _waitAnyMessage
            _handlePacket(conn, packet, kw)
          File "neo/lib/threaded_app.py", line 133, in _handlePacket
            handler.dispatch(conn, packet, kw)
          File "neo/lib/handler.py", line 72, in dispatch
            method(conn, *args, **kw)
          File "neo/client/handlers/storage.py", line 143, in answerRebaseObject
            assert cached == data
        AssertionError
    • client: simplify connection management in transaction contexts · 2851a274
      Julien Muchembled authored
      With the previous commit, there's no point anymore in distinguishing
      storage nodes for which we only check serials.
    • client: also vote to nodes that only check serials · ab435b28
      Julien Muchembled authored
      Not doing so was an incorrect optimization. Checking serials does take
      write-locks, and they must not be released when a client-storage
      connection breaks between vote and lock, otherwise a concurrent
      transaction modifying such serials may finish first.
  8. 07 Nov, 2018 2 commits
    • client: fix undetected disconnections to storage nodes during commit · d68e9053
      Julien Muchembled authored
      When a client-storage connection breaks, the storage node discards the
      data of all ongoing transactions of that client. Therefore, a
      reconnection within the context of a transaction is wrong, as it could
      lead to partially-written transactions.
      
      This fixes the cases where such a reconnection happened. The biggest
      issue was that the mechanism dispatching disconnection events only
      works when waiting for an answer.
      
      The client can still reconnect for other purposes but the new connection won't
      be reused by transactions that already involved the storage node.
    • Fix data corruption due to undetected conflicts after storage failures · 854a4920
      Julien Muchembled authored
      Without this new mechanism to detect oids that aren't write-locked,
      a transaction could be committed successfully without detecting conflicts.
      In the added test, the resulting value was 2, whereas it should have
      been 5 had there been no node failure.
  9. 05 Nov, 2018 1 commit
  10. 30 Jul, 2018 1 commit
  11. 22 Jun, 2018 1 commit
    • Maximize resiliency by taking into account the topology of storage nodes · 97af23cc
      Julien Muchembled authored
      This commit adds a constraint when tweaking the partition table with
      replicas, so that the cells of each partition are assigned as far as
      possible from each other, e.g. not on the same machine even if each
      one has several disks, and in any case not on the same storage device.
      
      Currently, the topology path of each node is automatically calculated by the
      storage backend. Both MySQL and SQLite return a 2-tuple (host, st_dev).
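
      A minimal sketch of how such a 2-tuple can be obtained (names are
      illustrative; the real backends compute it internally):

        import os, socket

        def topology_path(db_path):
            # st_dev identifies the device holding the database file;
            # together with the hostname it approximates
            # "same machine, same disk"
            return socket.gethostname(), os.stat(db_path).st_dev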
      To be improved:
      - Add a storage option to override the path: the 'tweak' algorithm can already
        handle topology paths of any length, so something like (room, machine, disk)
        could be done easily.
      - Write OS-specific code to determine the real hardware behind st_dev
        (e.g. 2 different 'st_dev' values may actually refer to the same disk,
         because of layers like partitioning, device-mapper, loop, btrfs subvolumes,
         and so on).
      - Make 'neoctl' report in some way if the PT is optimal. Meanwhile,
        if it isn't, the master only logs a WARNING during tweak.
  12. 16 May, 2018 2 commits
    • Serialize empty transaction extension with an empty string · a6d4c4e9
      Julien Muchembled authored
      The protocol version is increased to ensure that client nodes are able to
      handle an empty 'extension' field in AnswerTransactionInformation.
      
      It also means that once new transactions are written, going back to a previous
      revision is not possible.
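
      A minimal sketch of the encoding change (illustrative, using the
      standard pickle module): an empty extension is sent as an empty string
      instead of a pickled empty dictionary.

        from pickle import dumps, loads

        def dump_ext(ext):
            return dumps(ext, 1) if ext else ''

        def load_ext(data):
            return loads(data) if data else {}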
    • client: fix partial import from a source storage · 346c9d00
      Julien Muchembled authored
      The correct way to specify a start/stop tid is when constructing the
      'source' object, hence the removal of the start/stop args. In fact,
      source.iterator() does not always take such args.
      
      On the other hand, when resuming an import, Application.importFrom
      must cope with an incomplete preindex.
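
      A minimal sketch of the idea, with hypothetical names: the tid window
      is bound to the source object at construction time, so the importer
      can always call iterator() without arguments.

        class BoundedSource(object):
            def __init__(self, storage, start=None, stop=None):
                self._storage = storage
                self._start, self._stop = start, stop

            def iterator(self):
                # the underlying iterator is already restricted to the
                # configured window
                return self._storage.iterator(self._start, self._stop)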
  13. 16 Apr, 2018 1 commit
    • Fix a few issues with ZODB5 · 1316c225
      Julien Muchembled authored
      In the Importer storage backend, the repickler code never really worked
      with ZODB 5 (which uses pickle protocols > 1), and now the test does
      not pass anymore.
      
      The other issues caused by ZODB commit 12ee41c47310156027a674932df34b60de86ba36
      are fixed:
      
        TypeError: list indices must be integers, not binary
      
        ValueError: unsupported pickle protocol: 3
      
      Although not necessary as long as we don't support Python 3,
      this commit also replaces `str` by `bytes` in a few places.
  14. 13 Apr, 2018 1 commit
  15. 12 Apr, 2018 1 commit
  16. 21 Dec, 2017 1 commit
  17. 21 Nov, 2017 1 commit
    • client: bug found, add log to collect more information · a1082cbc
      Julien Muchembled authored
      INFO Z2 Log files reopened successfully
      INFO SignalHandler Caught signal SIGTERM
      INFO Z2 Shutting down fast
      INFO ZServer closing HTTP to new connections
      ERROR ZODB.Connection Couldn't load state for BTrees.LOBTree.LOBucket 0xc12e29
      Traceback (most recent call last):
        File "ZODB/Connection.py", line 909, in setstate
          self._setstate(obj, oid)
        File "ZODB/Connection.py", line 953, in _setstate
          p, serial = self._storage.load(oid, '')
        File "neo/client/Storage.py", line 81, in load
          return self.app.load(oid)[:2]
        File "neo/client/app.py", line 355, in load
          data, tid, next_tid, _ = self._loadFromStorage(oid, tid, before_tid)
        File "neo/client/app.py", line 387, in _loadFromStorage
          askStorage)
        File "neo/client/app.py", line 297, in _askStorageForRead
          self.sync()
        File "neo/client/app.py", line 898, in sync
          self._askPrimary(Packets.Ping())
        File "neo/client/app.py", line 163, in _askPrimary
          return self._ask(self._getMasterConnection(), packet,
        File "neo/client/app.py", line 177, in _getMasterConnection
          result = self.master_conn = self._connectToPrimaryNode()
        File "neo/client/app.py", line 202, in _connectToPrimaryNode
          index = (index + 1) % len(master_list)
      ZeroDivisionError: integer division or modulo by zero
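
      The failure reduces to cycling through an empty master list, as in this
      stripped-down reproduction (illustrative, not the NEO code):

        master_list = []  # no master node known at this point
        index = 0
        index = (index + 1) % len(master_list)  # raises ZeroDivisionError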
  18. 19 Nov, 2017 1 commit
  19. 28 Apr, 2017 1 commit
    • client: fix possible data corruption after conflict resolutions with replicas · 46c36465
      Julien Muchembled authored
      This really fixes the bug described in
      commit 40bac312, which could probably
      be reverted: it only reduced the probability of failure.
      
      What happened is that the second conflict on 'a' for t3 was first
      reported by an answer to the first store with:
      - a base serial at which a=0
      - a conflict serial at which a=7
      However, the cached data is not 8 anymore but 12, since a second store already
      occurred after the first conflict (reported by the other storage node).
      
      When this conflict was resolved before receiving the conflict for the
      second store, it gave:
      
        resolve(old=0, saved=7, new=12) -> 19
      
      instead of:
      
        resolve(old=4, saved=7, new=12) -> 15
      
      (if we still had the data of the first store, we could also do
        resolve(old=0, saved=7, new=8)
       but that would be inefficient from a memory point of view)
      
      The bug was difficult to reproduce. testNotifyReplicated had to be run
      many times before race conditions triggered it. The test was changed to
      enforce some of them, and the above scenario now happens almost always.
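
      The arithmetic above corresponds to counter-style conflict resolution,
      as in this minimal sketch:

        def resolve(old, saved, new):
            # apply the local delta (new - old) on top of the saved value
            return saved + (new - old)

        assert resolve(0, 7, 12) == 19  # wrong base: corrupted result
        assert resolve(4, 7, 12) == 15  # correct base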
  20. 24 Apr, 2017 1 commit
    • Reimplement election (of the primary master) · 23b6a66a
      Julien Muchembled authored
      The election is no longer a separate process. It happens during the
      RECOVERING phase, and timeouts are not used anymore.
      
      Each master node keeps a timestamp of when it started to play the primary role,
      and the node with the smallest timestamp is elected. The election stops when
      the cluster is started: as long as it is operational, the primary master can't
      be deposed.
      
      An election must happen whenever the cluster is not operational anymore,
      to handle the case of a network cut between the primary master and all
      other nodes: another (secondary) master node then takes over, and when
      the initial primary master is back, it loses against the new primary
      master if the cluster is already started.
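
      A minimal sketch of the election rule (illustrative names): among the
      master nodes claiming the primary role, the one that started playing it
      first wins.

        def elect(candidates):
            # candidates: iterable of (timestamp, node) pairs, where the
            # timestamp is when the node started playing the primary role
            return min(candidates)[1]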
  21. 31 Mar, 2017 2 commits
  22. 23 Mar, 2017 1 commit
  23. 17 Mar, 2017 1 commit
  24. 27 Feb, 2017 1 commit
    • Fix oids remaining write-locked forever · 9b33b1db
      Julien Muchembled authored
      This happened in 2 cases:
      - Commit a4c06242 ("Review aborting of
        transactions") introduced a race condition causing oids to remain
        write-locked forever after the transaction modifying them is aborted.
      - An unfinished transaction is not locked/unlocked during tpc_finish: oids
        must be unlocked when being notified that the transaction is finished.
  25. 24 Feb, 2017 1 commit
  26. 21 Feb, 2017 3 commits
  27. 14 Feb, 2017 4 commits