1. 27 Apr, 2019 2 commits
    • Julien Muchembled's avatar
      Make the number of replicas modifiable when the cluster is running · ef5fc508
      Julien Muchembled authored
      neoctl gets a new command to change the number of replicas.
      The number of replicas becomes a new partition table attribute and
      like the PT id, it is stored in the config table. On the other side,
      the configuration value for the number of partitions is dropped,
      since it can be computed from the partition table, which is
      always stored in full.
      The -p/-r master options now only apply at database creation.
      Some implementation notes:
      - The protocol is slightly optimized in that the master now sends
        automatically the whole partition tables to the admin & client
        nodes upon connection, like for storage nodes.
        This makes the protocol more consistent, and the master is the
        only remaining node requesting partition tables, during recovery.
      - Some parts become tricky because app.pt can be None in more cases.
        For example, the extra condition in NodeManager.update
        (before app.pt.dropNode) was added for this is the reason.
        Or the 'loadPartitionTable' method (storage) that is not inlined
        because of unit tests.
        Overall, this commit simplifies more than it complicates.
      - In the master handlers, we stop hijacking the 'connectionCompleted'
        method for tasks to be performed (often send the full partition
        table) on handler switches.
      - The admin's 'bootstrapped' flag could have been removed earlier:
        race conditions can't happen since the AskNodeInformation packet
        was removed (commit d048a52d).
    • Julien Muchembled's avatar
      New --new-nid storage option for fast cloning · 27e3f620
      Julien Muchembled authored
      It is often faster to set up replicas by stopping a node (and any
      underlying database server like MariaDB) and do a raw copy of the
      database (e.g. with rsync). So far, it required to stop the whole
      cluster and use tools like 'mysql' or sqlite3' to edit:
      - the 'pt' table in databases,
      - the 'config.nid' values of the new nodes.
      With this new option, if you already have 1 replica, you can set up
      new replicas with such fast raw copy, and without interruption of
      service. Obviously, this implies less redundancy during the operation.
  2. 11 Mar, 2019 1 commit
  3. 07 Nov, 2018 1 commit
  4. 05 Nov, 2018 1 commit
  5. 06 Sep, 2018 1 commit
    • Julien Muchembled's avatar
      storage: fix assertion failure in case of connection reset with a client node · 652f1f0d
      Julien Muchembled authored
      Here is what happened after simulating a network failure between a client and
      a storage:
      DEBUG   recv failed for <SSLSocketConnectorIPv6 at 0x7f8198027f90 fileno 17 ('xxxx:xxxx:120:cd8::90a1', 53970), opened to ('xxxx:xxxx:60:4c2c::25c3', 39085)>: ECONNRESET (Connection reset by peer)
      DEBUG   connection closed for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, closed, client) at 7f81939a0950>
      DEBUG   connection started for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, fd=17, on_close=onConnectionClosed, connecting, client) at 7f8192eb17d0>
      PACKET  #0x0000 RequestIdentification          > S2 ([xxxx:xxxx:60:4c2c::25c3]:39085)        | (<EnumItem CLIENT (2)>, -536870904, None, '...', [], 1535555463.455761)
      DEBUG   SSL handshake done for <SSLSocketConnectorIPv6 at 0x7f8192eb1850 fileno 17 ('xxxx:xxxx:120:cd8::90a1', 54014), opened to ('xxxx:xxxx:60:4c2c::25c3', 39085)>: ECDHE-RSA-AES256-GCM-SHA384 256
      DEBUG   connection completed for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, fd=17, on_close=onConnectionClosed, client) at 7f8192eb17d0> (from xxxx:xxxx:120:cd8::90a1:54014)
      DEBUG   <SSLSocketConnectorIPv6 at 0x7f8192eb1850 fileno 17 ('xxxx:xxxx:120:cd8::90a1', 54014), opened to ('xxxx:xxxx:60:4c2c::25c3', 39085)> closed in recv
      DEBUG   connection closed for <MTClientConnection(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, handler=StorageEventHandler, closed, client) at 7f8192eb17d0>
      ERROR   Connection to <StorageNode(uuid=S2, address=[xxxx:xxxx:60:4c2c::25c3]:39085, state=RUNNING, connection=None, not identified) at 7f81a8874690> failed
      DEBUG   accepted a connection from xxxx:xxxx:120:cd8::90a1:54014
      DEBUG   SSL handshake done for <SSLSocketConnectorIPv6 at 0x7f657144a910 fileno 22 ('xxxx:xxxx:60:4c2c::25c3', 39085), opened from ('xxxx:xxxx:120:cd8::90a1', 54014)>: ECDHE-RSA-AES256-GCM-SHA384 256
      DEBUG   connection completed for <ServerConnection(uuid=None, address=[xxxx:xxxx:120:cd8::90a1]:54014, handler=IdentificationHandler, fd=22, server) at 7f657144a090> (from xxxx:xxxx:60:4c2c::25c3:39085)
      PACKET  #0x0000 RequestIdentification          < None ([xxxx:xxxx:120:cd8::90a1]:54014)         | (<EnumItem CLIENT (2)>, -536870904, None, '...', [], 1535555463.455761)
      DEBUG   connection closed for <ServerConnection(uuid=None, address=[xxxx:xxxx:120:cd8::90a1]:54014, handler=IdentificationHandler, closed, server) at 7f657144a090>
      WARNING A connection was lost during identification
      ERROR   Pre-mortem data:
      ERROR   Traceback (most recent call last):
      ERROR     File "neo/storage/app.py", line 194, in run
      ERROR       self._run()
      ERROR     File "neo/storage/app.py", line 225, in _run
      ERROR       self.doOperation()
      ERROR     File "neo/storage/app.py", line 310, in doOperation
      ERROR       poll()
      ERROR     File "neo/storage/app.py", line 134, in _poll
      ERROR       self.em.poll(1)
      ERROR     File "neo/lib/event.py", line 160, in poll
      ERROR       to_process.process()
      ERROR     File "neo/lib/connection.py", line 499, in process
      ERROR       self._handlers.handle(self, self._queue.pop(0))
      ERROR     File "neo/lib/connection.py", line 85, in handle
      ERROR       self._handle(connection, packet)
      ERROR     File "neo/lib/connection.py", line 100, in _handle
      ERROR       pending[0][1].packetReceived(connection, packet)
      ERROR     File "neo/lib/handler.py", line 123, in packetReceived
      ERROR       self.dispatch(*args)
      ERROR     File "neo/lib/handler.py", line 72, in dispatch
      ERROR       method(conn, *args, **kw)
      ERROR     File "neo/storage/handlers/identification.py", line 56, in requestIdentification
      ERROR       assert not node.isConnected(), node
      ERROR   AssertionError: <ClientNode(uuid=C8, state=RUNNING, connection=<ServerConnection(uuid=C8, address=[xxxx:xxxx:120:cd8::90a1]:53970, handler=ClientOperationHandler, fd=18, on_close=onConnectionClosed, server) at 7f657147d7d0>) at 7f65714d6cd0>
  6. 22 Jun, 2018 2 commits
    • Julien Muchembled's avatar
      Maximize resiliency by taking into account the topology of storage nodes · 97af23cc
      Julien Muchembled authored
      This commit adds a contraint when tweaking the partition table with replicas,
      so that cells of each partition are assigned as far as possible from each
      other, e.g. not on the same machine even if each one has several disks, and
      in any case not on the same storage device.
      Currently, the topology path of each node is automatically calculated by the
      storage backend. Both MySQL and SQLite return a 2-tuple (host, st_dev).
      To be improved:
      - Add a storage option to override the path: the 'tweak' algorithm can already
        handle topology paths of any length, so something like (room, machine, disk)
        could be done easily.
      - Write OS-specific code to determine the real hardware behind st_dev
        (e.g. 2 different 'st_dev' values may actually refer to the same disk,
         because of layers like partitioning, device-mapper, loop, btrfs subvolumes,
         and so on).
      - Make 'neoctl' report in some way if the PT is optimal. Meanwhile,
        if it isn't, the master only logs a WARNING during tweak.
    • Julien Muchembled's avatar
      storage: also commit updated cell TID at each replicated chunk of 'obj' records · d4ea398d
      Julien Muchembled authored
      This is a follow-up of commit b3dd6973
      ("Optimize resumption of replication by starting from a greater TID").
      I missed the case where a storage node is restarted while it is replicating:
      it lost the TID where it was interrupted.
      Although we commit after each replicated chunk, to avoid transferring again
      all the data from the beginning, it could still waste time to check that
      the data are already replicated.
  7. 21 Jun, 2018 1 commit
  8. 19 Jun, 2018 1 commit
  9. 30 May, 2018 2 commits
    • Julien Muchembled's avatar
      protocol: update packet docstrings · 9f0f2afe
      Julien Muchembled authored
      /reviewed-on !9
    • Julien Muchembled's avatar
      Optimize resumption of replication by starting from a greater TID · b3dd6973
      Julien Muchembled authored
      Although data that are already transferred aren't transferred again, checking
      that the data are there for a whole partition can still be a lot of work for
      big databases. This commit is a major performance improvement in that a storage
      node that gets disconnected for a short time now gets fully operational quite
      instantaneously because it only has to replicate the new data. Before, the time
      to recover depended on the size of the DB.
      For OUT_OF_DATE cells, the difficult part was that they are writable and
      can then contain holes, so we can't just take the last TID in trans/obj
      (we wrongly did that at the beginning, and then committed
      6b1f198f as a workaround). We solve that
      by storing up to where it was up-to-date: this value is initialized from
      the last TIDs in trans/obj when the state switches from UP_TO_DATE/FEEDING.
      There's actually one such OUT_OF_DATE TID per assigned cell (backends store
      these values in the 'pt' table). Otherwise, a cell that still has a lot to
      replicate would still cause all other cells to resume from the a very small
      TID, or even ZERO_TID; the worse case is when a new cell is assigned to a node
      (as a result of tweak).
      For UP_TO_DATE cells of a backup cluster, replication was resumed from the
      maximum TID at which all assigned cells are known to be fully replicated.
      Like for OUT_OF_DATE cells, the presence of a late cell could cause a lot of
      extra work for others, the worst case being when setting up a backup cluster
      (it always restarted from ZERO_TID as long as at least 1 cell was still empty).
      Because UP_TO_DATE cells are guaranteed to have no holes, there's no need to
      store extra information: we simply look at the last TIDs in trans/obj.
      We even handle trans & obj independently, to minimize the work in 1 table
      (i.e. trans since it's processed first) if the other is late (obj).
      There's a small change in the protocol so that OUT_OF_DATE enum value equals 0.
      This way, backends can store the OUT_OF_DATE TID (verbatim) in the same column
      as the cell state.
      Note about MySQL changes in commit ca58ccd7:
      what we did as a workaround is not one any more. Now, we do so much on Python
      side that it's unlikely we could reduce the number of queries using GROUP BY.
      We even stopped doing that for SQLite.
  10. 14 Mar, 2018 1 commit
    • Julien Muchembled's avatar
      storage: fix replication of creation undone · c3343279
      Julien Muchembled authored
      For records that undo object creation, None values are used at the backend
      level whereas the protocol is not designed to serialize None for any field.
      Therefore, a dance done in many places around packet serialization, using the
      specific 0/ZERO_HASH/'' triplet to represent a deleted oid. For replication,
      it was missing at the sender side, leading to the following crash:
        Traceback (most recent call last):
          File "neo/storage/app.py", line 147, in run
          File "neo/storage/app.py", line 178, in _run
          File "neo/storage/app.py", line 257, in doOperation
            next(task_queue[-1]) or task_queue.rotate()
          File "neo/storage/handlers/storage.py", line 271, in push
            conn.send(Packets.AddObject(oid, *object), msg_id)
          File "neo/lib/protocol.py", line 234, in __init__
            self._fmt.encode(buf.write, args)
          File "neo/lib/protocol.py", line 345, in encode
            return self._trace(self._encode, writer, items)
          File "neo/lib/protocol.py", line 334, in _trace
            return method(*args)
          File "neo/lib/protocol.py", line 367, in _encode
            item.encode(writer, value)
          File "neo/lib/protocol.py", line 345, in encode
            return self._trace(self._encode, writer, items)
          File "neo/lib/protocol.py", line 342, in _trace
            raise ParseError(self, trace)
        ParseError: at add_object/checksum:
          File "neo/lib/protocol.py", line 553, in _encode
            assert len(checksum) == 20, (len(checksum), checksum)
        TypeError: object of type 'NoneType' has no len()
  11. 08 Jan, 2018 1 commit
    • Julien Muchembled's avatar
      storage: optimize storage layout of raw data for replication · f4dd4bab
      Julien Muchembled authored
      # Previous status
      The issue was that we had extreme storage fragmentation from the point of view
      of the replication algorithm, which processes one partition at a time.
      By using an autoincrement for the 'data' table, rows were ordered by the time
      at which they were added:
      - parts may be the result of replication -> ordered by partition, tid, oid
      - other rows are globally sorted by tid
      Which means that when scanning a given partition, many rows were skipped all
      the time:
      - if readahead is bigger enough, the efficiency is 1/N for a node with N
        partitions assigned
      - else, it is worse because it seeks all the time
      For huge databases, the replication was horribly slow, in particular from HDD.
      # Chosen solution
      This commit changes how ids are generated to somehow split 'data'
      per partition. The backend tracks 1 last id per assigned partition, where the
      16 higher bits contains the partition. Keep in mind that the value of id has no
      meaning and it's only chosen for performance reasons. IOW, a row can be
      referred by an oid of a partition different than the 16 higher bits of id:
      - there's no migration needed and the 16 higher bits of all existing rows are 0
      - in case of deduplication, a row can still be shared by different partitions
      Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement
      on existing databases.
      ## Downsides
      On insertion, increasing the number of partitions now slows down significantly:
      for 2 nodes using TokuDB, 4% for 180 partitions, 40% for 2000. For 12
      partitions, the difference remains negligible. The solution for this issue will
      be to enable to increase the number of partitions efficiently, so that nodes
      can keep a small number of them, even for DB that are expected to grow so much
      that many nodes are added over time: such feature was already considered so
      that users don't have to worry anymore about this obscure setting at database
      Read performance is only slowed down for applications that read a lot of data
      that were written contiguously, but split in small blocks. A solution is to
      extend ZODB so that the application tells it to chose new oids that will end up
      in the same partition. Like for insertion, there should not be too many
      With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to
      collect all last ids at startup when there are many partitions.
      ## Other advantages
      - The storage layout of data is now always the same and does not depend on
        whether rows came from replication or commits.
      - Efficient deletion of partition to free space in-place will be possible.
      # Considered alternative
      The only serious alternative was to replicate as many partitions as possible at
      the same time, ideally all assigned partitions, but it's not always possible.
      For best performance, it would often require to synchronize new nodes, or even
      all of them, so that thesource nodes don't have to scan 'data' several times.
      If existing nodes are kept, all data that aren't copied to the newly added
      nodes have to be skipped. If the number of nodes is multiplied by N, the
      efficiency is 1-1/N at best (synchronized nodes), else it's even worse
      because partitions are somehow shuffled.
      Checking/replacing a single node would remain slow when there are several
      source nodes.
      At last, such an algorithm would be much more complex and we would not have the
      other advantages listed above.
  12. 13 Dec, 2017 1 commit
  13. 05 Dec, 2017 2 commits
  14. 12 Jun, 2017 1 commit
  15. 12 May, 2017 1 commit
    • Julien Muchembled's avatar
      Remove packet timeouts · f6eb02b4
      Julien Muchembled authored
      Since it's not worth anymore to keep track of the last connection activity
      (which, btw, ignored TCP ACKs, i.e. timeouts could theorically be triggered
      before all the data were actually sent), the semantics of closeClient has also
      changed. Before this commit, the 1-minute timeout was reset whenever there was
      activity (connection still used as server). Now, it happens exactly 100 seconds
      after the connection is not used anymore as client.
  16. 25 Apr, 2017 2 commits
  17. 24 Apr, 2017 3 commits
    • Julien Muchembled's avatar
      Reimplement election (of the primary master) · 23b6a66a
      Julien Muchembled authored
      The election is not a separate process anymore.
      It happens during the RECOVERING phase, and there's no use of timeouts anymore.
      Each master node keeps a timestamp of when it started to play the primary role,
      and the node with the smallest timestamp is elected. The election stops when
      the cluster is started: as long as it is operational, the primary master can't
      be deposed.
      An election must happen whenever the cluster is not operational anymore, to
      handle the case of a network cut between a primary master and all other nodes:
      then another master node (secondary) takes over and when the initial primary
      master is back, it loses against the new primary master if the cluster is
      already started.
    • Julien Muchembled's avatar
      Remove BROKEN node state · 9d7f9795
      Julien Muchembled authored
    • Julien Muchembled's avatar
      Remove HIDDEN node state · b8210d58
      Julien Muchembled authored
  18. 31 Mar, 2017 2 commits
    • Julien Muchembled's avatar
      storage: fix commit activity when cells are discarded or when they become readable · 34d797e2
      Julien Muchembled authored
      This is a follow up of commit 64afd7d2,
      which focused on read accesses when there is no transaction activity.
      This commit also includes a test to check a simpler scenario that the one
      described in the previous commit.
    • Julien Muchembled's avatar
      master: make sure that storage nodes have an up-to-date PT/NM when they're added · 7ffc96fd
      Julien Muchembled authored
      This revert commit bddc1802,
      to fix the following storage crash:
        Traceback (most recent call last):
          File "neo/lib/handler.py", line 72, in dispatch
            method(conn, *args, **kw)
          File "neo/storage/handlers/master.py", line 44, in notifyPartitionChanges
            app.pt.update(ptid, cell_list, app.nm)
          File "neo/lib/pt.py", line 231, in update
            assert node is not None, 'No node found for uuid ' + uuid_str(uuid)
        AssertionError: No node found for uuid S3
      Partitition table updates must also be processed with InitializationHandler
      when nodes remain in PENDING state because they're not added to the cluster.
  19. 23 Mar, 2017 4 commits
    • Julien Muchembled's avatar
      storage: in deadlock avoidance, fix performance issue that could freeze the cluster · 1280f73e
      Julien Muchembled authored
      In the worst case, with many clients trying to lock the same oids,
      the cluster could enter in an infinite cascade of deadlocks.
      Here is an overview with 3 storage nodes and 3 transactions:
       S1     S2     S3     order of locking tids          # abbreviations:
       l1     l1     l2     123                            #  l: lock
       q23    q23    d1q3   231                            #  d: deadlock triggered
       r1:l3  r1:l2  (r1)   # for S3, we still have l2     #  q: queued
       d2q1   q13    q13    312                            #  r: rebase
      Above, we show what happens when a random transaction gets a lock just after
      that another is rebased. Here, the result is that the last 2 lines are a
      permutation of the first 2, and this can repeat indefinitely with bad luck.
      This commit reduces the probability of deadlock by processing delayed
      stores/checks in the order of their locking tid. In the above example,
      S1 would give the lock to 2 when 1 is rebased, and 2 would vote successfully.
    • Julien Muchembled's avatar
      storage: discard answers from aborted replications · ad43dcd3
      Julien Muchembled authored
      This fixes a bug that could to data corruption or crashes.
    • Julien Muchembled's avatar
      Use Connection.send instead of answer when a packet id must be reused · 4222ac8a
      Julien Muchembled authored
      It becomes possible to answer with several packets:
      - the last is the usual associated answer packet
      - all other (previously sent) packets are notifications
      Connection.send does not return the packet id anymore. This is not useful
      enough, and the caller can inspect the sent packet (getId).
    • Julien Muchembled's avatar
  20. 18 Mar, 2017 1 commit
    • Julien Muchembled's avatar
      master: fix crash when a transaction begins while a storage node starts operation · 781b4eb5
      Julien Muchembled authored
      Traceback (most recent call last):
        File "neo/lib/handler.py", line 72, in dispatch
          method(conn, *args, **kw)
        File "neo/master/handlers/client.py", line 70, in askFinishTransaction
        File "neo/master/transactions.py", line 387, in prepare
          assert node_list, (ready, failed)
      AssertionError: (set([]), frozenset([]))
      Master log leading to the crash:
        PACKET    #0x0009 StartOperation                 > S1
        PACKET    #0x0004 BeginTransaction               < C1
        DEBUG     Begin <...>
        PACKET    #0x0004 AnswerBeginTransaction         > C1
        PACKET    #0x0001 NotifyReady                    < S1
      It was wrong to process BeginTransaction before receiving NotifyReady.
      The changes in the storage are cosmetics: the 'ready' attribute has become
      redundant with 'operational'.
  21. 02 Mar, 2017 1 commit
    • Julien Muchembled's avatar
      storage: fix PT updates in case of late AnswerUnfinishedTransactions · a74937c8
      Julien Muchembled authored
      This is done by moving
      after the switch to MasterOperationHandler, so that the latter is not delayed.
      This change comes with some refactoring of the main loop,
      to clean up app.checker and app.replicator properly (like app.tm).
      Another option could have been to process notifications with the last handler,
      instead of the first one. But if possible, cleaning up the whole code to not
      delay handlers anymore looks the best option.
  22. 27 Feb, 2017 1 commit
    • Julien Muchembled's avatar
      Fix oids remaining write-locked forever · 9b33b1db
      Julien Muchembled authored
      This happened in 2 cases:
      - Commit a4c06242 ("Review aborting of
        transactions") introduced a race condition causing oids to remain
        write-locked forever after that the transaction modifying them is aborted.
      - An unfinished transaction is not locked/unlocked during tpc_finish: oids
        must be unlocked when being notified that the transaction is finished.
  23. 21 Feb, 2017 3 commits
    • Julien Muchembled's avatar
      Implement deadlock avoidance · 092992db
      Julien Muchembled authored
      This is a first version with several optimizations possible:
      - improve EventQueue (or implement a specific queue) to minimize deadlocks
      - turn the RebaseObject packet into a notification
      Sorting oids could also be useful to reduce the probability of deadlocks,
      but that would never be enough to avoid them completely, even if there's a
      single storage. For example:
      1. C1 does a first store (x or y)
      2. C2 stores x and y; one is delayed
      3. C1 stores the other -> deadlock
         When solving the deadlock, the data of the first store may only
         exist on the storage.
      2 functional tests are removed because they're redundant,
      either with ZODB tests or with the new threaded tests.
    • Julien Muchembled's avatar
      Fixes/improvements to EventQueue · cc8d0a7c
      Julien Muchembled authored
      - Make sure that errors while processing a delayed packet are reported to the
        connection that sent this packet.
      - Provide a mechanism to process events for the same connection in
        chronological order.
    • Julien Muchembled's avatar
  24. 14 Feb, 2017 3 commits
  25. 02 Feb, 2017 1 commit