1. 27 Apr, 2019 1 commit
    • Julien Muchembled's avatar
      New --new-nid storage option for fast cloning · 27e3f620
      Julien Muchembled authored
      It is often faster to set up replicas by stopping a node (and any
      underlying database server like MariaDB) and do a raw copy of the
      database (e.g. with rsync). So far, it required to stop the whole
      cluster and use tools like 'mysql' or sqlite3' to edit:
      - the 'pt' table in databases,
      - the 'config.nid' values of the new nodes.
      With this new option, if you already have 1 replica, you can set up
      new replicas with such fast raw copy, and without interruption of
      service. Obviously, this implies less redundancy during the operation.
  2. 11 Mar, 2019 1 commit
  3. 22 Jun, 2018 2 commits
    • Julien Muchembled's avatar
      Maximize resiliency by taking into account the topology of storage nodes · 97af23cc
      Julien Muchembled authored
      This commit adds a contraint when tweaking the partition table with replicas,
      so that cells of each partition are assigned as far as possible from each
      other, e.g. not on the same machine even if each one has several disks, and
      in any case not on the same storage device.
      Currently, the topology path of each node is automatically calculated by the
      storage backend. Both MySQL and SQLite return a 2-tuple (host, st_dev).
      To be improved:
      - Add a storage option to override the path: the 'tweak' algorithm can already
        handle topology paths of any length, so something like (room, machine, disk)
        could be done easily.
      - Write OS-specific code to determine the real hardware behind st_dev
        (e.g. 2 different 'st_dev' values may actually refer to the same disk,
         because of layers like partitioning, device-mapper, loop, btrfs subvolumes,
         and so on).
      - Make 'neoctl' report in some way if the PT is optimal. Meanwhile,
        if it isn't, the master only logs a WARNING during tweak.
    • Julien Muchembled's avatar
      storage: also commit updated cell TID at each replicated chunk of 'obj' records · d4ea398d
      Julien Muchembled authored
      This is a follow-up of commit b3dd6973
      ("Optimize resumption of replication by starting from a greater TID").
      I missed the case where a storage node is restarted while it is replicating:
      it lost the TID where it was interrupted.
      Although we commit after each replicated chunk, to avoid transferring again
      all the data from the beginning, it could still waste time to check that
      the data are already replicated.
  4. 30 May, 2018 1 commit
    • Julien Muchembled's avatar
      Optimize resumption of replication by starting from a greater TID · b3dd6973
      Julien Muchembled authored
      Although data that are already transferred aren't transferred again, checking
      that the data are there for a whole partition can still be a lot of work for
      big databases. This commit is a major performance improvement in that a storage
      node that gets disconnected for a short time now gets fully operational quite
      instantaneously because it only has to replicate the new data. Before, the time
      to recover depended on the size of the DB.
      For OUT_OF_DATE cells, the difficult part was that they are writable and
      can then contain holes, so we can't just take the last TID in trans/obj
      (we wrongly did that at the beginning, and then committed
      6b1f198f as a workaround). We solve that
      by storing up to where it was up-to-date: this value is initialized from
      the last TIDs in trans/obj when the state switches from UP_TO_DATE/FEEDING.
      There's actually one such OUT_OF_DATE TID per assigned cell (backends store
      these values in the 'pt' table). Otherwise, a cell that still has a lot to
      replicate would still cause all other cells to resume from the a very small
      TID, or even ZERO_TID; the worse case is when a new cell is assigned to a node
      (as a result of tweak).
      For UP_TO_DATE cells of a backup cluster, replication was resumed from the
      maximum TID at which all assigned cells are known to be fully replicated.
      Like for OUT_OF_DATE cells, the presence of a late cell could cause a lot of
      extra work for others, the worst case being when setting up a backup cluster
      (it always restarted from ZERO_TID as long as at least 1 cell was still empty).
      Because UP_TO_DATE cells are guaranteed to have no holes, there's no need to
      store extra information: we simply look at the last TIDs in trans/obj.
      We even handle trans & obj independently, to minimize the work in 1 table
      (i.e. trans since it's processed first) if the other is late (obj).
      There's a small change in the protocol so that OUT_OF_DATE enum value equals 0.
      This way, backends can store the OUT_OF_DATE TID (verbatim) in the same column
      as the cell state.
      Note about MySQL changes in commit ca58ccd7:
      what we did as a workaround is not one any more. Now, we do so much on Python
      side that it's unlikely we could reduce the number of queries using GROUP BY.
      We even stopped doing that for SQLite.
  5. 12 May, 2017 1 commit
    • Julien Muchembled's avatar
      Remove packet timeouts · f6eb02b4
      Julien Muchembled authored
      Since it's not worth anymore to keep track of the last connection activity
      (which, btw, ignored TCP ACKs, i.e. timeouts could theorically be triggered
      before all the data were actually sent), the semantics of closeClient has also
      changed. Before this commit, the 1-minute timeout was reset whenever there was
      activity (connection still used as server). Now, it happens exactly 100 seconds
      after the connection is not used anymore as client.
  6. 31 Mar, 2017 2 commits
  7. 23 Mar, 2017 1 commit
  8. 18 Mar, 2017 1 commit
    • Julien Muchembled's avatar
      master: fix crash when a transaction begins while a storage node starts operation · 781b4eb5
      Julien Muchembled authored
      Traceback (most recent call last):
        File "neo/lib/handler.py", line 72, in dispatch
          method(conn, *args, **kw)
        File "neo/master/handlers/client.py", line 70, in askFinishTransaction
        File "neo/master/transactions.py", line 387, in prepare
          assert node_list, (ready, failed)
      AssertionError: (set([]), frozenset([]))
      Master log leading to the crash:
        PACKET    #0x0009 StartOperation                 > S1
        PACKET    #0x0004 BeginTransaction               < C1
        DEBUG     Begin <...>
        PACKET    #0x0004 AnswerBeginTransaction         > C1
        PACKET    #0x0001 NotifyReady                    < S1
      It was wrong to process BeginTransaction before receiving NotifyReady.
      The changes in the storage are cosmetics: the 'ready' attribute has become
      redundant with 'operational'.
  9. 02 Mar, 2017 1 commit
    • Julien Muchembled's avatar
      storage: fix PT updates in case of late AnswerUnfinishedTransactions · a74937c8
      Julien Muchembled authored
      This is done by moving
      after the switch to MasterOperationHandler, so that the latter is not delayed.
      This change comes with some refactoring of the main loop,
      to clean up app.checker and app.replicator properly (like app.tm).
      Another option could have been to process notifications with the last handler,
      instead of the first one. But if possible, cleaning up the whole code to not
      delay handlers anymore looks the best option.
  10. 27 Feb, 2017 1 commit
    • Julien Muchembled's avatar
      storage: fix bug not replicating unfinished transactions when the last ones are aborted · 7f754b5e
      Julien Muchembled authored
      This was found by the first assertion of answerRebaseObject (client) because
      a storage node missed a few transactions and reported a conflict with an older
      serial than the one being stored: this must never happen and this commit adds a
      more generic assertion on the storage side.
      The above case is when the "first phase" of replication of a partition
      (all history up to the tid before unfinished transactions) ended after
      that the unfinished transactions are finished: this was a corruption bug,
      where UP_TO_DATE cells could miss data.
      Otherwise, if the "first phase" ended before, then the partition remained stuck
      in OUT_OF_DATE state. Restarting the storage node was enough to recover.
  11. 24 Feb, 2017 1 commit
    • Julien Muchembled's avatar
      storage: fix an AssertionError in internal replication · 560e4fb1
      Julien Muchembled authored
      Traceback (most recent call last):
        File "neo/storage/handlers/storage.py", line 111, in answerFetchObjects
        File "neo/storage/replicator.py", line 370, in finish
        File "neo/storage/replicator.py", line 279, in _nextPartition
          assert app.pt.getCell(offset, app.uuid).isOutOfDate()
      The scenario is:
      1. partition A: start of replication, with unfinished transactions
      2. partition A: all unfinished transactions are finished
      3. partition A: end of replication with ReplicationDone notification
      4. replication of partition B
      5. partition A: AssertionError when starting replication
      The bug is that in 3, the partition A is partially replicated and the storage
      node must not notify the master.
  12. 14 Feb, 2017 3 commits
  13. 18 Jan, 2017 1 commit
  14. 22 Dec, 2016 1 commit
  15. 21 Dec, 2016 1 commit
    • Julien Muchembled's avatar
      storage: start replicating the partition which is furthest behind · 4d3f3723
      Julien Muchembled authored
      This fixes the following case when the backup is far behing the upstream DB,
      and there are transactions being committed at the same time:
      1. replicate partition 0
      2. replicate partition 0
      3. replicate partition 1
      4. replicate partition 0
      5. replicate partition 1
      6. replicate partition 2
      7. replicate partition 0
      and so on in a quadratic way.
      When the upstream activity was too high, the backup could even be stuck looping
      on the first partitions.
  16. 27 Nov, 2016 2 commits
    • Julien Muchembled's avatar
      Fix identification issues, including a race condition causing id conflicts · 9385706f
      Julien Muchembled authored
      The added test describes how the new id timestamps fix the race condition.
      These timestamps could be any unique opaque values, and the protocol is
      extended to exchange them along with node ids.
      Internally, nodes also reuse timestamps as a marker to identify the first
      NotifyNodeInformation packets from the master: since this packet is a complete
      list of nodes in the cluster, any other node in the node manager has left the
      cluster definitely and is removed.
      The secondary masters didn't receive update about master nodes.
      It's also useless to send them information about non-master nodes.
    • Julien Muchembled's avatar
      Fix spelling mistakes · 6e32ebb7
      Julien Muchembled authored
  17. 20 Apr, 2016 1 commit
    • Julien Muchembled's avatar
      storage: fix crash when trying to replicate from an unreachable node · 8b07ff98
      Julien Muchembled authored
      This fixes the following issue:
      WARNING replication aborted for partition 1
      DEBUG   connection started for <ClientConnection(uuid=None, address=...:43776, handler=StorageOperationHandler, fd=10, on_close=onConnectionClosed, connecting, client) at 7f5d2067fdd0>
      DEBUG   connect failed for <SocketConnectorIPv6 at 0x7f5d2067fe10 fileno 10 ('::', 0), opened to ('...', 43776)>: ENETUNREACH (Network is unreachable)
      WARNING replication aborted for partition 5
      DEBUG   connection started for <ClientConnection(uuid=None, address=...:43776, handler=StorageOperationHandler, fd=10, on_close=onConnectionClosed, connecting, client) at 7f5d1c409510>
      PACKET  #0x0000 RequestIdentification          > None (...:43776)  | (<EnumItem STORAGE (1)>, None, ('...', 60533), '...')
      ERROR   Pre-mortem data:
      ERROR   Traceback (most recent call last):
      ERROR     File "neo/storage/app.py", line 157, in run
      ERROR       self._run()
      ERROR     File "neo/storage/app.py", line 197, in _run
      ERROR       self.doOperation()
      ERROR     File "neo/storage/app.py", line 285, in doOperation
      ERROR       poll()
      ERROR     File "neo/storage/app.py", line 95, in _poll
      ERROR       self.em.poll(1)
      ERROR     File "neo/lib/event.py", line 121, in poll
      ERROR       self._poll(blocking)
      ERROR     File "neo/lib/event.py", line 165, in _poll
      ERROR       if conn.readable():
      ERROR     File "neo/lib/connection.py", line 481, in readable
      ERROR       self._closure()
      ERROR     File "neo/lib/connection.py", line 539, in _closure
      ERROR       self.close()
      ERROR     File "neo/lib/connection.py", line 531, in close
      ERROR       handler.connectionClosed(self)
      ERROR     File "neo/lib/handler.py", line 135, in connectionClosed
      ERROR       self.connectionLost(conn, NodeStates.TEMPORARILY_DOWN)
      ERROR     File "neo/storage/handlers/storage.py", line 59, in connectionLost
      ERROR       replicator.abort()
      ERROR     File "neo/storage/replicator.py", line 339, in abort
      ERROR       self._nextPartition()
      ERROR     File "neo/storage/replicator.py", line 260, in _nextPartition
      ERROR       None if name else app.uuid, app.server, name or app.name))
      ERROR     File "neo/lib/connection.py", line 562, in ask
      ERROR       raise ConnectionClosed
      ERROR   ConnectionClosed
  18. 25 Jan, 2016 1 commit
  19. 01 Dec, 2015 1 commit
    • Julien Muchembled's avatar
      Safer DB truncation, new 'truncate' ctl command · d3c8b76d
      Julien Muchembled authored
      With the previous commit, the request to truncate the DB was not stored
      persistently, which means that this operation was still vulnerable to the case
      where the master is restarted after some nodes, but not all, have already
      truncated. The master didn't have the information to fix this and the result
      was a DB partially truncated.
      -> On a Truncate packet, a storage node only stores the tid somewhere, to send
         it back to the master, which stays in RECOVERING state as long as any node
         has a different value than that of the node with the latest partition table.
      We also want to make sure that there is no unfinished data, because a user may
      truncate at a tid higher than a locked one.
      -> Truncation is now effective at the end on the VERIFYING phase, just before
         returning the last ids to the master.
      At last all nodes should be truncated, to avoid that an offline node comes back
      with a different history. Currently, this would not be an issue since
      replication is always restart from the beginning, but later we'd like they
      remember where they stopped to replicate.
      -> If a truncation is requested, the master waits for all nodes to be pending,
         even if it was previously started (the user can still force the cluster to
         start with neoctl). And any lost node during verification also causes the
         master to go back to recovery.
      Obviously, the protocol has been changed to split the LastIDs packet and
      introduce a new Recovery, since it does not make sense anymore to ask last ids
      during recovery.
  20. 24 Sep, 2015 1 commit
  21. 14 Aug, 2015 1 commit
    • Julien Muchembled's avatar
      Do not reconnect too quickly to a node after an error · d898a83d
      Julien Muchembled authored
      For example, a backup storage node that was rejected because the upstream
      cluster was not ready could reconnect in loop without delay, using 100% CPU
      and flooding logs.
      A new 'setReconnectionNoDelay' method on Connection can be used for cases where
      it's legitimate to quickly reconnect.
      With this new delayed reconnection, it's possible to remove the remaining
  22. 12 Aug, 2015 1 commit
  23. 21 May, 2015 1 commit
  24. 25 Jul, 2014 1 commit
  25. 07 Jan, 2014 1 commit
  26. 21 Aug, 2012 1 commit
  27. 20 Aug, 2012 3 commits
    • Vincent Pelletier's avatar
      Drop some unused imports. · 4fcd8ddc
      Vincent Pelletier authored
    • Julien Muchembled's avatar
      Comment about backup limitations · dd556379
      Julien Muchembled authored
    • Julien Muchembled's avatar
      More bugfixes to backup mode · 08742377
      Julien Muchembled authored
      - catch OperationFailure
      - reset transaction manager when leaving backup mode
      - send appropriate target tid to a storage that updates a outdated cell
      - clean up partition table when leaving BACKINGUP state unexpectedly
      - make sure all readable cells of a partition have the same 'backup_tid'
        if they have the same data, so that we know when internal replication is
        finished when leaving backup mode
      - fix storage not finished internal replication when leaving backup mode
  28. 14 Aug, 2012 1 commit
  29. 09 Aug, 2012 1 commit
    • Julien Muchembled's avatar
      Backup bugfixes · ad01f379
      Julien Muchembled authored
      - fix stopping backup cluster
      - fix leaving backup mode, including truncating to consistent TID
      - fix backup_tid on master and storages
  30. 26 Jul, 2012 1 commit
  31. 17 Jul, 2012 1 commit
    • Julien Muchembled's avatar
      storage: fix save or reset of 'backup_tid' config value · 1d8a0dbe
      Julien Muchembled authored
      Because masters don't have persistent storage, the task to remember whether
      the cluster is in backup mode or not is delegated to storages, via the presence
      of a 'backup_tid' config value.
      This fixes a bug that set 'backup_tid' after a simple replication.
      If the cluster was restarted, it would have tried to switch to BACKINGUP state.
  32. 20 Mar, 2012 1 commit
  33. 13 Mar, 2012 1 commit