1. 27 Apr, 2019 6 commits
    • master: reject drop/tweak ctl commands that could lead to unwanted status · 55a6dd0f
      Julien Muchembled authored
      The following 2 operations can be onerous and should not be
      directly usable without some kind of confirmation by the user:
      - Dropping a node now requires stopping it first.
      - Tweaking no longer automatically excludes DOWN nodes,
        because a node could go DOWN between the moment the user sends
        the command to tweak and the actual tweak by the master.
    • tweak: add option to simulate · 2a27239d
      Julien Muchembled authored
      Initially, I wanted to do the simulation inside neoctl but it has no knowledge
      of the topology (the master doesn't send devpath values of storage nodes).
      Therefore, the work is delegated to the master node, which implies a change
      of the protocol.
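      A minimal sketch of the simulate idea described above: the planner runs on
      a copy, so the caller sees the planned changes while the live table stays
      untouched. `plan_tweak`, `tweak` and the list-of-sets table are hypothetical
      stand-ins, not NEO's actual code or protocol.

```python
from copy import deepcopy


def plan_tweak(pt, nodes):
    """Toy planner (hypothetical policy): every node gets a cell of every partition."""
    changes = []
    for offset, cells in enumerate(pt):
        for nid in nodes:
            if nid not in cells:
                changes.append((offset, nid))
    return changes


def tweak(pt, nodes, simulate=False):
    """Apply the plan to `pt`, or to a copy of it when `simulate` is true."""
    target = deepcopy(pt) if simulate else pt
    changes = plan_tweak(target, nodes)
    for offset, nid in changes:
        target[offset].add(nid)
    return changes, target


live_pt = [{1}, {2}]
changes, preview = tweak(live_pt, [1, 2], simulate=True)
```

      Here `preview` holds the table as it would look after the tweak, and
      `live_pt` is left as it was.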
    • Better error reporting from the master to neoctl for denied requests · c2c9e99d
      Julien Muchembled authored
      This stops abusing ProtocolError, which disconnects the admin node needlessly.
      The many 'if ... raise RuntimeError' in neo/neoctl/neoctl.py
      could be turned into assertions.
    • Make the number of replicas modifiable when the cluster is running · ef5fc508
      Julien Muchembled authored
      neoctl gets a new command to change the number of replicas.
      The number of replicas becomes a new partition table attribute and
      like the PT id, it is stored in the config table. On the other hand,
      the configuration value for the number of partitions is dropped,
      since it can be computed from the partition table, which is
      always stored in full.
      The -p/-r master options now only apply at database creation.
      Some implementation notes:
      - The protocol is slightly optimized in that the master now automatically
        sends the whole partition table to the admin & client
        nodes upon connection, like for storage nodes.
        This makes the protocol more consistent, and the master is the
        only remaining node requesting partition tables, during recovery.
      - Some parts become tricky because app.pt can be None in more cases.
        For example, the extra condition in NodeManager.update
        (before app.pt.dropNode) was added for this reason.
        Or the 'loadPartitionTable' method (storage) that is not inlined
        because of unit tests.
        Overall, this commit simplifies more than it complicates.
      - In the master handlers, we stop hijacking the 'connectionCompleted'
        method for tasks to be performed (often send the full partition
        table) on handler switches.
      - The admin's 'bootstrapped' flag could have been removed earlier:
        race conditions can't happen since the AskNodeInformation packet
        was removed (commit d048a52d).
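      To illustrate why the number of partitions no longer needs its own
      configuration value, here is a small sketch. It assumes the table is stored
      as (partition offset, node id, state) rows; the row layout, the state
      letters and the `config` dict are illustrative assumptions, not the real
      schema.

```python
# Hypothetical rows of a fully stored partition table:
# (partition offset, node id, cell state).
pt_rows = [
    (0, 1, "U"), (0, 2, "U"),
    (1, 1, "U"), (1, 2, "O"),
    (2, 1, "U"), (2, 2, "U"),
]

# Hypothetical config values kept next to the table: the number of replicas
# is stored explicitly, like the PT id.
config = {"ptid": 7, "replicas": 1}


def num_partitions(rows):
    """Derive the partition count from the rows instead of reading a config value."""
    return 1 + max(offset for offset, _nid, _state in rows)


partitions = num_partitions(pt_rows)
```

      This only works because the table is stored in full: a partition missing
      from the rows would make the derived count wrong.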
    • New --new-nid storage option for fast cloning · 27e3f620
      Julien Muchembled authored
      It is often faster to set up replicas by stopping a node (and any
      underlying database server like MariaDB) and doing a raw copy of the
      database (e.g. with rsync). So far, it required stopping the whole
      cluster and using tools like 'mysql' or 'sqlite3' to edit:
      - the 'pt' table in databases,
      - the 'config.nid' values of the new nodes.
      With this new option, if you already have 1 replica, you can set up
      new replicas with such fast raw copy, and without interruption of
      service. Obviously, this implies less redundancy during the operation.
  2. 11 Mar, 2019 1 commit
  3. 26 Feb, 2019 1 commit
  4. 07 Nov, 2018 2 commits
  5. 22 Jun, 2018 1 commit
    • Maximize resiliency by taking into account the topology of storage nodes · 97af23cc
      Julien Muchembled authored
      This commit adds a constraint when tweaking the partition table with replicas,
      so that cells of each partition are assigned as far as possible from each
      other, e.g. not on the same machine even if each one has several disks, and
      in any case not on the same storage device.
      Currently, the topology path of each node is automatically calculated by the
      storage backend. Both MySQL and SQLite return a 2-tuple (host, st_dev).
      To be improved:
      - Add a storage option to override the path: the 'tweak' algorithm can already
        handle topology paths of any length, so something like (room, machine, disk)
        could be done easily.
      - Write OS-specific code to determine the real hardware behind st_dev
        (e.g. 2 different 'st_dev' values may actually refer to the same disk,
         because of layers like partitioning, device-mapper, loop, btrfs subvolumes,
         and so on).
      - Make 'neoctl' report in some way if the PT is optimal. Meanwhile,
        if it isn't, the master only logs a WARNING during tweak.
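      A rough sketch of the idea, not the actual 'tweak' algorithm: build a
      (host, st_dev) path as described above, then prefer the candidate whose
      path shares the shortest prefix with the cells already assigned.
      `topology_path`, `common_prefix_len` and `pick_farthest` are hypothetical
      helper names, and the greedy choice is an assumption made for illustration.

```python
import os
import socket


def topology_path(db_path):
    """Return a 2-tuple (host, st_dev) for the filesystem holding db_path."""
    return (socket.gethostname(), os.stat(db_path).st_dev)


def common_prefix_len(a, b):
    """Number of leading path components shared by two topology paths."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_farthest(assigned_paths, candidates):
    """Pick the candidate node that is least close to any already assigned cell.

    candidates maps a node id to its topology path; paths may have any length.
    """
    def closeness(nid):
        return max((common_prefix_len(candidates[nid], p)
                    for p in assigned_paths), default=0)
    # Iterate in sorted order so that ties are resolved deterministically.
    return min(sorted(candidates), key=closeness)


assigned = [("hostA", 2049)]
candidates = {
    "S2": ("hostA", 2049),  # same machine, same device
    "S3": ("hostA", 2050),  # same machine, other device
    "S4": ("hostB", 2049),  # other machine
}
best = pick_farthest(assigned, candidates)
```

      Because only prefixes are compared, the same helper would work unchanged
      with longer paths such as (room, machine, disk).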
  6. 02 Mar, 2018 2 commits
    • master: fix resumption of backup replication (internal or not) · 27229793
      Julien Muchembled authored
      Before, it waited for upstream activity until all partitions are touched.
      However, when upstream is idle the backup cluster could remain stuck forever
      if it was interrupted while some cells were still late.
    • master: fix possible failure when reading data in a backup cluster with replicas · ca2f7061
      Julien Muchembled authored
      Given that:
      - read locks are only taken by transactions (not replication)
      - in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions
        are synchronized up to different tids
      there was a race condition with the master node replying to LastTransaction
      with a TID that may not be replicated yet by all replicas, potentially causing
      such replicas to reply OidDoesNotExist or OidNotFound if a client asks them for data
      too early.
      IOW, even if the cluster does contain the data up to `getBackupTid(max)`,
      it is only readable by NEO clients up to `getBackupTid(min)` as long as the
      cluster is in BACKINGUP state.
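      The difference between the two bounds can be sketched as follows. This
      assumes that the function aggregates the backup TIDs of each partition's
      cells with `mean` and then takes the minimum over partitions; the
      signature and data layout are assumptions and may not match the real
      `getBackupTid`.

```python
def get_backup_tid(pt, mean=max):
    """pt: one dict per partition, mapping a node id to that cell's backup TID.

    Hypothetical simplification: aggregate each partition with `mean`,
    then keep the smallest result over all partitions.
    """
    return min(mean(cells.values()) for cells in pt)


pt = [
    {"S1": 30, "S2": 20},  # S2 has not replicated partition 0 up to 30 yet
    {"S1": 25, "S2": 25},
]
contained_up_to = get_backup_tid(pt, max)  # some replica has the data
readable_up_to = get_backup_tid(pt, min)   # every replica has the data
```

      A client reading at a TID above `readable_up_to` could be routed to S2 for
      partition 0 and miss data that only S1 has so far.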
  7. 12 Jun, 2017 1 commit
    • master: improve algorithm to tweak the partition table · 2ca7c335
      Julien Muchembled authored
      The most important change is that it does not discard readable cells too
      quickly anymore. A partition can now have multiple FEEDING cells, to avoid
      going below the wanted level of replication.
      The new algorithm is also better at minimizing the amount of replication.
  8. 12 May, 2017 1 commit
    • Remove packet timeouts · f6eb02b4
      Julien Muchembled authored
      Since it's no longer worth keeping track of the last connection activity
      (which, btw, ignored TCP ACKs, i.e. timeouts could theoretically be triggered
      before all the data were actually sent), the semantics of closeClient has also
      changed. Before this commit, the 1-minute timeout was reset whenever there was
      activity (connection still used as server). Now, it happens exactly 100 seconds
      after the connection stops being used as client.
  9. 02 May, 2017 1 commit
    • master: fix identification of unknown masters · fbcf9c50
      Julien Muchembled authored
      This fixes the following crash:
        Traceback (most recent call last):
          File "neo/master/handlers/identification.py", line 94, in requestIdentification
            uuid = app.getNewUUID(uuid, address, node_type)
          File "neo/master/app.py", line 449, in getNewUUID
            assert uuid != self.uuid
  10. 27 Apr, 2017 1 commit
  11. 25 Apr, 2017 2 commits
  12. 24 Apr, 2017 3 commits
    • Reimplement election (of the primary master) · 23b6a66a
      Julien Muchembled authored
      The election is not a separate process anymore.
      It happens during the RECOVERING phase, and there's no use of timeouts anymore.
      Each master node keeps a timestamp of when it started to play the primary role,
      and the node with the smallest timestamp is elected. The election stops when
      the cluster is started: as long as it is operational, the primary master can't
      be deposed.
      An election must happen whenever the cluster is not operational anymore, to
      handle the case of a network cut between a primary master and all other nodes:
      then another master node (secondary) takes over and when the initial primary
      master is back, it loses against the new primary master if the cluster is
      already started.
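      The rule can be sketched as below. The dict of timestamps, the function
      names and the tie-break on the node name are assumptions made for
      illustration; only the "smallest timestamp wins" and "no deposing once
      started" rules come from the description above.

```python
def elect(timestamps):
    """timestamps: master name -> time at which it started to play the primary role.

    The smallest timestamp wins; ties are broken by name (an assumption).
    """
    return min(timestamps, key=lambda name: (timestamps[name], name))


def primary_after_contact(current_primary, timestamps, cluster_started):
    """Once the cluster is started, the current primary can't be deposed."""
    if cluster_started:
        return current_primary
    return elect(timestamps)


timestamps = {"M1": 100.0, "M2": 200.0}
during_recovery = primary_after_contact("M2", timestamps, cluster_started=False)
after_start = primary_after_contact("M2", timestamps, cluster_started=True)
```

      The second call models the network-cut case: M2 took over and started the
      cluster, so M1 loses when it comes back even though its timestamp is
      smaller.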
    • Remove BROKEN node state · 9d7f9795
      Julien Muchembled authored
    • On NM update, fix removal of nodes that aren't part of the cluster anymore · f051b7a0
      Julien Muchembled authored
      In order to do that correctly, this commit contains several other changes:
      When connecting to a primary master, a full node list always follows the
      identification. For storage nodes, this means that they now know all nodes
      during the RECOVERING phase.
      The initial full node list now always contains a node tuple for:
      - the server-side node (i.e. the primary master): on a master, this is
        done by always having a node describing itself in its node manager.
      - the client-side node, to make sure it gets an id timestamp:
        now an admin node also receives a node for itself.
  13. 31 Mar, 2017 3 commits
    • Fix race when tweak touches partitions that are being reported as replicated · 87c5178b
      Julien Muchembled authored
      The bug could lead to data corruption (if a partition is wrongly marked as
      UP_TO_DATE) or crashes (assertion failure on either the storage or the master).
      The protocol is extended to handle the following scenario:
          S                                    M
          partition 0 outdated
            <-- UnfinishedTransactions ------>
          replication of partition 0 ...
          partition 1 outdated
            --- UnfinishedTransactions ...
          ... replication finished
            --- ReplicationDone ...
            <-- partition 1 discarded --------
            <-- partition 1 outdated ---------
                ... UnfinishedTransactions -->
                ... ReplicationDone --------->
      The master can't simply mark all outdated cells as being updatable when it
      receives an UnfinishedTransactions packet.
    • Forbid read-accesses to cells that are actually non-readable · 64afd7d2
      Julien Muchembled authored
      After an attempt to read from a non-readable cell, which happens when a client has
      a newer or older PT than storage's, the client now retries to read.
      This bugfix is for all kinds of read-access except undoLog, which can still
      report incomplete results.
  14. 23 Mar, 2017 2 commits
  15. 18 Mar, 2017 1 commit
    • master: fix crash when a transaction begins while a storage node starts operation · 781b4eb5
      Julien Muchembled authored
      Traceback (most recent call last):
        File "neo/lib/handler.py", line 72, in dispatch
          method(conn, *args, **kw)
        File "neo/master/handlers/client.py", line 70, in askFinishTransaction
        File "neo/master/transactions.py", line 387, in prepare
          assert node_list, (ready, failed)
      AssertionError: (set([]), frozenset([]))
      Master log leading to the crash:
        PACKET    #0x0009 StartOperation                 > S1
        PACKET    #0x0004 BeginTransaction               < C1
        DEBUG     Begin <...>
        PACKET    #0x0004 AnswerBeginTransaction         > C1
        PACKET    #0x0001 NotifyReady                    < S1
      It was wrong to process BeginTransaction before receiving NotifyReady.
      The changes in the storage are cosmetic: the 'ready' attribute has become
      redundant with 'operational'.
  16. 21 Feb, 2017 2 commits
    • Implement deadlock avoidance · 092992db
      Julien Muchembled authored
      This is a first version with several optimizations possible:
      - improve EventQueue (or implement a specific queue) to minimize deadlocks
      - turn the RebaseObject packet into a notification
      Sorting oids could also be useful to reduce the probability of deadlocks,
      but that would never be enough to avoid them completely, even if there's a
      single storage. For example:
      1. C1 does a first store (x or y)
      2. C2 stores x and y; one is delayed
      3. C1 stores the other -> deadlock
         When solving the deadlock, the data of the first store may only
         exist on the storage.
      2 functional tests are removed because they're redundant,
      either with ZODB tests or with the new threaded tests.
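      The three-step example above can be replayed with a toy lock table to
      show the cycle. This only models why sorting oids is not enough; it is not
      the avoidance mechanism of this commit, and `store`, `has_deadlock` and the
      dicts are hypothetical.

```python
def store(locks, waits, client, oid):
    """Try to lock oid for client; record who the client waits for if it is taken."""
    holder = locks.setdefault(oid, client)
    if holder != client:
        waits[client] = holder  # the store is delayed
        return False
    return True


def has_deadlock(waits):
    """Detect a cycle in the wait-for graph (client -> client it waits for)."""
    for start in waits:
        seen = set()
        node = start
        while node in waits:
            if node in seen:
                return True
            seen.add(node)
            node = waits[node]
    return False


locks, waits = {}, {}
store(locks, waits, "C1", "x")  # 1. C1 does a first store
store(locks, waits, "C2", "x")  # 2. C2 stores x (delayed behind C1) ...
store(locks, waits, "C2", "y")  #    ... and y (granted)
before_step_3 = has_deadlock(waits)
store(locks, waits, "C1", "y")  # 3. C1 stores the other: it now waits for C2
after_step_3 = has_deadlock(waits)
```

      C1 cannot know at step 1 that it will later store y, so ordering its own
      oids would not have prevented the cycle.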
    • Fixes/improvements to EventQueue · cc8d0a7c
      Julien Muchembled authored
      - Make sure that errors while processing a delayed packet are reported to the
        connection that sent this packet.
      - Provide a mechanism to process events for the same connection in
        chronological order.
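      A simplified sketch of both points, assuming a queue of (connection,
      callable, args) entries. `OrderedEventQueue`, `Conn` and its `errors` list
      are hypothetical and much simpler than the real EventQueue.

```python
from collections import deque


class Conn:
    """Stand-in for a connection; `errors` collects what would be sent back."""
    def __init__(self, name):
        self.name = name
        self.errors = []


class OrderedEventQueue:
    def __init__(self):
        self._queue = deque()

    def queue(self, conn, func, *args):
        """Delay an event."""
        self._queue.append((conn, func, args))

    def handle(self, conn, func, *args):
        """Run now, unless older events of the same connection are still delayed."""
        if any(c is conn for c, _, _ in self._queue):
            self.queue(conn, func, *args)
        else:
            func(conn, *args)

    def execute_queued(self):
        """Process delayed events in arrival order, reporting errors to the sender."""
        pending, self._queue = self._queue, deque()
        for conn, func, args in pending:
            try:
                func(conn, *args)
            except Exception as exc:
                conn.errors.append(exc)


log = []
a = Conn("A")
q = OrderedEventQueue()
q.queue(a, lambda conn: log.append("first"))    # delayed event
q.handle(a, lambda conn: log.append("second"))  # must not overtake "first"
log_before = list(log)
q.execute_queued()
```

      Without the check in `handle`, the second event would run immediately and
      be processed before the older, delayed one.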
  17. 14 Feb, 2017 2 commits
  18. 02 Feb, 2017 1 commit
  19. 18 Jan, 2017 1 commit
  20. 17 Jan, 2017 1 commit
  21. 04 Jan, 2017 1 commit
    • qa: rewrite testReplicationBlockedByUnfinished as a threaded test · d3cb8888
      Julien Muchembled authored
      It is extended to check that the storage is only notified about the
      transactions that existed at the time it asked for them. Otherwise,
      Replicator.transactionFinished would be called more than once, and
      `self.ttid_set.remove(ttid)` would raise KeyError.
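      The failure mode is plain Python set behaviour, which this small sketch
      reproduces. `ReplicatorSketch` is a hypothetical stand-in that only keeps
      the `ttid_set` attribute and the method name mentioned above.

```python
class ReplicatorSketch:
    """Hypothetical stand-in: tracks the transactions replication is waiting for."""

    def __init__(self, ttids):
        self.ttid_set = set(ttids)

    def transactionFinished(self, ttid):
        # set.remove() raises KeyError when the element is absent,
        # so a second notification for the same ttid fails.
        self.ttid_set.remove(ttid)
        return not self.ttid_set  # True once nothing is pending anymore


replicator = ReplicatorSketch([1, 2])
replicator.transactionFinished(1)
try:
    replicator.transactionFinished(1)
    second_call_failed = False
except KeyError:
    second_call_failed = True
```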
      The functional version also contained an annoying 'sleep(10)'.
  22. 22 Dec, 2016 1 commit
  23. 21 Dec, 2016 1 commit
    • master: fix possibly wrong knowledge of cells' backup_tid when resuming backup · 17af3b47
      Julien Muchembled authored
      The issue happens when there were commits while the backup cluster was down.
      In this case, the master thinks that these commits are already replicated,
      reporting wrong backup_tid to neoctl. It resolved by itself once:
      - there are new commits triggering replication for all partitions;
      - all storage nodes have really replicated.
      This also resulted in an inconsistent database when leaving backup mode during
      this period.
  24. 06 Dec, 2016 1 commit
    • master,client: ignore notifications before complete initialization · 36b2d141
      Julien Muchembled authored
      A backup master crashed with the following traceback after a reconnection:
          Traceback (most recent call last):
            File "neo/master/app.py", line 127, in run
            File "neo/master/app.py", line 147, in _run
            File "neo/master/app.py", line 348, in playPrimaryRole
            File "neo/master/backup_app.py", line 123, in provideService
            File "neo/lib/event.py", line 126, in poll
            File "neo/lib/connection.py", line 500, in process
              self._handlers.handle(self, self._queue.pop(0))
            File "neo/lib/connection.py", line 110, in handle
              self._handle(connection, packet)
            File "neo/lib/connection.py", line 125, in _handle
              handler.packetReceived(connection, packet)
            File "neo/lib/handler.py", line 117, in packetReceived
            File "neo/lib/handler.py", line 66, in dispatch
              method(conn, *args, **kw)
            File "neo/master/handlers/backup.py", line 52, in invalidateObjects
              app.invalidatePartitions(tid, partition_set)
            File "neo/master/backup_app.py", line 257, in invalidatePartitions
            File "neo/master/backup_app.py", line 281, in triggerBackup
              assert cell_list, offset
          AssertionError: 0
  25. 27 Nov, 2016 1 commit
    • Fix identification issues, including a race condition causing id conflicts · 9385706f
      Julien Muchembled authored
      The added test describes how the new id timestamps fix the race condition.
      These timestamps could be any unique opaque values, and the protocol is
      extended to exchange them along with node ids.
      Internally, nodes also reuse timestamps as a marker to identify the first
      NotifyNodeInformation packets from the master: since this packet is a complete
      list of nodes in the cluster, any other node in the node manager has left the
      cluster definitely and is removed.
      The secondary masters didn't receive updates about master nodes.
      It's also useless to send them information about non-master nodes.
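      The pruning done on the first, complete node list can be sketched as
      follows. `NodeManagerSketch` is hypothetical, and it uses a plain boolean
      flag where the commit reuses the id timestamps as the marker.

```python
class NodeManagerSketch:
    """Hypothetical node manager: maps a node id to its id timestamp."""

    def __init__(self, nodes=None):
        self.nodes = dict(nodes or {})
        self.got_full_list = False  # the real code reuses timestamps as the marker

    def update(self, node_list):
        """node_list: iterable of (node id, id timestamp) pairs."""
        node_list = list(node_list)
        for nid, timestamp in node_list:
            self.nodes[nid] = timestamp
        if not self.got_full_list:
            # The first notification is a complete list: any node it does not
            # mention has left the cluster for good.
            listed = {nid for nid, _ in node_list}
            for nid in list(self.nodes):
                if nid not in listed:
                    del self.nodes[nid]
            self.got_full_list = True


nm = NodeManagerSketch({"S9": 1.0})      # stale knowledge from a previous run
nm.update([("M1", 5.0), ("S1", 6.0)])    # first, complete list
after_first = dict(nm.nodes)
nm.update([("S2", 7.0)])                 # later, incremental update
```

      Later notifications are treated as incremental, so they never remove
      nodes they do not mention.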