1. 27 Apr, 2019 8 commits
    • Julien Muchembled's avatar
      master: reject drop/tweak ctl commands that could lead to unwanted status · 55a6dd0f
      Julien Muchembled authored
      The following 2 operations can be onerous and should not be
      directly usable without some kind of confirmation by the user:
      - Dropping a node now requires stopping it first.
      - Tweaking no longer automatically excludes DOWN nodes,
        because a node could go DOWN between the moment the user sends
        the tweak command and the actual tweak by the master.
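The two checks described in this commit message can be illustrated with a minimal sketch. It is not the master's actual code: `NodeState`, `Denied`, `check_drop` and `check_tweak` are hypothetical names, and the tweak rule shown (refuse when a DOWN node was not excluded explicitly) is one plausible reading of the message.

```python
from enum import Enum


class NodeState(Enum):
    # Hypothetical subset of node states, for illustration only.
    RUNNING = "RUNNING"
    DOWN = "DOWN"


class Denied(Exception):
    """The master refuses the command; the user must adjust the request."""


def check_drop(state):
    # A node may only be dropped once it has been stopped.
    if state is not NodeState.DOWN:
        raise Denied("node must be stopped before it is dropped")


def check_tweak(states, excluded=()):
    # DOWN nodes are not excluded silently: the user has to list them,
    # which acts as the confirmation mentioned in the commit message.
    forgotten = sorted(
        nid for nid, state in states.items()
        if state is NodeState.DOWN and nid not in excluded)
    if forgotten:
        raise Denied("DOWN nodes not excluded explicitly: %s" % forgotten)
```

With this shape, a node that goes DOWN after the user sent the command makes the tweak fail instead of silently changing its outcome.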
    • Julien Muchembled's avatar
      tweak: add option to simulate · 2a27239d
      Julien Muchembled authored
      Initially, I wanted to do the simulation inside neoctl but it has no knowledge
      of the topology (the master doesn't send devpath values of storage nodes).
      Therefore, the work is delegated to the master node, which implies a change
      of the protocol.
    • Julien Muchembled's avatar
      Better error reporting from the master to neoctl for denied requests · c2c9e99d
      Julien Muchembled authored
      This stops abusing ProtocolError, which disconnects the admin node needlessly.
      The many 'if ... raise RuntimeError' in neo/neoctl/neoctl.py
      could be turned into assertions.
    • Julien Muchembled's avatar
      Make the number of replicas modifiable when the cluster is running · ef5fc508
      Julien Muchembled authored
      neoctl gets a new command to change the number of replicas.
      The number of replicas becomes a new partition table attribute and,
      like the PT id, it is stored in the config table. On the other hand,
      the configuration value for the number of partitions is dropped,
      since it can be computed from the partition table, which is
      always stored in full.
      The -p/-r master options now only apply at database creation.
      Some implementation notes:
      - The protocol is slightly optimized in that the master now
        automatically sends the whole partition table to the admin & client
        nodes upon connection, as it does for storage nodes.
        This makes the protocol more consistent, and the master is the
        only remaining node requesting partition tables, during recovery.
      - Some parts become tricky because app.pt can be None in more cases.
        For example, this is why the extra condition in NodeManager.update
        (before app.pt.dropNode) was added.
        Another example is the 'loadPartitionTable' method (storage), which
        is not inlined because of unit tests.
        Overall, this commit simplifies more than it complicates.
      - In the master handlers, we stop hijacking the 'connectionCompleted'
        method for tasks to be performed (often send the full partition
        table) on handler switches.
      - The admin's 'bootstrapped' flag could have been removed earlier:
        race conditions can't happen since the AskNodeInformation packet
        was removed (commit d048a52d).
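As an illustration of why the number of partitions no longer needs its own configuration value, here is a minimal sketch using an in-memory SQLite database. The schema is hypothetical (the column names `part_id`, `nid`, `state` and the `replicas` config key are assumptions, not NEO's actual layout); it only shows that a fully stored partition table determines the partition count, whereas the replica count has to be stored separately.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical, simplified tables for illustration.
    conn.execute("CREATE TABLE config (name TEXT PRIMARY KEY, value TEXT)")
    conn.execute(
        "CREATE TABLE pt (part_id INTEGER, nid INTEGER, state INTEGER,"
        " PRIMARY KEY (part_id, nid))")


def num_partitions(conn):
    # The table is stored in full, so every partition has at least one row.
    (count,) = conn.execute(
        "SELECT COUNT(DISTINCT part_id) FROM pt").fetchone()
    return count


def num_replicas(conn):
    # Stored next to the PT id in the config table; None if not set yet.
    row = conn.execute(
        "SELECT value FROM config WHERE name = 'replicas'").fetchone()
    return None if row is None else int(row[0])
```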
    • Julien Muchembled's avatar
      New --new-nid storage option for fast cloning · 27e3f620
      Julien Muchembled authored
      It is often faster to set up replicas by stopping a node (and any
      underlying database server like MariaDB) and doing a raw copy of the
      database (e.g. with rsync). So far, this required stopping the whole
      cluster and using tools like 'mysql' or 'sqlite3' to edit:
      - the 'pt' table in databases,
      - the 'config.nid' values of the new nodes.
      With this new option, if you already have 1 replica, you can set up
      new replicas with such a fast raw copy, and without interruption of
      service. Obviously, this implies less redundancy during the operation.
  2. 21 Mar, 2019 1 commit
  3. 11 Mar, 2019 1 commit
  4. 26 Feb, 2019 1 commit
  5. 31 Dec, 2018 1 commit
  6. 05 Dec, 2018 1 commit
  7. 07 Nov, 2018 2 commits
  8. 07 Aug, 2018 1 commit
    • Julien Muchembled's avatar
      Use argparse instead of optparse · 9f1e4eef
      Julien Muchembled authored
      Besides the use of another module for option parsing, the main change is that
      there's no more Config class that mixes configuration for different components.
      Application classes now take a simple 'dict' with parsed values.
      The changes in 'neoctl' are somewhat ugly, because command-line options are not
      defined on the command-line class, but this component is likely to disappear
      in the future.
      It remains possible to pass options via a configuration file. The code is a bit
      complex but isolated in neo.lib.config.
      For SSL, the code could be simpler if we switched to a single --ssl option that
      takes 3 paths. This was not done, to avoid breaking compatibility. Hence the
      hack with an extra OptionList class in neo.lib.app.
      A new functional test tests the 'neomigrate' script, instead of just the
      internal API to migrate data.
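The pattern described above (argparse output handed to an application class as a plain dict) can be sketched as follows. The option names and the `Application` class are hypothetical and much simpler than the real code in neo.lib.config and neo.lib.app; only the standard argparse calls are real.

```python
import argparse


def parse_options(argv):
    # Hypothetical options, loosely inspired by those named in this log.
    parser = argparse.ArgumentParser(description="example node")
    parser.add_argument("-n", "--name", required=True, help="cluster name")
    parser.add_argument("-r", "--replicas", type=int, default=0)
    parser.add_argument("--new-nid", action="store_true")
    # vars() turns the Namespace into a plain dict of parsed values.
    return vars(parser.parse_args(argv))


class Application:
    # The application only sees a dict, not a shared Config object.
    def __init__(self, config):
        self.name = config["name"]
        self.replicas = config["replicas"]
        self.new_nid = config["new_nid"]
```

For example, `Application(parse_options(["-n", "test", "-r", "2"]))` builds an application without any component-specific configuration class in between.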
  9. 22 Jun, 2018 1 commit
    • Julien Muchembled's avatar
      Maximize resiliency by taking into account the topology of storage nodes · 97af23cc
      Julien Muchembled authored
      This commit adds a constraint when tweaking the partition table with replicas,
      so that cells of each partition are assigned as far as possible from each
      other, e.g. not on the same machine even if each one has several disks, and
      in any case not on the same storage device.
      Currently, the topology path of each node is automatically calculated by the
      storage backend. Both MySQL and SQLite return a 2-tuple (host, st_dev).
      To be improved:
      - Add a storage option to override the path: the 'tweak' algorithm can already
        handle topology paths of any length, so something like (room, machine, disk)
        could be done easily.
      - Write OS-specific code to determine the real hardware behind st_dev
        (e.g. 2 different 'st_dev' values may actually refer to the same disk,
         because of layers like partitioning, device-mapper, loop, btrfs subvolumes,
         and so on).
      - Make 'neoctl' report in some way if the PT is optimal. Meanwhile,
        if it isn't, the master only logs a WARNING during tweak.
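To make the idea of topology paths concrete, here is a minimal sketch. It is not the tweak algorithm itself: `topology_path`, `shared_prefix` and `pick_farthest` are hypothetical helpers, and using the local hostname for the first path element is an assumption. It only shows how a (host, st_dev) tuple can be built and how paths of any length can be compared to prefer nodes that are far from those already holding a cell.

```python
import os
import socket


def topology_path(db_path):
    # (host, st_dev): same shape as the 2-tuple mentioned above.
    return (socket.gethostname(), os.stat(db_path).st_dev)


def shared_prefix(a, b):
    # Number of leading path elements two nodes have in common:
    # 0 = different hosts, 1 = same host, 2 = same host and device.
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def pick_farthest(candidates, used):
    # candidates: dict node id -> path; used: paths already holding a cell.
    def closeness(item):
        node_id, path = item
        worst = max((shared_prefix(path, u) for u in used), default=0)
        return (worst, node_id)
    return min(candidates.items(), key=closeness)[0]
```

Because `shared_prefix` works on tuples of any length, a longer path such as (room, machine, disk) would need no change to the comparison.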
  10. 29 Mar, 2018 1 commit
    • Julien Muchembled's avatar
      master: automatically discard feeding cells that get out-of-date · 3efbbfe3
      Julien Muchembled authored
      This is a follow-up of commit 2ca7c335,
      which changed 'tweak' not to discard readable cells too quickly.
      The scenario of a storage being lost while it has feeding cells was forgotten.
      These must be discarded immediately, otherwise we end up with more up-to-date
      cells than wanted. Without the change in outdate(), testSafeTweak would end
      with: UU.|U.U|UUU
      Once replication is optimized not to always restart checking cells from the
      - Remembering that an out-of-date cell was feeding could be a safer
        option, but it may not be worth the extra complexity.
      - Another possibility may be to replace the FEEDING state by an automatic
        partial tweak that only discards surplus up-to-date cells whenever a cell
        becomes up-to-date.
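The behaviour described in this commit can be sketched on a single partition. This is a simplified model, not the master's outdate(): cells are a dict mapping a node id to a one-letter state, with 'U' (up-to-date), 'O' (out-of-date) and 'F' (feeding) as assumed notation.

```python
def outdate(cells, lost):
    # cells: dict node id -> state for one partition.
    # lost: set of node ids that are no longer running.
    result = {}
    for node, state in cells.items():
        if node in lost:
            if state == "F":
                # A feeding cell that would become out-of-date is
                # discarded immediately instead of being kept.
                continue
            state = "O"
        result[node] = state
    return result
```

For example, `outdate({"s1": "U", "s2": "F", "s3": "U"}, {"s1", "s2"})` keeps s1 as out-of-date, drops the feeding cell of s2 and leaves s3 untouched.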
  11. 02 Mar, 2018 3 commits
    • Julien Muchembled's avatar
      master: fix resumption of backup replication (internal or not) · 27229793
      Julien Muchembled authored
      Before, it waited for upstream activity until all partitions were touched.
      However, when upstream is idle, the backup cluster could remain stuck forever
      if it was interrupted while some cells were still late.
    • Julien Muchembled's avatar
      master: fix/simplify generation of TID · 7b2e6752
      Julien Muchembled authored
      The 'min_tid < new_tid' assertion failed when jumping to the past.
    • Julien Muchembled's avatar
      master: fix possible failure when reading data in a backup cluster with replicas · ca2f7061
      Julien Muchembled authored
      Given that:
      - read locks are only taken by transactions (not replication)
      - in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions
        are synchronized up to different tids
      there was a race condition with the master node replying to LastTransaction
      with a TID that may not be replicated yet by all replicas, potentially causing
      such replicas to reply OidDoesNotExist or OidNotFound if a client asks them
      for data too early.
      IOW, even if the cluster does contain the data up to `getBackupTid(max)`,
      it is only readable by NEO clients up to `getBackupTid(min)` as long as the
      cluster is in BACKINGUP state.
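The difference between `getBackupTid(max)` and `getBackupTid(min)` can be illustrated with a small model. The function below is an assumption about the shape of the computation, not the actual method: each partition is represented as a list of the TIDs up to which its replicas are synchronized.

```python
def get_backup_tid(partitions, mean=max):
    # For each partition, combine the TIDs of its cells with 'mean',
    # then take the minimum over all partitions.
    return min(mean(cell_tids) for cell_tids in partitions)
```

With `partitions = [[10, 8], [9, 9]]`, `get_backup_tid(partitions, max)` is 9 (the data exists in the cluster up to that TID), whereas `get_backup_tid(partitions, min)` is 8 (every replica can serve reads up to that TID).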
  12. 12 Jun, 2017 1 commit
    • Julien Muchembled's avatar
      master: improve algorithm to tweak the partition table · 2ca7c335
      Julien Muchembled authored
      The most important change is that it does not discard readable cells too
      quickly anymore. A partition can now have multiple FEEDING cells, to avoid
      going below the wanted level of replication.
      The new algorithm is also better at minimizing the amount of replication.
  13. 12 May, 2017 1 commit
    • Julien Muchembled's avatar
      Remove packet timeouts · f6eb02b4
      Julien Muchembled authored
      Since it is no longer worth keeping track of the last connection activity
      (which, btw, ignored TCP ACKs, i.e. timeouts could theoretically be triggered
      before all the data were actually sent), the semantics of closeClient have also
      changed. Before this commit, the 1-minute timeout was reset whenever there was
      activity (connection still used as server). Now, it happens exactly 100 seconds
      after the connection is not used anymore as client.
  14. 10 May, 2017 1 commit
  15. 02 May, 2017 1 commit
    • Julien Muchembled's avatar
      master: fix identification of unknown masters · fbcf9c50
      Julien Muchembled authored
      This fixes the following crash:
        Traceback (most recent call last):
          File "neo/master/handlers/identification.py", line 94, in requestIdentification
            uuid = app.getNewUUID(uuid, address, node_type)
          File "neo/master/app.py", line 449, in getNewUUID
            assert uuid != self.uuid
  16. 27 Apr, 2017 1 commit
  17. 25 Apr, 2017 2 commits
  18. 24 Apr, 2017 3 commits
    • Julien Muchembled's avatar
      Reimplement election (of the primary master) · 23b6a66a
      Julien Muchembled authored
      The election is not a separate process anymore.
      It happens during the RECOVERING phase, and there's no use of timeouts anymore.
      Each master node keeps a timestamp of when it started to play the primary role,
      and the node with the smallest timestamp is elected. The election stops when
      the cluster is started: as long as it is operational, the primary master can't
      be deposed.
      An election must happen whenever the cluster is not operational anymore, to
      handle the case of a network cut between a primary master and all other nodes:
      then another master node (secondary) takes over and when the initial primary
      master is back, it loses against the new primary master if the cluster is
      already started.
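The election rule can be summarized with a minimal sketch. It is a simplification under stated assumptions, not the real implementation: `Candidate` and `elect` are hypothetical names, and breaking ties by address is an assumption added to make the result deterministic.

```python
from collections import namedtuple

# address: (host, port); primary_since: timestamp of when the node
# started to play the primary role; started: whether its cluster is
# already started (operational).
Candidate = namedtuple("Candidate", "address primary_since started")


def elect(candidates):
    # A primary whose cluster is already started cannot be deposed.
    started = [c for c in candidates if c.started]
    pool = started or candidates
    # Otherwise, the smallest timestamp wins.
    winner = min(pool, key=lambda c: (c.primary_since, c.address))
    return winner.address
```

This covers the network cut scenario: the initial primary has the smaller timestamp, but it still loses if the secondary that took over has already started the cluster.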
    • Julien Muchembled's avatar
      Remove BROKEN node state · 9d7f9795
      Julien Muchembled authored
    • Julien Muchembled's avatar
      On NM update, fix removal of nodes that aren't part of the cluster anymore · f051b7a0
      Julien Muchembled authored
      In order to do that correctly, this commit contains several other changes:
      When connecting to a primary master, a full node list always follows the
      identification. For storage nodes, this means that they now know all nodes
      during the RECOVERING phase.
      The initial full node list now always contains a node tuple for:
      - the server-side node (i.e. the primary master): on a master, this is
        done by always having a node describing itself in its node manager.
      - the client-side node, to make sure it gets an id timestamp:
        now an admin node also receives a node for itself.
  19. 31 Mar, 2017 5 commits
  20. 23 Mar, 2017 2 commits
  21. 18 Mar, 2017 1 commit
    • Julien Muchembled's avatar
      master: fix crash when a transaction begins while a storage node starts operation · 781b4eb5
      Julien Muchembled authored
      Traceback (most recent call last):
        File "neo/lib/handler.py", line 72, in dispatch
          method(conn, *args, **kw)
        File "neo/master/handlers/client.py", line 70, in askFinishTransaction
        File "neo/master/transactions.py", line 387, in prepare
          assert node_list, (ready, failed)
      AssertionError: (set([]), frozenset([]))
      Master log leading to the crash:
        PACKET    #0x0009 StartOperation                 > S1
        PACKET    #0x0004 BeginTransaction               < C1
        DEBUG     Begin <...>
        PACKET    #0x0004 AnswerBeginTransaction         > C1
        PACKET    #0x0001 NotifyReady                    < S1
      It was wrong to process BeginTransaction before receiving NotifyReady.
      The changes in the storage are cosmetic: the 'ready' attribute has become
      redundant with 'operational'.
  22. 14 Mar, 2017 1 commit