1. 11 Jan, 2021 2 commits
  2. 02 Oct, 2020 1 commit
    • Fix handling of -m/--masters arg · fa63d856
      Julien Muchembled authored
      For the master, the purpose of -m/--masters is to specify addresses
      of other master nodes, since its own address is already known via
      -b/--bind. Therefore, an empty value for -m/--masters is valid.
      The user remains free to repeat the -b value in -m.
      More generally, a node may choose to only specify master addresses
      via -D/--dynamic-master-list, so the check that at least one master
      address is specified is moved where the NodeManager is expected to be
  3. 29 Sep, 2020 1 commit
  4. 25 Sep, 2020 1 commit
    • New algorithm for deadlock avoidance · 5e7f34d2
      Julien Muchembled authored
      The time complexity of the previous one was too high. With several tens of
      concurrent transactions, we saw commits take minutes to complete and
      the whole application looked frozen.
      This new algorithm is much simpler. Instead of asking the oldest
      transaction to somewhat restart (we used the "rebase" term because
      the concept was similar to what git-rebase does), the storage gives
      it priority and the newest is asked to relock (this request is ignored
      if vote already happened, which means there was actually no deadlock).
      testLocklessWriteDuringConflictResolution was initially more complex
      because Transaction.written (client) ignored KeyError (which is not the
      case anymore since commit 8ef1ddba).
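The priority rule described above can be sketched as follows. This is a toy model under stated assumptions, not NEO's actual code: the `Storage` class, its `lock` method and the `relock_requests` list are hypothetical names, transaction age is modelled by an integer `ttid` (smaller means older), and "the request is ignored if vote already happened" is simplified to "the holder keeps its lock".

```python
class Storage:
    """Toy write-lock table: the oldest transaction gets priority."""

    def __init__(self):
        self.locks = {}            # oid -> ttid of the current lock holder
        self.voted = set()         # ttids that have already voted
        self.relock_requests = []  # (ttid, oid) requests sent to newer holders

    def lock(self, ttid, oid):
        """Try to lock oid for ttid; return True if the lock is granted."""
        holder = self.locks.get(oid)
        if holder is None or holder == ttid:
            self.locks[oid] = ttid
            return True
        if ttid < holder and holder not in self.voted:
            # The requester is older: it gets the lock and the newer
            # holder is asked to relock (hypothetical notification).
            self.locks[oid] = ttid
            self.relock_requests.append((holder, oid))
            return True
        # The requester is newer, or the holder has already voted
        # (so there was no deadlock): the requester has to wait.
        return False
```

In this model no transaction is restarted: a conflict is resolved by one lock transfer and one notification, which is where the simplification over the "rebase" approach comes from.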
  5. 16 Mar, 2020 1 commit
  6. 14 Feb, 2020 1 commit
    • master: fix tpc_finish possibly trying to kill too many nodes after client-storage failures · 82eea0cd
      Julien Muchembled authored
      When concurrent transactions fail with different storages (e.g. only network
      issues between C1-S2 and C2-S1), in such a way that each transaction can be
      committed but not both (or the cluster would be non-operational), and if the
      first transaction is aborted (between tpc_vote and tpc_finish), then the second
      wrongly failed with INCOMPLETE_TRANSACTION.
      And if both transactions could be committed (e.g. more than 1 replica),
      some nodes would be disconnected for nothing.
  7. 10 Jan, 2020 1 commit
    • master: fix crash of backup master when disconnected from upstream while serving clients · 7e8ca9ec
      Julien Muchembled authored
      This fixes:
        Traceback (most recent call last):
          File "neo/master/app.py", line 172, in run
          File "neo/master/app.py", line 182, in _run
          File "neo/master/app.py", line 314, in playPrimaryRole
          File "neo/master/backup_app.py", line 101, in provideService
          File "neo/master/app.py", line 474, in changeClusterState
            ) or not node.isClient(), (state, node)
        AssertionError: (<EnumItem STARTING_BACKUP (4)>, <ClientNode(uuid=C1, state=RUNNING, connection=<ServerConnection(nid=C1, address=, handler=ClientReadOnlyServiceHandler, fd=59, on_close=onConnectionClosed, server) at 7f38f5628390>) at 7f38f5628ad0>)
  8. 14 Oct, 2019 1 commit
  9. 16 Aug, 2019 1 commit
    • New feature: monitoring · e434c253
      Julien Muchembled authored
      This task is done by the admin node, in 2 possible ways:
      - email notifications, as soon as some state changes;
      - new 'neoctl print summary' command that can be used periodically
        to check the health of the database.
      They report the same information.
      About backup clusters:
      The admin of the main cluster also monitors selected backup clusters,
      with the help of their admin nodes.
      Internally, when a backup master node connects to the upstream master node,
      it receives the address of the upstream admin node and forwards it to its
      admin node, which is therefore able to connect to the upstream admin node.
      So the 2 admin nodes remain connected and communicate in 2 ways:
      - the backup node notifies upstream about the health of the backup cluster;
      - the upstream node queries the backup node periodically to check that
        replication is not too late.
      A few things are hard-coded and we may want to configure them:
      - backup lateness is checked every 10 min;
      - backup is expected to never be late.
      There's also no delay to prevent 2 consecutive emails from having the same
      Date: (unfortunately, RFC 5322 does not allow sub-second precision),
      in which case the MUA can display them in random order. This is mostly
      confusing when one notification is OK and the other is not, because one
      may wonder if there's a new problem.
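The Date: limitation can be reproduced with Python's standard library alone. The sketch below is only an illustration (the timestamps are made up and this is not the admin node's code): two datetimes half a second apart are formatted with `email.utils.format_datetime` and yield identical header values.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Two notifications emitted within the same second (illustrative values).
first = datetime(2019, 8, 16, 12, 0, 0, 100000, tzinfo=timezone.utc)
second = first + timedelta(milliseconds=500)

# RFC 5322 dates have a resolution of one second,
# so the microseconds are dropped when formatting.
date_first = format_datetime(first)
date_second = format_datetime(second)

print(date_first == date_second)
```

Since both headers are equal, a MUA that sorts by Date: has no way to recover the original order of the two messages.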
  10. 05 Jun, 2019 1 commit
    • Introduce extra node properties · 82c142c4
      Julien Muchembled authored
      Explicit fields in RequestIdentification are only suitable for the actual
      identification or for properties that most nodes have.
      But some current (and future) features require passing values (always and
      as soon as possible) for tasks that are unrelated to identification.
  11. 30 Apr, 2019 1 commit
    • master: fix crash in STARTING_BACKUP when connecting to an upstream secondary master · dba07e72
      Julien Muchembled authored
      This fixes the following assertion:
        Traceback (most recent call last):
          File "neo/master/app.py", line 172, in run
          File "neo/master/app.py", line 182, in _run
          File "neo/master/app.py", line 302, in playPrimaryRole
          File "neo/master/backup_app.py", line 114, in provideService
            node, conn = bootstrap.getPrimaryConnection()
          File "neo/lib/bootstrap.py", line 74, in getPrimaryConnection
          File "neo/lib/event.py", line 160, in poll
          File "neo/lib/connection.py", line 504, in process
            self._handlers.handle(self, self._queue.pop(0))
          File "neo/lib/connection.py", line 92, in handle
            self._handle(connection, packet)
          File "neo/lib/connection.py", line 107, in _handle
            pending[0][1].packetReceived(connection, packet)
          File "neo/lib/handler.py", line 125, in packetReceived
          File "neo/lib/handler.py", line 75, in dispatch
            method(conn, *args, **kw)
          File "neo/lib/handler.py", line 159, in notPrimaryMaster
            assert primary != self.app.server
        AttributeError: 'BackupApplication' object has no attribute 'server'
  12. 28 Apr, 2019 1 commit
    • protocol: switch to msgpack for packet serialization · 9d0bf97a
      Julien Muchembled authored
      Not only for performance reasons (at least 3% faster) but also because of
      several ugly things in the way packets were defined:
      - packet field names, which are only documentary; for root fields,
        they even just duplicate the packet names
      - a lot of repetitions for packet names, and even confusion between the name
        of the packet definition and the name of the actual notify/request packet
      - the need to implement field types for anything, like PByte to support new
        compression formats, since PBoolean is not enough
      neo/lib/protocol.py is now much smaller.
  13. 27 Apr, 2019 8 commits
    • master: reject drop/tweak ctl commands that could lead to unwanted status · 55a6dd0f
      Julien Muchembled authored
      The following 2 operations can be onerous and they should not be
      directly usable without some kind of confirmation by the user:
      - Dropping a node now requires stopping it first.
      - Tweaking no longer automatically excludes DOWN nodes,
        because a node could go DOWN between the moment the user sends
        the command to tweak and the actual tweak by the master.
    • tweak: add option to simulate · 2a27239d
      Julien Muchembled authored
      Initially, I wanted to do the simulation inside neoctl but it has no knowledge
      of the topology (the master doesn't send devpath values of storage nodes).
      Therefore, the work is delegated to the master node, which implies a change
      of the protocol.
    • Better error reporting from the master to neoctl for denied requests · c2c9e99d
      Julien Muchembled authored
      This stops abusing ProtocolError, which disconnects the admin node needlessly.
      The many 'if ... raise RuntimeError' in neo/neoctl/neoctl.py
      could be turned into assertions.
    • Make the number of replicas modifiable when the cluster is running · ef5fc508
      Julien Muchembled authored
      neoctl gets a new command to change the number of replicas.
      The number of replicas becomes a new partition table attribute and
      like the PT id, it is stored in the config table. On the other hand,
      the configuration value for the number of partitions is dropped,
      since it can be computed from the partition table, which is
      always stored in full.
      The -p/-r master options now only apply at database creation.
      Some implementation notes:
      - The protocol is slightly optimized in that the master now automatically
        sends the whole partition table to the admin & client
        nodes upon connection, like for storage nodes.
        This makes the protocol more consistent, and the master is the
        only remaining node requesting partition tables, during recovery.
      - Some parts become tricky because app.pt can be None in more cases.
        For example, the extra condition in NodeManager.update
        (before app.pt.dropNode) was added for this reason.
        Or the 'loadPartitionTable' method (storage) that is not inlined
        because of unit tests.
        Overall, this commit simplifies more than it complicates.
      - In the master handlers, we stop hijacking the 'connectionCompleted'
        method for tasks to be performed (often send the full partition
        table) on handler switches.
      - The admin's 'bootstrapped' flag could have been removed earlier:
        race conditions can't happen since the AskNodeInformation packet
        was removed (commit d048a52d).
    • New --new-nid storage option for fast cloning · 27e3f620
      Julien Muchembled authored
      It is often faster to set up replicas by stopping a node (and any
      underlying database server like MariaDB) and doing a raw copy of the
      database (e.g. with rsync). So far, this required stopping the whole
      cluster and using tools like 'mysql' or 'sqlite3' to edit:
      - the 'pt' table in databases,
      - the 'config.nid' values of the new nodes.
      With this new option, if you already have 1 replica, you can set up
      new replicas with such fast raw copy, and without interruption of
      service. Obviously, this implies less redundancy during the operation.
  14. 21 Mar, 2019 1 commit
  15. 11 Mar, 2019 1 commit
  16. 26 Feb, 2019 1 commit
  17. 31 Dec, 2018 1 commit
  18. 05 Dec, 2018 1 commit
  19. 07 Nov, 2018 2 commits
  20. 07 Aug, 2018 1 commit
    • Use argparse instead of optparse · 9f1e4eef
      Julien Muchembled authored
      Besides the use of another module for option parsing, the main change is that
      there's no more Config class that mixes configuration for different components.
      Application classes now take a simple 'dict' with parsed values.
      The changes in 'neoctl' are somewhat ugly, because command-line options are not
      defined on the command-line class, but this component is likely to disappear
      in the future.
      It remains possible to pass options via a configuration file. The code is a bit
      complex but isolated in neo.lib.config.
      For SSL, the code may be simpler if we switch to a single --ssl option that
      takes 3 paths. This was not done, to avoid breaking compatibility. Hence the hack with
      an extra OptionList class in neo.lib.app.
      A new functional test tests the 'neomigrate' script, instead of just the
      internal API to migrate data.
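As an illustration of application classes taking a plain 'dict' of parsed values, here is a minimal sketch. It does not reproduce neo.lib.config or neo.lib.app: the `parse_master_args` function, the `Application` body, the program name and the default values are all hypothetical. Only the -b/--bind and -m/--masters option names come from this log.

```python
import argparse


def parse_master_args(argv):
    """Parse command-line options and return them as a plain dict."""
    parser = argparse.ArgumentParser(prog="master-sketch")
    # Defaults below are placeholders chosen for this example.
    parser.add_argument("-b", "--bind", default="127.0.0.1:10000",
                        help="address to listen on")
    parser.add_argument("-m", "--masters", default="",
                        help="space-separated addresses of other masters")
    # vars() turns the argparse.Namespace into a dict.
    return vars(parser.parse_args(argv))


class Application:
    """Hypothetical application class configured from a dict."""

    def __init__(self, config):
        self.bind = config["bind"]
        # An empty -m value gives an empty list of other masters.
        self.masters = config["masters"].split()


config = parse_master_args(["-b", "10.0.0.1:2051"])
app = Application(config)
```

Passing a dict keeps each application class independent of the parsing module, so the same class could also be fed values read from a configuration file.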
  21. 22 Jun, 2018 1 commit
    • Maximize resiliency by taking into account the topology of storage nodes · 97af23cc
      Julien Muchembled authored
      This commit adds a constraint when tweaking the partition table with replicas,
      so that cells of each partition are assigned as far as possible from each
      other, e.g. not on the same machine even if each one has several disks, and
      in any case not on the same storage device.
      Currently, the topology path of each node is automatically calculated by the
      storage backend. Both MySQL and SQLite return a 2-tuple (host, st_dev).
      To be improved:
      - Add a storage option to override the path: the 'tweak' algorithm can already
        handle topology paths of any length, so something like (room, machine, disk)
        could be done easily.
      - Write OS-specific code to determine the real hardware behind st_dev
        (e.g. 2 different 'st_dev' values may actually refer to the same disk,
         because of layers like partitioning, device-mapper, loop, btrfs subvolumes,
         and so on).
      - Make 'neoctl' report in some way if the PT is optimal. Meanwhile,
        if it isn't, the master only logs a WARNING during tweak.
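To make the (host, st_dev) idea concrete, here is a rough sketch using only the standard library. It is not the storage backends' code: `topology_path` and `common_prefix_len` are hypothetical helpers, and measuring closeness by the length of the shared prefix is an assumption made for this example.

```python
import os
import socket


def topology_path(database_path):
    """Return a (host, st_dev) tuple for the device holding the path."""
    return socket.gethostname(), os.stat(database_path).st_dev


def common_prefix_len(path_a, path_b):
    """Count leading components shared by two topology paths.

    0 means different machines (or rooms, for longer paths); a larger
    value means the two cells are closer and thus less resilient.
    """
    n = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        n += 1
    return n


# Works with paths of any length, e.g. (room, machine, disk).
same_machine = common_prefix_len(("m1", 2049), ("m1", 2065))
other_machine = common_prefix_len(("m1", 2049), ("m2", 2049))
```

A tweak that prefers pairs of cells with the smallest shared prefix would place replicas on different machines first, then on different disks of the same machine.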
  22. 29 Mar, 2018 1 commit
    • master: automatically discard feeding cells that get out-of-date · 3efbbfe3
      Julien Muchembled authored
      This is a follow-up of commit 2ca7c335,
      which changed 'tweak' not to discard readable cells too quickly.
      The scenario of a storage being lost while it has feeding cells was forgotten.
      These must be discarded immediately, otherwise we end up with more up-to-date
      cells than wanted. Without the change in outdate(), testSafeTweak would end
      with: UU.|U.U|UUU
      Once replication is optimized not to always restart checking cells from the
      - Remembering that an out-of-date cell was feeding could be a safer
        option, but it may not be worth the extra complexity.
      - Another possibility may be to replace the FEEDING state by an automatic
        partial tweak that only discards excess up-to-date cells whenever a cell
        becomes up-to-date.
  23. 02 Mar, 2018 3 commits
    • master: fix resumption of backup replication (internal or not) · 27229793
      Julien Muchembled authored
      Before, it waited for upstream activity until all partitions were touched.
      However, when upstream is idle, the backup cluster could remain stuck forever
      if it was interrupted while some cells were still late.
    • master: fix/simplify generation of TID · 7b2e6752
      Julien Muchembled authored
      The 'min_tid < new_tid' assertion failed when jumping to the past.
    • master: fix possible failure when reading data in a backup cluster with replicas · ca2f7061
      Julien Muchembled authored
      Given that:
      - read locks are only taken by transactions (not replication)
      - in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions
        are synchronized up to different tids
      there was a race condition with the master node replying to LastTransaction
      with a TID that may not be replicated yet by all replicas, potentially causing
      such replicas to reply OidDoesNotExist or OidNotFound if a client asks them for data
      too early.
      IOW, even if the cluster does contain the data up to `getBackupTid(max)`,
      it is only readable by NEO clients up to `getBackupTid(min)` as long as the
      cluster is in BACKINGUP state.
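The difference between the two values can be shown with a small model. This sketch is an assumption about the semantics, not NEO's implementation: `get_backup_tid` is a hypothetical stand-in that aggregates the TIDs of each partition's cells with `mean`, then takes the minimum across partitions, and TIDs are plain integers here.

```python
def get_backup_tid(partition_table, mean=max):
    """Return the smallest per-partition aggregate of cell backup TIDs.

    partition_table: one list of cell TIDs per partition (toy model).
    mean: aggregation applied inside each partition (max or min).
    """
    return min(mean(cell_tids) for cell_tids in partition_table)


# Partition 0 has one replica at TID 10 and a lagging one at TID 8.
# Partition 1 has both replicas at TID 9.
partition_table = [[10, 8], [9, 9]]

contained = get_backup_tid(partition_table, max)  # some replica has it
readable = get_backup_tid(partition_table, min)   # every replica has it
```

Here the cluster holds all data up to TID 9, but a client reading at 9 could hit the replica that only reached 8, which is why only the `min` value is safe to give to clients.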
  24. 12 Jun, 2017 1 commit
    • master: improve algorithm to tweak the partition table · 2ca7c335
      Julien Muchembled authored
      The most important change is that it does not discard readable cells too
      quickly anymore. A partition can now have multiple FEEDING cells, to avoid
      going below the wanted level of replication.
      The new algorithm is also better at minimizing the amount of replication.
  25. 12 May, 2017 1 commit
    • Remove packet timeouts · f6eb02b4
      Julien Muchembled authored
      Since it is no longer worth keeping track of the last connection activity
      (which, btw, ignored TCP ACKs, i.e. timeouts could theoretically be triggered
      before all the data were actually sent), the semantics of closeClient has also
      changed. Before this commit, the 1-minute timeout was reset whenever there was
      activity (connection still used as server). Now, it happens exactly 100 seconds
      after the connection is not used anymore as client.
  26. 10 May, 2017 1 commit
  27. 02 May, 2017 1 commit
    • master: fix identification of unknown masters · fbcf9c50
      Julien Muchembled authored
      This fixes the following crash:
        Traceback (most recent call last):
          File "neo/master/handlers/identification.py", line 94, in requestIdentification
            uuid = app.getNewUUID(uuid, address, node_type)
          File "neo/master/app.py", line 449, in getNewUUID
            assert uuid != self.uuid
  28. 27 Apr, 2017 1 commit
  29. 25 Apr, 2017 1 commit