1. 27 Apr, 2019 2 commits
    • Make the number of replicas modifiable when the cluster is running · ef5fc508
      neoctl gets a new command to change the number of replicas.
      
      The number of replicas becomes a new partition table attribute and
      like the PT id, it is stored in the config table. On the other side,
      the configuration value for the number of partitions is dropped,
      since it can be computed from the partition table, which is
      always stored in full.
      
      The -p/-r master options now only apply at database creation.
      
      Some implementation notes:
      
      - The protocol is slightly optimized in that the master now sends
        automatically the whole partition tables to the admin & client
        nodes upon connection, like for storage nodes.
        This makes the protocol more consistent, and the master is the
        only remaining node requesting partition tables, during recovery.
      
      - Some parts become tricky because app.pt can be None in more cases.
        For example, the extra condition in NodeManager.update
        (before app.pt.dropNode) was added for this is the reason.
        Or the 'loadPartitionTable' method (storage) that is not inlined
        because of unit tests.
        Overall, this commit simplifies more than it complicates.
      
      - In the master handlers, we stop hijacking the 'connectionCompleted'
        method for tasks to be performed (often send the full partition
        table) on handler switches.
      
      - The admin's 'bootstrapped' flag could have been removed earlier:
        race conditions can't happen since the AskNodeInformation packet
        was removed (commit d048a52d).
      Julien Muchembled committed
    • New --new-nid storage option for fast cloning · 27e3f620
      It is often faster to set up replicas by stopping a node (and any
      underlying database server like MariaDB) and do a raw copy of the
      database (e.g. with rsync). So far, it required to stop the whole
      cluster and use tools like 'mysql' or sqlite3' to edit:
      - the 'pt' table in databases,
      - the 'config.nid' values of the new nodes.
      
      With this new option, if you already have 1 replica, you can set up
      new replicas with such fast raw copy, and without interruption of
      service. Obviously, this implies less redundancy during the operation.
      Julien Muchembled committed
  2. 26 Apr, 2019 3 commits
  3. 11 Mar, 2019 1 commit
  4. 22 Jun, 2018 1 commit
    • Maximize resiliency by taking into account the topology of storage nodes · 97af23cc
      This commit adds a contraint when tweaking the partition table with replicas,
      so that cells of each partition are assigned as far as possible from each
      other, e.g. not on the same machine even if each one has several disks, and
      in any case not on the same storage device.
      
      Currently, the topology path of each node is automatically calculated by the
      storage backend. Both MySQL and SQLite return a 2-tuple (host, st_dev).
      To be improved:
      - Add a storage option to override the path: the 'tweak' algorithm can already
        handle topology paths of any length, so something like (room, machine, disk)
        could be done easily.
      - Write OS-specific code to determine the real hardware behind st_dev
        (e.g. 2 different 'st_dev' values may actually refer to the same disk,
         because of layers like partitioning, device-mapper, loop, btrfs subvolumes,
         and so on).
      - Make 'neoctl' report in some way if the PT is optimal. Meanwhile,
        if it isn't, the master only logs a WARNING during tweak.
      Julien Muchembled committed
  5. 21 Jun, 2018 1 commit
  6. 04 Jun, 2018 1 commit
  7. 30 May, 2018 1 commit
    • Optimize resumption of replication by starting from a greater TID · b3dd6973
      Although data that are already transferred aren't transferred again, checking
      that the data are there for a whole partition can still be a lot of work for
      big databases. This commit is a major performance improvement in that a storage
      node that gets disconnected for a short time now gets fully operational quite
      instantaneously because it only has to replicate the new data. Before, the time
      to recover depended on the size of the DB.
      
      For OUT_OF_DATE cells, the difficult part was that they are writable and
      can then contain holes, so we can't just take the last TID in trans/obj
      (we wrongly did that at the beginning, and then committed
      6b1f198f as a workaround). We solve that
      by storing up to where it was up-to-date: this value is initialized from
      the last TIDs in trans/obj when the state switches from UP_TO_DATE/FEEDING.
      
      There's actually one such OUT_OF_DATE TID per assigned cell (backends store
      these values in the 'pt' table). Otherwise, a cell that still has a lot to
      replicate would still cause all other cells to resume from the a very small
      TID, or even ZERO_TID; the worse case is when a new cell is assigned to a node
      (as a result of tweak).
      
      For UP_TO_DATE cells of a backup cluster, replication was resumed from the
      maximum TID at which all assigned cells are known to be fully replicated.
      Like for OUT_OF_DATE cells, the presence of a late cell could cause a lot of
      extra work for others, the worst case being when setting up a backup cluster
      (it always restarted from ZERO_TID as long as at least 1 cell was still empty).
      Because UP_TO_DATE cells are guaranteed to have no holes, there's no need to
      store extra information: we simply look at the last TIDs in trans/obj.
      We even handle trans & obj independently, to minimize the work in 1 table
      (i.e. trans since it's processed first) if the other is late (obj).
      
      There's a small change in the protocol so that OUT_OF_DATE enum value equals 0.
      This way, backends can store the OUT_OF_DATE TID (verbatim) in the same column
      as the cell state.
      
      Note about MySQL changes in commit ca58ccd7:
      what we did as a workaround is not one any more. Now, we do so much on Python
      side that it's unlikely we could reduce the number of queries using GROUP BY.
      We even stopped doing that for SQLite.
      Julien Muchembled committed
  8. 24 May, 2018 2 commits
  9. 15 May, 2018 1 commit
  10. 11 May, 2018 1 commit
  11. 07 May, 2018 2 commits
  12. 08 Jan, 2018 1 commit
    • storage: optimize storage layout of raw data for replication · f4dd4bab
      # Previous status
      
      The issue was that we had extreme storage fragmentation from the point of view
      of the replication algorithm, which processes one partition at a time.
      
      By using an autoincrement for the 'data' table, rows were ordered by the time
      at which they were added:
      - parts may be the result of replication -> ordered by partition, tid, oid
      - other rows are globally sorted by tid
      
      Which means that when scanning a given partition, many rows were skipped all
      the time:
      - if readahead is bigger enough, the efficiency is 1/N for a node with N
        partitions assigned
      - else, it is worse because it seeks all the time
      
      For huge databases, the replication was horribly slow, in particular from HDD.
      
      # Chosen solution
      
      This commit changes how ids are generated to somehow split 'data'
      per partition. The backend tracks 1 last id per assigned partition, where the
      16 higher bits contains the partition. Keep in mind that the value of id has no
      meaning and it's only chosen for performance reasons. IOW, a row can be
      referred by an oid of a partition different than the 16 higher bits of id:
      - there's no migration needed and the 16 higher bits of all existing rows are 0
      - in case of deduplication, a row can still be shared by different partitions
      
      Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement
      on existing databases.
      
      ## Downsides
      
      On insertion, increasing the number of partitions now slows down significantly:
      for 2 nodes using TokuDB, 4% for 180 partitions, 40% for 2000. For 12
      partitions, the difference remains negligible. The solution for this issue will
      be to enable to increase the number of partitions efficiently, so that nodes
      can keep a small number of them, even for DB that are expected to grow so much
      that many nodes are added over time: such feature was already considered so
      that users don't have to worry anymore about this obscure setting at database
      creation.
      
      Read performance is only slowed down for applications that read a lot of data
      that were written contiguously, but split in small blocks. A solution is to
      extend ZODB so that the application tells it to chose new oids that will end up
      in the same partition. Like for insertion, there should not be too many
      partitions.
      
      With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to
      collect all last ids at startup when there are many partitions.
      
      ## Other advantages
      
      - The storage layout of data is now always the same and does not depend on
        whether rows came from replication or commits.
      - Efficient deletion of partition to free space in-place will be possible.
      
      # Considered alternative
      
      The only serious alternative was to replicate as many partitions as possible at
      the same time, ideally all assigned partitions, but it's not always possible.
      For best performance, it would often require to synchronize new nodes, or even
      all of them, so that thesource nodes don't have to scan 'data' several times.
      
      If existing nodes are kept, all data that aren't copied to the newly added
      nodes have to be skipped. If the number of nodes is multiplied by N, the
      efficiency is 1-1/N at best (synchronized nodes), else it's even worse
      because partitions are somehow shuffled.
      
      Checking/replacing a single node would remain slow when there are several
      source nodes.
      
      At last, such an algorithm would be much more complex and we would not have the
      other advantages listed above.
      Julien Muchembled committed
  13. 05 Jan, 2018 4 commits
  14. 15 Dec, 2017 1 commit
  15. 13 Dec, 2017 1 commit
  16. 05 Dec, 2017 1 commit
  17. 07 Nov, 2017 1 commit
  18. 11 Jul, 2017 1 commit
  19. 15 Jun, 2017 1 commit
  20. 12 Jun, 2017 1 commit
  21. 12 May, 2017 1 commit
  22. 31 Mar, 2017 2 commits
  23. 23 Mar, 2017 1 commit
  24. 17 Mar, 2017 1 commit
  25. 14 Mar, 2017 1 commit
  26. 02 Mar, 2017 1 commit
  27. 21 Feb, 2017 1 commit
    • Implement deadlock avoidance · 092992db
      This is a first version with several optimizations possible:
      - improve EventQueue (or implement a specific queue) to minimize deadlocks
      - turn the RebaseObject packet into a notification
      
      Sorting oids could also be useful to reduce the probability of deadlocks,
      but that would never be enough to avoid them completely, even if there's a
      single storage. For example:
      
      1. C1 does a first store (x or y)
      2. C2 stores x and y; one is delayed
      3. C1 stores the other -> deadlock
         When solving the deadlock, the data of the first store may only
         exist on the storage.
      
      2 functional tests are removed because they're redundant,
      either with ZODB tests or with the new threaded tests.
      Julien Muchembled committed
  28. 18 Jan, 2017 1 commit
  29. 26 Dec, 2016 3 commits