- 27 Apr, 2019 2 commits
-
-
Julien Muchembled authored
neoctl gets a new command to change the number of replicas. The number of replicas becomes a new partition table attribute and, like the PT id, it is stored in the config table. Conversely, the configuration value for the number of partitions is dropped, since it can be computed from the partition table, which is always stored in full. The -p/-r master options now only apply at database creation.
Some implementation notes:
  - The protocol is slightly optimized in that the master now automatically sends the whole partition table to the admin & client nodes upon connection, like for storage nodes. This makes the protocol more consistent, and the master is the only remaining node requesting partition tables, during recovery.
  - Some parts become tricky because app.pt can be None in more cases. For example, this is the reason for the extra condition in NodeManager.update (before app.pt.dropNode). Another example is the 'loadPartitionTable' method (storage), which is not inlined because of unit tests. Overall, this commit simplifies more than it complicates.
  - In the master handlers, we stop hijacking the 'connectionCompleted' method for tasks to be performed on handler switches (often sending the full partition table).
  - The admin's 'bootstrapped' flag could have been removed earlier: race conditions can't happen since the AskNodeInformation packet was removed (commit d048a52d).
-
Julien Muchembled authored
It is often faster to set up replicas by stopping a node (and any underlying database server like MariaDB) and doing a raw copy of the database (e.g. with rsync). So far, this required stopping the whole cluster and using tools like 'mysql' or 'sqlite3' to edit:
  - the 'pt' table in databases,
  - the 'config.nid' values of the new nodes.
With this new option, if you already have 1 replica, you can set up new replicas with such a fast raw copy, without interruption of service. Obviously, this implies less redundancy during the operation.
-
- 26 Apr, 2019 3 commits
-
-
Julien Muchembled authored
--kill-mysqld should be combined with something like -f .3 -r .1 to give storage nodes enough time to recover, and also with -D 0 to focus testing on the storage backend rather than NEO.
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 16 Apr, 2019 3 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 05 Apr, 2019 3 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
This fixes up commit be839e92.
-
Julien Muchembled authored
-
- 21 Mar, 2019 1 commit
-
-
Julien Muchembled authored
This is not used currently.
-
- 16 Mar, 2019 1 commit
-
-
Julien Muchembled authored
If the source DB is lost during the import and then restored from a backup, all new transactions have to be written back again on resume. This is the most common case in which the writeback hits the maximum number of transactions per partition to process at each iteration; the previous code was buggy in that it could skip transactions.
-
- 11 Mar, 2019 1 commit
-
-
Julien Muchembled authored
-
- 31 Dec, 2018 1 commit
-
-
Julien Muchembled authored
This makes commit 3c7a3160 (storage: speed up reads by indexing 'obj' primarily by 'oid') effective for SQLite. The fake changes in test data are because we don't force upgrade for this optimization.
-
- 07 Aug, 2018 1 commit
-
-
Julien Muchembled authored
Besides the use of another module for option parsing, the main change is that there's no more Config class that mixes configuration for different components. Application classes now take a simple 'dict' with parsed values. The changes in 'neoctl' are somewhat ugly, because command-line options are not defined on the command-line class, but this component is likely to disappear in the future.
It remains possible to pass options via a configuration file. The code is a bit complex but isolated in neo.lib.config.
For SSL, the code might be simpler if we switched to a single --ssl option that takes 3 paths. This is not done, to avoid breaking compatibility. Hence the hack with an extra OptionList class in neo.lib.app.
A new functional test tests the 'neomigrate' script, instead of just the internal API to migrate data.
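The "simple dict of parsed values" idea can be sketched as follows. This is only an illustration: it assumes the option parser is argparse (the message only says "another module"), and the option names and the Application class are hypothetical, not NEO's actual ones.

```python
import argparse


def parse_options(argv):
    """Parse command-line options and return them as a plain dict."""
    parser = argparse.ArgumentParser(description="hypothetical storage node")
    # Hypothetical options, for illustration only.
    parser.add_argument("--cluster", required=True, help="cluster name")
    parser.add_argument("--masters", default="127.0.0.1:10000",
                        help="space-separated list of master addresses")
    # vars() turns the argparse Namespace into an ordinary dict.
    return vars(parser.parse_args(argv))


class Application(object):
    """Hypothetical application class: it only sees a dict, not a parser."""

    def __init__(self, config):
        self.cluster = config["cluster"]
        self.masters = config["masters"].split()


config = parse_options(["--cluster", "test", "--masters", "a:1 b:2"])
app = Application(config)
```

With this shape, each component only reads the keys it needs, and tests can build an application from a literal dict without going through the command line.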
-
- 22 Jun, 2018 1 commit
-
-
Julien Muchembled authored
This commit adds a constraint when tweaking the partition table with replicas, so that cells of each partition are assigned as far as possible from each other, e.g. not on the same machine even if each one has several disks, and in any case not on the same storage device.
Currently, the topology path of each node is automatically calculated by the storage backend. Both MySQL and SQLite return a 2-tuple (host, st_dev).
To be improved:
  - Add a storage option to override the path: the 'tweak' algorithm can already handle topology paths of any length, so something like (room, machine, disk) could be done easily.
  - Write OS-specific code to determine the real hardware behind st_dev (e.g. 2 different 'st_dev' values may actually refer to the same disk, because of layers like partitioning, device-mapper, loop, btrfs subvolumes, and so on).
  - Make 'neoctl' report in some way if the PT is optimal. Meanwhile, if it isn't, the master only logs a WARNING during tweak.
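To make "as far as possible from each other" concrete, here is a minimal sketch that compares topology paths by the length of their common prefix. It is not the actual 'tweak' algorithm; the function names and sample paths are hypothetical, and only the (host, st_dev) shape comes from the message above.

```python
def shared_prefix_len(path_a, path_b):
    """Number of leading components two topology paths have in common."""
    n = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        n += 1
    return n


def pick_farthest(candidates, assigned):
    """Pick the candidate path sharing the least with already assigned paths."""
    def closeness(path):
        # Worst case: the longest prefix shared with any assigned cell.
        return max([shared_prefix_len(path, other) for other in assigned] or [0])
    return min(candidates, key=closeness)


# (host, st_dev) paths; the values are made up for the example.
assigned = [("host1", 2049)]
candidates = [("host1", 2049), ("host1", 2065), ("host2", 2049)]
best = pick_farthest(candidates, assigned)
```

Because the comparison works on tuples of any length, the same sketch would also handle longer paths such as (room, machine, disk).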
-
- 21 Jun, 2018 1 commit
-
-
Julien Muchembled authored
-
- 04 Jun, 2018 1 commit
-
-
Julien Muchembled authored
-
- 30 May, 2018 2 commits
-
-
Julien Muchembled authored
Although data that are already transferred aren't transferred again, checking that the data are there for a whole partition can still be a lot of work for big databases. This commit is a major performance improvement in that a storage node that gets disconnected for a short time now becomes fully operational almost instantly, because it only has to replicate the new data. Before, the time to recover depended on the size of the DB.
For OUT_OF_DATE cells, the difficult part was that they are writable and can therefore contain holes, so we can't just take the last TID in trans/obj (we wrongly did that at the beginning, and then committed 6b1f198f as a workaround). We solve that by storing up to where the cell was up-to-date: this value is initialized from the last TIDs in trans/obj when the state switches from UP_TO_DATE/FEEDING. There's actually one such OUT_OF_DATE TID per assigned cell (backends store these values in the 'pt' table). Otherwise, a cell that still has a lot to replicate would still cause all other cells to resume from a very small TID, or even ZERO_TID; the worst case is when a new cell is assigned to a node (as a result of tweak).
For UP_TO_DATE cells of a backup cluster, replication was resumed from the maximum TID at which all assigned cells are known to be fully replicated. Like for OUT_OF_DATE cells, the presence of a late cell could cause a lot of extra work for others, the worst case being when setting up a backup cluster (it always restarted from ZERO_TID as long as at least 1 cell was still empty). Because UP_TO_DATE cells are guaranteed to have no holes, there's no need to store extra information: we simply look at the last TIDs in trans/obj. We even handle trans & obj independently, to minimize the work in 1 table (i.e. trans since it's processed first) if the other is late (obj).
There's a small change in the protocol so that the OUT_OF_DATE enum value equals 0. This way, backends can store the OUT_OF_DATE TID (verbatim) in the same column as the cell state.
Note about MySQL changes in commit ca58ccd7: what we did there as a workaround is no longer one. Now, we do so much on the Python side that it's unlikely we could reduce the number of queries using GROUP BY. We even stopped doing that for SQLite.
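One way a single integer column could hold both the cell state and the OUT_OF_DATE TID is sketched below. The message does not spell out the actual encoding, so this is an assumption: non-negative values are TIDs of OUT_OF_DATE cells, and other states are stored negated. The enum values other than OUT_OF_DATE = 0 are hypothetical.

```python
# Hypothetical enum values; the text only guarantees OUT_OF_DATE == 0.
OUT_OF_DATE = 0
UP_TO_DATE = 1
FEEDING = 2


def encode_cell(state, tid=None):
    """Pack a cell state and its optional OUT_OF_DATE TID into one integer."""
    if state == OUT_OF_DATE:
        if tid is None or tid < 0:
            raise ValueError("OUT_OF_DATE cells need a non-negative TID")
        return tid      # stored verbatim
    return -state       # strictly negative, because state != 0


def decode_cell(value):
    """Inverse of encode_cell: return (state, tid)."""
    if value >= 0:
        return OUT_OF_DATE, value
    return -value, None
```

Under this assumption, having OUT_OF_DATE equal 0 matters because the negated value of every other state is strictly negative, so it can never be confused with a TID, not even with a TID of 0.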
-
Julien Muchembled authored
-
- 24 May, 2018 4 commits
-
-
Julien Muchembled authored
Future migration steps are likely to alter tables, possibly with transformation of data, and this is complicated for both supported backends.
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 17 May, 2018 1 commit
-
-
Julien Muchembled authored
  - for FileStorage DB, make sure a transaction index is built at most once
  - for other DB types, reopen the DB in the subprocess
Now that we have specific code for FileStorage, the generic case is not tested anymore. We should add a test using ZEO. Or better, and in some way crazy, one with NEO, but one would need to fix a special case in getObject.
-
- 16 May, 2018 3 commits
-
-
Julien Muchembled authored
The protocol version is increased to ensure that client nodes are able to handle an empty 'extension' field in AnswerTransactionInformation. It also means that once new transactions are written, going back to a previous revision is not possible.
-
Julien Muchembled authored
'title' means both process name and command line. This is cosmetic, so it won't fail if the 'setproctitle' module is not available.
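A common way to keep such a cosmetic dependency optional is a guarded import with a no-op fallback. The sketch below assumes that pattern; the helper name and the title format are hypothetical, and only the 'setproctitle' module itself comes from the message.

```python
try:
    from setproctitle import setproctitle
except ImportError:
    def setproctitle(title):
        """Fallback no-op: the title is purely cosmetic."""


def set_title(role, name):
    """Set a descriptive process title (hypothetical format) and return it."""
    title = "%s: %s" % (role, name)
    setproctitle(title)
    return title
```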
-
Julien Muchembled authored
A new subprocess is used to:
  - fetch data from the source DB
  - repickle to change oids (when merging several DBs)
  - compress
  - checksum
This is mostly useful for the second step, which is much slower than any other step and does not release the GIL. By using a second CPU core, it is also often possible to use a better compression algorithm for free (e.g. zlib=9). Actually, smaller data can speed up the writing process.
In addition to greatly speeding up the import by parallelizing fetch+process with write, it also makes the main process more reactive to queries from client nodes.
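The last two steps (compress, checksum) could look roughly like this for a single record. It is a sketch under assumptions: the use of SHA-1 and the rule of keeping the raw data when compression does not help are choices made for the example, and the fetch, repickle and subprocess plumbing are left out.

```python
import hashlib
import zlib


def process_record(data, level=9):
    """Compress a record and checksum what will actually be stored."""
    compressed = zlib.compress(data, level)
    # Assumption of this sketch: keep the raw data if compression does not help.
    if len(compressed) >= len(data):
        compressed, is_compressed = data, False
    else:
        is_compressed = True
    checksum = hashlib.sha1(compressed).digest()
    return is_compressed, checksum, compressed
```

Since this work is CPU-bound, running it outside the main process is what lets a high compression level be used without slowing down the writes.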
-
- 15 May, 2018 1 commit
-
-
Julien Muchembled authored
By doing the work with secondary connections to the underlying databases, asynchronously and in a separate process, this should have minimal impact on the performance of the storage node. Extra complexity comes from backends that may lose connection to the database (here MySQL): this commit fully implements reconnection.
-
- 11 May, 2018 2 commits
-
-
Julien Muchembled authored
For FileStorage DB, this avoids:
  - keeping a lock on the source DB during the whole import,
  - saving the whole index when the import was resumed.
-
Julien Muchembled authored
-
- 07 May, 2018 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 18 Apr, 2018 1 commit
-
-
Julien Muchembled authored
It was disabled by mistake in commit fd80cc30.
-
- 16 Apr, 2018 2 commits
-
-
Julien Muchembled authored
In the Importer storage backend, the repickler code never really worked with ZODB 5 (use of protocol > 1), and now the test does not pass anymore. The other issues caused by ZODB commit 12ee41c47310156027a674932df34b60de86ba36 are fixed:
  TypeError: list indices must be integers, not binary
  ValueError: unsupported pickle protocol: 3
Although not necessary as long as we don't support Python 3, this commit also replaces `str` by `bytes` in a few places.
-
Julien Muchembled authored
When importing a FileStorage DB without interruption and without having to serve client nodes, the index built by speedupFileStorageTxnLookup is useless. Such a case happens when doing simulation tests, and on a DB with many oids, building the index can take a lot of time and memory for nothing.
-
- 13 Apr, 2018 1 commit
-
-
Julien Muchembled authored
-
- 12 Apr, 2018 1 commit
-
-
Julien Muchembled authored
-
- 08 Jan, 2018 1 commit
-
-
Julien Muchembled authored
# Previous status
The issue was that we had extreme storage fragmentation from the point of view of the replication algorithm, which processes one partition at a time. By using an autoincrement for the 'data' table, rows were ordered by the time at which they were added:
  - parts may be the result of replication -> ordered by partition, tid, oid
  - other rows are globally sorted by tid
This means that when scanning a given partition, many rows were skipped all the time:
  - if readahead is big enough, the efficiency is 1/N for a node with N partitions assigned
  - else, it is worse because it seeks all the time
For huge databases, the replication was horribly slow, in particular from HDD.
# Chosen solution
This commit changes how ids are generated to somehow split 'data' per partition. The backend tracks 1 last id per assigned partition, where the 16 higher bits contain the partition. Keep in mind that the value of id has no meaning and it's only chosen for performance reasons. IOW, a row can be referred to by an oid of a partition different from the 16 higher bits of id:
  - there's no migration needed and the 16 higher bits of all existing rows are 0
  - in case of deduplication, a row can still be shared by different partitions
Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement on existing databases.
## Downsides
On insertion, a higher number of partitions now slows things down significantly: for 2 nodes using TokuDB, 4% for 180 partitions, 40% for 2000. For 12 partitions, the difference remains negligible. The solution for this issue will be to make it possible to increase the number of partitions efficiently, so that nodes can keep a small number of them, even for DBs that are expected to grow so much that many nodes are added over time: such a feature was already considered so that users don't have to worry anymore about this obscure setting at database creation.
Read performance is only slowed down for applications that read a lot of data that were written contiguously, but split in small blocks. A solution is to extend ZODB so that the application tells it to choose new oids that will end up in the same partition. Like for insertion, there should not be too many partitions.
With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to collect all last ids at startup when there are many partitions.
## Other advantages
  - The storage layout of data is now always the same and does not depend on whether rows came from replication or commits.
  - Efficient deletion of a partition to free space in-place will be possible.
# Considered alternative
The only serious alternative was to replicate as many partitions as possible at the same time, ideally all assigned partitions, but it's not always possible. For best performance, it would often require synchronizing new nodes, or even all of them, so that the source nodes don't have to scan 'data' several times. If existing nodes are kept, all data that aren't copied to the newly added nodes have to be skipped. If the number of nodes is multiplied by N, the efficiency is 1-1/N at best (synchronized nodes), else it's even worse because partitions are somehow shuffled. Checking/replacing a single node would remain slow when there are several source nodes.
Finally, such an algorithm would be much more complex and we would not have the other advantages listed above.
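The id scheme of the chosen solution can be sketched as follows. This is an illustration under assumptions, not the backend code: it assumes 64-bit ids (so the 16 higher bits start at bit 48), and the class and function names are hypothetical. A real backend would load the last ids from the database at startup instead of starting each range from scratch.

```python
PARTITION_SHIFT = 48  # assumes 64-bit ids: the 16 higher bits are bits 48..63


class DataIdAllocator(object):
    """Track one last id per assigned partition (hypothetical helper)."""

    def __init__(self, assigned_partitions):
        # For the sketch, each partition starts at the bottom of its id range.
        self.last_ids = {p: p << PARTITION_SHIFT for p in assigned_partitions}

    def new_id(self, partition):
        """Return the next id whose 16 higher bits hold the partition."""
        self.last_ids[partition] += 1
        return self.last_ids[partition]


def partition_of_id(data_id):
    """Partition stored in the 16 higher bits of an id (a locality hint only)."""
    return data_id >> PARTITION_SHIFT
```

Rows of the same partition get consecutive ids and therefore end up close to each other in the primary index, which is what makes a per-partition scan efficient. As noted above, the high bits remain a performance hint: existing rows keep 0 there, and deduplicated rows can be shared across partitions.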
-