- 17 May, 2018 1 commit
-
-
Julien Muchembled authored
- for FileStorage DB, make sure a transaction index is built at most once
- for other DB types, reopen the DB in the subprocess

Now that we have specific code for FileStorage, the generic case is not tested anymore. We should add a test using ZEO. Or better, and in some way crazy, one with NEO, but one would need to fix a special case in getObject.
-
- 16 May, 2018 5 commits
-
-
Julien Muchembled authored
The protocol version is increased to ensure that client nodes are able to handle an empty 'extension' field in AnswerTransactionInformation. It also means that once new transactions are written, going back to a previous revision is not possible.
-
Julien Muchembled authored
The correct way to specify a start/stop tid is when constructing the 'source' object, hence the removal of the start/stop args. In fact, source.iterator() does not always take such args. On the other hand, when resuming an import, Application.importFrom must cope with an incomplete preindex.
-
Julien Muchembled authored
Same as previous commit: only cosmetics so optional.
-
Julien Muchembled authored
'title' means both process name and command line. This is cosmetics so it won't fail if the 'setproctitle' module is not available.
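A minimal sketch of treating 'setproctitle' as an optional dependency. The helper name `set_title` is hypothetical and only illustrates the idea; the actual NEO code may be organized differently.

```python
try:
    # Third-party module: it may not be installed everywhere.
    from setproctitle import setproctitle
except ImportError:
    setproctitle = None


def set_title(title):
    """Set the process title if possible and report whether it was set.

    Hypothetical helper: changing the title is purely cosmetic, so a
    missing module is silently ignored instead of raising.
    """
    if setproctitle is None:
        return False
    setproctitle(title)
    return True
```

A caller can invoke `set_title(...)` unconditionally and ignore the result.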
-
Julien Muchembled authored
A new subprocess is used to:
- fetch data from the source DB
- repickle to change oids (when merging several DB)
- compress
- checksum

This is mostly useful for the second step, which is much slower than any other step and does not release the GIL.

By using a second CPU core, it is also often possible to use a better compression algorithm for free (e.g. zlib=9). Actually, smaller data can speed up the writing process.

In addition to greatly speeding up the import by parallelizing fetch+process with write, it also makes the main process more reactive to queries from client nodes.
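The steps above can be sketched as follows. This is a simplified illustration, not the Importer code: the repickling step is omitted, all function names are hypothetical, and it assumes a SHA-1 checksum computed over the stored (possibly compressed) payload.

```python
import hashlib
import zlib
from multiprocessing import Process, Queue


def process_record(data, level=9):
    """Compress and checksum one record (hypothetical helper).

    Returns (is_compressed, checksum, payload). The compressed form is
    kept only if it is actually smaller than the original data.
    """
    compressed = zlib.compress(data, level)
    if len(compressed) < len(data):
        payload, is_compressed = compressed, True
    else:
        payload, is_compressed = data, False
    return is_compressed, hashlib.sha1(payload).digest(), payload


def worker(records, queue):
    """Run in the subprocess: the CPU-bound work happens here."""
    for data in records:
        queue.put(process_record(data))
    queue.put(None)  # sentinel: no more records


def import_all(records):
    """Main process: consume processed records while the worker runs."""
    queue = Queue(maxsize=16)  # bounded so the worker cannot run far ahead
    proc = Process(target=worker, args=(records, queue))
    proc.start()
    written = []
    while True:
        item = queue.get()
        if item is None:
            break
        written.append(item)  # stands in for writing to the backend
    proc.join()
    return written


if __name__ == "__main__":
    for is_compressed, checksum, payload in import_all([b"a" * 1000, b"x"]):
        print(is_compressed, checksum.hex(), len(payload))
```

Because the worker is a separate process, the compression and checksum work does not compete with the main process for the GIL.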
-
- 15 May, 2018 1 commit
-
-
Julien Muchembled authored
By doing the work with secondary connections to the underlying databases, asynchronously and in a separate process, this should have minimal impact on the performance of the storage node. Extra complexity comes from backends that may lose connection to the database (here MySQL): this commit fully implements reconnection.
-
- 11 May, 2018 3 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
For FileStorage DB, this avoids:
- keeping a lock on the source DB during the whole import,
- saving the whole index when the import was resumed.
-
Julien Muchembled authored
-
- 07 May, 2018 4 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 18 Apr, 2018 3 commits
-
-
Julien Muchembled authored
It was disabled by mistake in commit fd80cc30.
-
Julien Muchembled authored
- Stop using NEO source code as sample data.
- For ZODB5, add a test that does not merge several DB.
-
Julien Muchembled authored
-
- 16 Apr, 2018 3 commits
-
-
Julien Muchembled authored
In the Importer storage backend, the repickler code never really worked with ZODB 5 (use of protocol > 1), and now the test does not pass anymore.

The other issues caused by ZODB commit 12ee41c47310156027a674932df34b60de86ba36 are fixed:

  TypeError: list indices must be integers, not binary
  ValueError: unsupported pickle protocol: 3

Although not necessary as long as we don't support Python 3, this commit also replaces `str` by `bytes` in a few places.
-
Julien Muchembled authored
-
Julien Muchembled authored
When importing a FileStorage DB without interruption and without having to serve client nodes, the index built by speedupFileStorageTxnLookup is useless. This happens when doing simulation tests, and on a DB with many oids, building it can take a lot of time and memory for nothing.
-
- 13 Apr, 2018 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
This was forgotten in commit 5de0ff3a.
-
- 12 Apr, 2018 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
The Importer storage backend already does this.
-
- 10 Apr, 2018 1 commit
-
-
Julien Muchembled authored
This fixes a random failure in testSafeTweak:

  failureException: 'UU.|U.U|.UU' != 'UU.|.UU|U.U'
-
- 29 Mar, 2018 2 commits
-
-
Julien Muchembled authored
This is a follow-up of commit 2ca7c335, which changed 'tweak' not to discard readable cells too quickly.

The scenario of a storage being lost while it has feeding cells was forgotten. These must be discarded immediately, otherwise we end up with more up-to-date cells than wanted. Without the change in outdate(), testSafeTweak would end with:

  UU.|U.U|UUU

Once replication is optimized not to always restart checking cells from the beginning:
- Remembering that an out-of-date cell was feeding could be a safer option, but it may not be worth the extra complexity.
- Another possibility may be to replace the FEEDING state by an automatic partial tweak that only discards surplus up-to-date cells whenever a cell becomes up-to-date.
-
Julien Muchembled authored
-
- 20 Mar, 2018 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 14 Mar, 2018 1 commit
-
-
Julien Muchembled authored
For records that undo object creation, None values are used at the backend level whereas the protocol is not designed to serialize None for any field. Therefore, a dance is done in many places around packet serialization, using the specific 0/ZERO_HASH/'' triplet to represent a deleted oid.

For replication, it was missing at the sender side, leading to the following crash:

Traceback (most recent call last):
  File "neo/storage/app.py", line 147, in run
    self._run()
  File "neo/storage/app.py", line 178, in _run
    self.doOperation()
  File "neo/storage/app.py", line 257, in doOperation
    next(task_queue[-1]) or task_queue.rotate()
  File "neo/storage/handlers/storage.py", line 271, in push
    conn.send(Packets.AddObject(oid, *object), msg_id)
  File "neo/lib/protocol.py", line 234, in __init__
    self._fmt.encode(buf.write, args)
  File "neo/lib/protocol.py", line 345, in encode
    return self._trace(self._encode, writer, items)
  File "neo/lib/protocol.py", line 334, in _trace
    return method(*args)
  File "neo/lib/protocol.py", line 367, in _encode
    item.encode(writer, value)
  File "neo/lib/protocol.py", line 345, in encode
    return self._trace(self._encode, writer, items)
  File "neo/lib/protocol.py", line 342, in _trace
    raise ParseError(self, trace)
ParseError: at add_object/checksum:
  File "neo/lib/protocol.py", line 553, in _encode
    assert len(checksum) == 20, (len(checksum), checksum)
TypeError: object of type 'NoneType' has no len()
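The conversion around packet serialization can be sketched as below. The function names `to_wire` and `from_wire` are hypothetical, and the 20-byte zero value for ZERO_HASH is an assumption based on the `len(checksum) == 20` assertion in the traceback.

```python
# Assumption: a checksum is 20 bytes long (SHA-1 sized), as suggested by
# the assertion in the traceback; ZERO_HASH is then 20 zero bytes.
ZERO_HASH = b"\0" * 20


def to_wire(compression, checksum, data):
    """Replace the backend's None values by a serializable triplet."""
    if data is None:
        # Record that undoes object creation: the oid is deleted.
        return 0, ZERO_HASH, b""
    return compression, checksum, data


def from_wire(compression, checksum, data):
    """Inverse conversion, done on the receiving side."""
    if (compression, checksum, data) == (0, ZERO_HASH, b""):
        return None, None, None
    return compression, checksum, data
```

The triplet cannot be confused with a real empty record, because the SHA-1 digest of an empty string is not all zeros.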
-
- 13 Mar, 2018 1 commit
-
-
Julien Muchembled authored
-
- 02 Mar, 2018 3 commits
-
-
Julien Muchembled authored
Before, it waited for upstream activity until all partitions were touched. However, when upstream is idle, the backup cluster could remain stuck forever if it was interrupted while some cells were still late.
-
Julien Muchembled authored
The 'min_tid < new_tid' assertion failed when jumping to the past.
-
Julien Muchembled authored
Given that:
- read locks are only taken by transactions (not replication)
- in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions are synchronized up to different tids

there was a race condition with the master node replying to LastTransaction with a TID that may not be replicated yet by all replicas, potentially causing such replicas to reply OidDoesNotExist or OidNotFound if a client asks them for data too early.

IOW, even if the cluster does contain the data up to `getBackupTid(max)`, it is only readable by NEO clients up to `getBackupTid(min)` as long as the cluster is in BACKINGUP state.
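As an illustration of the min/max distinction, here is a small sketch. The standalone function and its dict argument are hypothetical; only the `getBackupTid(min)` / `getBackupTid(max)` call shape comes from the text above.

```python
def get_backup_tid(partition_tids, mean=max):
    """Aggregate the tids up to which each partition is replicated.

    `partition_tids` maps a partition number to the last tid replicated
    for it (hypothetical structure, for illustration only).
    """
    return mean(partition_tids.values())


# Three partitions synchronized up to different tids.
tids = {0: 10, 1: 7, 2: 9}
contained_up_to = get_backup_tid(tids, max)  # some data exists up to here
readable_up_to = get_backup_tid(tids, min)   # safe to read on any partition
```

Answering LastTransaction with the `max` value would let a client ask partition 1 for data at tid 10, which it does not hold yet.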
-
- 17 Jan, 2018 1 commit
-
-
Kirill Smelkov authored
Usage of supportsTransactionalUndo() was removed from ZODB in 2007 - see e.g. the following commits:

https://github.com/zopefoundation/ZODB/commit/a06bfc03
https://github.com/zopefoundation/ZODB/commit/e667b022
https://github.com/zopefoundation/ZODB/commit/f595f7e7
...

/reviewed-by @vpelletier
/reviewed-on nexedi/neoppod!8
-
- 11 Jan, 2018 1 commit
-
-
Julien Muchembled authored
The issue was that at startup, or after nodes are back, the previous code prevented full load balancing until some data are written. It was like this to limit the number of connections, which does not matter anymore (see commit 77132157).
-
- 08 Jan, 2018 1 commit
-
-
Julien Muchembled authored
# Previous status

The issue was that we had extreme storage fragmentation from the point of view of the replication algorithm, which processes one partition at a time.

By using an autoincrement for the 'data' table, rows were ordered by the time at which they were added:
- parts may be the result of replication -> ordered by partition, tid, oid
- other rows are globally sorted by tid

This means that when scanning a given partition, many rows were skipped all the time:
- if readahead is big enough, the efficiency is 1/N for a node with N partitions assigned
- else, it is worse because it seeks all the time

For huge databases, the replication was horribly slow, in particular from HDD.

# Chosen solution

This commit changes how ids are generated to somehow split 'data' per partition. The backend tracks 1 last id per assigned partition, where the 16 higher bits contain the partition.

Keep in mind that the value of id has no meaning and it's only chosen for performance reasons. IOW, a row can be referred to by an oid of a partition different from the 16 higher bits of id:
- there's no migration needed and the 16 higher bits of all existing rows are 0
- in case of deduplication, a row can still be shared by different partitions

Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement on existing databases.

## Downsides

On insertion, increasing the number of partitions now slows down significantly: for 2 nodes using TokuDB, 4% for 180 partitions, 40% for 2000. For 12 partitions, the difference remains negligible. The solution for this issue will be to make it possible to increase the number of partitions efficiently, so that nodes can keep a small number of them, even for DB that are expected to grow so much that many nodes are added over time: such a feature was already considered so that users don't have to worry anymore about this obscure setting at database creation.

Read performance is only slowed down for applications that read a lot of data that were written contiguously, but split in small blocks. A solution is to extend ZODB so that the application tells it to choose new oids that will end up in the same partition. Like for insertion, there should not be too many partitions.

With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to collect all last ids at startup when there are many partitions.

## Other advantages

- The storage layout of data is now always the same and does not depend on whether rows came from replication or commits.
- Efficient deletion of partition to free space in-place will be possible.

# Considered alternative

The only serious alternative was to replicate as many partitions as possible at the same time, ideally all assigned partitions, but it's not always possible. For best performance, it would often require synchronizing new nodes, or even all of them, so that the source nodes don't have to scan 'data' several times. If existing nodes are kept, all data that aren't copied to the newly added nodes have to be skipped. If the number of nodes is multiplied by N, the efficiency is 1-1/N at best (synchronized nodes), else it's even worse because partitions are somehow shuffled. Checking/replacing a single node would remain slow when there are several source nodes.

Finally, such an algorithm would be much more complex and we would not have the other advantages listed above.
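The id scheme of the chosen solution can be sketched as follows. This is an illustration under assumptions, not the backend code: it assumes a 64-bit id (so the 16 higher bits start at bit 48), and the class and function names are hypothetical.

```python
# Assumption: ids are 64-bit integers, so "the 16 higher bits" are
# bits 48..63 and the remaining 48 bits hold a per-partition counter.
PARTITION_SHIFT = 48


class DataIdAllocator:
    """Track one last id per assigned partition (hypothetical class)."""

    def __init__(self, partitions):
        # A real backend would collect the last used ids at startup;
        # here every partition starts with an empty counter.
        self._last = {p: p << PARTITION_SHIFT for p in partitions}

    def next_id(self, partition):
        self._last[partition] += 1
        return self._last[partition]


def partition_of_id(data_id):
    """Partition stored in the higher bits (0 for pre-existing rows)."""
    return data_id >> PARTITION_SHIFT


alloc = DataIdAllocator([0, 3])
a = alloc.next_id(3)
b = alloc.next_id(3)
c = alloc.next_id(0)
# Sorting by id now groups rows by partition: c (partition 0), then a, b.
```

Since rows are clustered by id, scanning one partition reads a contiguous id range instead of skipping rows that belong to other partitions.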
-
- 05 Jan, 2018 3 commits
-
-
Julien Muchembled authored
For existing DB, altering the table may be doable with schema editing and clean up of sqlite_sequence.
-
Julien Muchembled authored
-
Julien Muchembled authored
getObject becomes faster because it does not use the secondary index anymore, only the primary one. This frees RAM during normal operation.

For MySQL, DatabaseManager._getObject is sped up by ~3% for in-memory loads. An improvement of ~1% from ERP5 was also measured for IO-bound loads.

On insertion, the fast index is (`partition`, tid, oid) because we almost always insert lines with increasing tid, whereas oid values are more random. Although the value (data_id+value_tid) is moved from the fast to the slow index, this should have little impact on performance because the value size is quite small compared to the key.

The impact on replication should also be negligible:
- a little faster when there's no oid to replicate: only the secondary index, smaller, is scanned
- otherwise: the (slightly) bigger index is scanned randomly

On disk usage, an increase of ~4% was observed for TokuDB. Less compressibility? Any link with https://jira.percona.com/browse/TDB-86?
-