neoppod:v1.9 commits
https://lab.nexedi.com/nexedi/neoppod/-/commits/v1.9
2018-03-13T19:10:07+01:00

https://lab.nexedi.com/nexedi/neoppod/-/commit/1b57a7ae0566dffc2e37f8b7d1d49e654254d701
Release version 1.9
2018-03-13T19:10:07+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/2722979342e5492262c15307ffdab05b9a15aaa7
master: fix resumption of backup replication (internal or not)
2018-03-02T18:33:56+01:00 Julien Muchembled <jm@nexedi.com>
Before, it waited for upstream activity until all partitions were touched.
However, when upstream is idle, the backup cluster could remain stuck forever
if it was interrupted while some cells were still late.

https://lab.nexedi.com/nexedi/neoppod/-/commit/7b2e6752561da38893a9f67408242df0718e8e58
master: fix/simplify generation of TID
2018-03-02T18:33:56+01:00 Julien Muchembled <jm@nexedi.com>
The 'min_tid < new_tid' assertion failed when jumping to the past.

https://lab.nexedi.com/nexedi/neoppod/-/commit/ca2f7061fbce046da26a4b7517932b2a985aa8f2
master: fix possible failure when reading data in a backup cluster with replicas
2018-03-02T18:33:53+01:00 Julien Muchembled <jm@nexedi.com>
Given that:
- read locks are only taken by transactions (not replication)
- in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions
are synchronized up to different tids
there was a race condition with the master node replying to LastTransaction
with a TID that may not be replicated yet by all replicas, potentially causing
such replicas to reply OidDoesNotExist or OidNotFound if a client asks them for data
too early.
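
The race above comes down to which TID gets advertised to clients. A minimal sketch, assuming the replication progress of each cell is tracked as a plain integer TID. The function names and the `backup_tids` mapping are hypothetical and are not NEO's API:

```python
def newest_backup_tid(backup_tids):
    """Highest TID held by at least one cell of the backup cluster."""
    return max(backup_tids.values())


def readable_backup_tid(backup_tids):
    """Highest TID that every cell has already replicated.

    Advertising anything above this value lets a client ask a late
    replica for data it does not have yet.
    """
    return min(backup_tids.values())


# Hypothetical progress: (partition, storage node) -> last replicated TID.
backup_tids = {
    (0, "S1"): 120,
    (0, "S2"): 95,   # this replica is late
    (1, "S1"): 110,
    (1, "S2"): 110,
}
print(newest_backup_tid(backup_tids))    # 120
print(readable_backup_tid(backup_tids))  # 95
```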
IOW, even if the cluster does contain the data up to `getBackupTid(max)`,
it is only readable by NEO clients up to `getBackupTid(min)` as long as the
cluster is in BACKINGUP state.

https://lab.nexedi.com/nexedi/neoppod/-/commit/f95f336a524fb41cb8cc54841a5b04fce0dd9cc0
client: kill .supportsTransactionalUndo()
2018-01-17T12:31:00+01:00 Kirill Smelkov <kirr@nexedi.com>
Usage of supportsTransactionalUndo() was removed from ZODB in 2007 - see
e.g. the following commits:
https://github.com/zopefoundation/ZODB/commit/a06bfc03
https://github.com/zopefoundation/ZODB/commit/e667b022
https://github.com/zopefoundation/ZODB/commit/f595f7e7
...
/reviewed-by @vpelletier
/reviewed-on https://lab.nexedi.com/nexedi/neoppod/merge_requests/8

https://lab.nexedi.com/nexedi/neoppod/-/commit/8dce4bbf83107fad96e4a5671337e9958f4220a5
client: for read accesses, pick a random good node, connected or not
2018-01-11T15:08:02+01:00 Julien Muchembled <jm@nexedi.com>
The issue was that at startup, or after nodes are back, the previous code
prevented full load balancing until some data are written.
It was like this to limit the number of connections, which does not matter
anymore (see commit 77132157).

https://lab.nexedi.com/nexedi/neoppod/-/commit/f4dd4bab839a5f5bf69f74fcf5a34e21f337a1fc
storage: optimize storage layout of raw data for replication
2018-01-08T11:30:36+01:00 Julien Muchembled <jm@nexedi.com>
# Previous status
The issue was that we had extreme storage fragmentation from the point of view
of the replication algorithm, which processes one partition at a time.
By using an autoincrement for the 'data' table, rows were ordered by the time
at which they were added:
- parts may be the result of replication -> ordered by partition, tid, oid
- other rows are globally sorted by tid
Which means that when scanning a given partition, many rows were skipped all
the time:
- if readahead is big enough, the efficiency is 1/N for a node with N
partitions assigned
- else, it is worse because it seeks all the time
For huge databases, the replication was horribly slow, in particular from HDD.
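
The 1/N figure can be checked with a toy model. This is only an illustrative sketch: it assumes rows are stored in commit order and that consecutive rows fall into different partitions in a round-robin fashion, which is a simplification of a real workload. The function name is made up for the example.

```python
def scan_efficiency(num_partitions, rows_per_partition, wanted):
    """Fraction of scanned rows that belong to the wanted partition
    when the whole table has to be read in storage order."""
    # Storage order = commit order; the partition of each row cycles
    # through all assigned partitions (simplifying assumption).
    total = num_partitions * rows_per_partition
    rows = [i % num_partitions for i in range(total)]
    useful = sum(1 for partition in rows if partition == wanted)
    return useful / len(rows)


# A node with 12 assigned partitions reads 12 rows to use 1 of them.
print(scan_efficiency(12, 1000, wanted=3))
```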
# Chosen solution
This commit changes how ids are generated to somehow split 'data'
per partition. The backend tracks 1 last id per assigned partition, where the
16 higher bits contain the partition. Keep in mind that the value of id has no
meaning and it's only chosen for performance reasons. IOW, a row can be
referred to by an oid of a partition different from the 16 higher bits of id:
- there's no migration needed and the 16 higher bits of all existing rows are 0
- in case of deduplication, a row can still be shared by different partitions
Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement
on existing databases.
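
The id scheme described above can be sketched as follows. This is a hedged illustration, not the backend code: it assumes ids are 64-bit integers, so that the 16 higher bits are bits 48 to 63, and the class and function names are hypothetical.

```python
PARTITION_SHIFT = 48  # assumption: 64-bit ids, partition in the 16 higher bits


class DataIdAllocator:
    """Tracks one last id per assigned partition."""

    def __init__(self, assigned_partitions):
        # Each sequence starts at partition << 48, so rows created for a
        # given partition get contiguous ids. A real backend would rather
        # initialize these values from the highest ids already stored.
        self._last_id = {p: p << PARTITION_SHIFT for p in assigned_partitions}

    def new_id(self, partition):
        self._last_id[partition] += 1
        return self._last_id[partition]


def partition_of(data_id):
    """Partition encoded in an id (0 for rows created before the change)."""
    return data_id >> PARTITION_SHIFT


allocator = DataIdAllocator([0, 3])
first = allocator.new_id(3)
second = allocator.new_id(3)
print(hex(first), hex(second), partition_of(first))
```

Since the value of an id carries no meaning, `partition_of` is only useful for reasoning about the layout: a row may still be referenced from another partition.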
## Downsides
On insertion, a higher number of partitions now causes a significant slowdown:
for 2 nodes using TokuDB, 4% for 180 partitions, 40% for 2000. For 12
partitions, the difference remains negligible. The solution for this issue will
be to make it possible to increase the number of partitions efficiently, so
that nodes can keep a small number of them, even for DBs that are expected to
grow so much that many nodes are added over time: such a feature was already
considered so that users don't have to worry anymore about this obscure setting
at database creation.
Read performance is only slowed down for applications that read a lot of data
that were written contiguously, but split in small blocks. A solution is to
extend ZODB so that the application tells it to choose new oids that will end up
in the same partition. Like for insertion, there should not be too many
partitions.
With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to
collect all last ids at startup when there are many partitions.
## Other advantages
- The storage layout of data is now always the same and does not depend on
whether rows came from replication or commits.
- Efficient deletion of a partition to free space in-place will be possible.
# Considered alternative
The only serious alternative was to replicate as many partitions as possible at
the same time, ideally all assigned partitions, but it's not always possible.
For best performance, it would often require synchronizing new nodes, or even
all of them, so that the source nodes don't have to scan 'data' several times.
If existing nodes are kept, all data that aren't copied to the newly added
nodes have to be skipped. If the number of nodes is multiplied by N, the
efficiency is 1-1/N at best (synchronized nodes), else it's even worse
because partitions are somehow shuffled.
Checking/replacing a single node would remain slow when there are several
source nodes.
Lastly, such an algorithm would be much more complex and we would not have the
other advantages listed above.

https://lab.nexedi.com/nexedi/neoppod/-/commit/7b497b8e3366cf68374bcf8ea3fbb81bbc0e2e18
sqlite: remove useless AUTOINCREMENT for data.id (reuse of deleted ids is fine)
2018-01-05T21:14:38+01:00 Julien Muchembled <jm@nexedi.com>
For existing DB, altering the table may be doable with schema editing and
clean up of sqlite_sequence.

https://lab.nexedi.com/nexedi/neoppod/-/commit/d289050eb19ce5a8c6584f64531033fcf8bc4644
storage: automatic upgrade of 'obj' table (change of indices)
2018-01-05T17:55:12+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/3c7a316043fcbe09f4e6c7982c1b13eb0aefe7b6
storage: speed up reads by indexing 'obj' primarily by 'oid' (instead of 'tid')
2018-01-05T15:23:02+01:00 Julien Muchembled <jm@nexedi.com>
getObject becomes faster because it no longer uses a secondary index, only
the primary one. This frees RAM during normal operation. For MySQL,
DatabaseManager._getObject is sped up by ~3% for in-memory loads.
An improvement of ~1% from ERP5 was also measured for IO-bound loads.
On insertion, the fast index is (`partition`, tid, oid) because we almost
always insert lines with increasing tid, whereas oid values are more random.
Although the value (data_id+value_tid) is moved from the fast to the slow index,
this should have little impact on performance because the value size is quite
small compared to the key.
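
The index layout discussed here can be illustrated with SQLite from Python. This is a sketch under assumptions, not the schema of any NEO backend: the `WITHOUT ROWID` clause is used so that the primary key is the clustered index holding the values, and the index name `obj_by_tid` and the helper `get_object` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE obj (
        "partition" INTEGER NOT NULL,
        oid INTEGER NOT NULL,
        tid INTEGER NOT NULL,
        data_id INTEGER,
        value_tid INTEGER,
        -- primary (clustered) index: ordered by oid first, holds the values
        PRIMARY KEY ("partition", oid, tid)
    ) WITHOUT ROWID;
    -- secondary index: ordered by tid first, so it grows at its end when
    -- rows are inserted with increasing tid
    CREATE INDEX obj_by_tid ON obj ("partition", tid, oid);
""")


def get_object(conn, partition, oid, before_tid):
    """Return (tid, data_id, value_tid) for the latest revision of oid
    strictly older than before_tid, or None if there is none."""
    return conn.execute(
        'SELECT tid, data_id, value_tid FROM obj'
        ' WHERE "partition" = ? AND oid = ? AND tid < ?'
        ' ORDER BY tid DESC LIMIT 1',
        (partition, oid, before_tid)).fetchone()


conn.executemany(
    "INSERT INTO obj VALUES (?, ?, ?, ?, ?)",
    [(0, 1, 10, 100, None), (0, 1, 20, 101, None), (0, 2, 15, 102, None)])
print(get_object(conn, 0, 1, 25))  # (20, 101, None)
print(get_object(conn, 0, 1, 20))  # (10, 100, None)
```

The read is a single range lookup in the primary index, while insertions with increasing tid mostly append to the secondary one.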
The impact on replication should also be negligible:
- a little faster when there's no oid to replicate: only the secondary index,
smaller, is scanned
- otherwise: the (slightly) bigger index is scanned randomly
On disk usage, an increase of ~4% was observed for TokuDB.
Less compressibility? Any link with https://jira.percona.com/browse/TDB-86 ?

https://lab.nexedi.com/nexedi/neoppod/-/commit/ca7acefc571cef72dab317a669332a82289fb635
storage: pass schema of tables to migration methods
2018-01-05T15:22:00+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/04f6d9c31f7fbf8a91a70969ed180e5eb1f6227a
storage: update backend version between each migration step
2018-01-05T15:22:00+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/875fc1b9376a4100acd00a8bdad7455d29a17fc1
debug: add helper to run code outside the signal handler
2018-01-05T15:08:20+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/a414f91f07ca4de045d31e536ea1610862ab2c83
Preserve 'packed' flag on import/iteration
2017-12-21T15:49:22+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/5abfa5fd23265d69b281ac41ff87f7f9ad382df0
fixup! storage: speed up replication by not getting object next_serial for no...
2017-12-15T20:59:25+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/121b3882b170bf9ffaa99c72f5362793ba5b5d56
Merge "client: fix accounting of cache size"
2017-12-15T15:28:20+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/5b02f44b4d66bd39065f11f151bb84a363cb04d0
client: fix accounting of cache size
2017-12-15T15:27:36+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/f2070ca445e3f43f05dbef07ab9572ac630b54fb
doc: comments, fixups
2017-12-13T11:58:42+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/c76b3a0a9097686665f06024f20c824671e4639b
client: account for cache hit/miss statistics
2017-12-11T12:03:43+01:00 Kirill Smelkov <kirr@nexedi.com>
This information is handy to see how well cache performs.
Amended by Julien Muchembled:
- do not abbreviate some existing field names in repr result (asking the
user to look at the source code in order to decipher logs is not nice)
- hit: change from %.1f to %.3g
- hit: hide it completely if nload is 0
- use __future__.division instead of adding more casts to float

https://lab.nexedi.com/nexedi/neoppod/-/commit/d1f524228c5775a4b0914d274f79df4f33e683e9
client: remove redundant information from cache's __repr__
2017-12-11T12:00:59+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/d83cb87262d4092d6e409c506d462487a6a2caba
cache: fix possible endless loop in __repr__/_iterQueue
2017-12-11T11:58:26+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/be839e92bf47e191c7de2cd5ca196da89cad7035
storage: speed up replication by not getting object next_serial for nothing
2017-12-05T18:11:13+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/c25e68bc415a67b535577756526f90b0688ec3ea
storage: speed up replication by sending bigger network packets
2017-12-05T14:25:12+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/96aeb71617c0824a78652713cc06738d3358ae41
neoctl: remove ignored option
2017-12-04T18:42:26+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/a1082cbc85746fad52ef10e7e4af8da03a3b020f
client: bug found, add log to collect more information
2017-11-21T18:12:01+01:00 Julien Muchembled <jm@nexedi.com>
INFO Z2 Log files reopened successfully
INFO SignalHandler Caught signal SIGTERM
INFO Z2 Shutting down fast
INFO ZServer closing HTTP to new connections
ERROR ZODB.Connection Couldn't load state for BTrees.LOBTree.LOBucket 0xc12e29
Traceback (most recent call last):
  File "ZODB/Connection.py", line 909, in setstate
    self._setstate(obj, oid)
  File "ZODB/Connection.py", line 953, in _setstate
    p, serial = self._storage.load(oid, '')
  File "neo/client/Storage.py", line 81, in load
    return self.app.load(oid)[:2]
  File "neo/client/app.py", line 355, in load
    data, tid, next_tid, _ = self._loadFromStorage(oid, tid, before_tid)
  File "neo/client/app.py", line 387, in _loadFromStorage
    askStorage)
  File "neo/client/app.py", line 297, in _askStorageForRead
    self.sync()
  File "neo/client/app.py", line 898, in sync
    self._askPrimary(Packets.Ping())
  File "neo/client/app.py", line 163, in _askPrimary
    return self._ask(self._getMasterConnection(), packet,
  File "neo/client/app.py", line 177, in _getMasterConnection
    result = self.master_conn = self._connectToPrimaryNode()
  File "neo/client/app.py", line 202, in _connectToPrimaryNode
    index = (index + 1) % len(master_list)
ZeroDivisionError: integer division or modulo by zero

https://lab.nexedi.com/nexedi/neoppod/-/commit/acef35717fd674dc5271c0d2baa62e48005d5a11
client: new 'cache-size' Storage option
2017-11-19T22:10:14+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/9c56e9cdd4becdb7ecf30759ad7deb06eb147242
doc: mention HTTPS URLs when possible
2017-11-17T21:45:52+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/e2e2d895d46f193afc8fc71b0447effa19992615
doc: update comment in neolog about Python issue 13773
2017-11-17T21:45:52+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/188c55f99622a42beebf75596dae1c53f5e1fc62
neolog: add support for xz-compressed logs, using external xzcat commands
2017-11-17T21:45:52+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/9ee4e04db5e7639888518887ac9c0ed3c2f4828c
neolog: --from option now also tries to parse with dateutil
2017-11-17T21:45:45+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/d6f61422f2f908362f985feb7a0ce335f67486c3
importer: do not crash if a backup cluster tries to replicate
2017-11-15T17:57:43+01:00 Julien Muchembled <jm@nexedi.com>
It's not possible yet to replicate a node that is importing data.
One must wait until the migration is finished.

https://lab.nexedi.com/nexedi/neoppod/-/commit/ca75709f8303ac0b5b6d6618e958b873d3f6c790
storage: disable data deduplication by default
2017-11-07T12:13:33+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/03b5b47eaecad8dca4d4deac422020598e79f8aa
Release version 1.8.1
2017-11-07T10:45:47+01:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/4feea25f68d06df7ac289a416a88e22cb681dd72
neomigrate: fix typo in a warning message
2017-10-27T15:58:31+02:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/65bc5c118fbc5692e71c76b206ac897b0677ada6
fixup! storage: fix possible crash when delaying replication requests
2017-09-29T20:14:40+02:00 Julien Muchembled <jm@nexedi.com>
This reverts commit d3c22487 partially
and fixes the bug in a much simpler way.

https://lab.nexedi.com/nexedi/neoppod/-/commit/7e186442e561d70afdc61a02ab1a6a6d341f3e9a
neoctl: make cell padding consistent when displaying the partition table
2017-09-29T18:22:48+02:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/d3c22487c0fc207b543816ddc6170944ce4b4d3f
storage: fix possible crash when delaying replication requests
2017-09-29T16:10:27+02:00 Julien Muchembled <jm@nexedi.com>
Traceback (most recent call last):
  [...]
  File "neo/storage/handlers/client.py", line 115, in askStoreObject
    *e.args)
  File "neo/lib/handler.py", line 333, in queueEvent
    self.sortQueuedEvents()
  File "neo/lib/handler.py", line 326, in <lambda>
    self._event_queue.sort(key=key))()
  File "neo/storage/transactions.py", line 67, in __lt__
    return self.locking_tid < other.locking_tid
AttributeError: 'NoneType' object has no attribute 'locking_tid'
Pending events:
(None, <askFetchTransactions: ...>)
(<Transaction(C13, locking_tid=03c266508a058388, tid=None, age=0.21s) at 0x7f086bbc3d50>, <_askStoreObject: ...>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/49631a9f23492f886ec50cf1a8bfd68ee4295bc5
qa: bug found in assignment of storage node ids, add test
2017-09-11T19:44:00+02:00 Julien Muchembled <jm@nexedi.com>

https://lab.nexedi.com/nexedi/neoppod/-/commit/aeeaef8942ac58c31b2af72f0c0ec08cbca23daf
Update comment of RECOVERING state
2017-09-05T17:35:30+02:00 Julien Muchembled <jm@nexedi.com>
It was out-of-date since commit 23b6a66a.

https://lab.nexedi.com/nexedi/neoppod/-/commit/524ec26973a7867771e9145cd6f1263b1f7fd8f4
Add support for OpenSSL >= 1.1
2017-08-29T01:36:58+02:00 Julien Muchembled <jm@nexedi.com>