Commits · v1.10 · nexedi / neoppod

16 Jul, 2018 1 commit
- Release version 1.10 · 1ef5c1ba
  Julien Muchembled authored Jul 16, 2018
  
  1ef5c1ba
22 Jun, 2018 2 commits

Maximize resiliency by taking into account the topology of storage nodes · 97af23cc

Julien Muchembled authored Jun 18, 2018

This commit adds a constraint when tweaking the partition table with replicas,
so that cells of each partition are assigned as far as possible from each
other, e.g. not on the same machine even if each one has several disks, and
in any case not on the same storage device.

Currently, the topology path of each node is automatically calculated by the
storage backend. Both MySQL and SQLite return a 2-tuple (host, st_dev).
To be improved:
- Add a storage option to override the path: the 'tweak' algorithm can already
  handle topology paths of any length, so something like (room, machine, disk)
  could be done easily.
- Write OS-specific code to determine the real hardware behind st_dev
  (e.g. 2 different 'st_dev' values may actually refer to the same disk,
   because of layers like partitioning, device-mapper, loop, btrfs subvolumes,
   and so on).
- Make 'neoctl' report in some way if the PT is optimal. Meanwhile,
  if it isn't, the master only logs a WARNING during tweak.
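
As a rough illustration of the constraint described above, the sketch below builds a (host, st_dev) path for a directory and scores how much topology two paths share. The helper names `topology_path`, `shared_depth` and `farthest_candidate` are hypothetical and not taken from NEO; the real 'tweak' algorithm has to balance many more things.

```python
import os
import socket


def topology_path(data_dir):
    """Return a (host, st_dev) tuple for the device holding data_dir.

    Hypothetical helper mirroring the 2-tuple described above, not NEO's code.
    """
    return socket.gethostname(), os.stat(data_dir).st_dev


def shared_depth(path_a, path_b):
    """Count the leading path components that two nodes have in common.

    For 2-tuples: 0 means different machines, 1 means same machine but
    different devices, 2 means the same storage device.
    """
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth


def farthest_candidate(assigned, candidates):
    """Pick the candidate sharing the least topology with assigned cells."""
    return min(candidates,
               key=lambda c: max([shared_depth(c, a) for a in assigned] or [0]))
```

Because `shared_depth` only compares prefixes, the same scoring would apply unchanged to longer paths such as (room, machine, disk).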

97af23cc

storage: also commit updated cell TID at each replicated chunk of 'obj' records · d4ea398d

Julien Muchembled authored Jun 22, 2018

This is a follow-up of commit b3dd6973
("Optimize resumption of replication by starting from a greater TID").
I missed the case where a storage node is restarted while it is replicating:
it lost the TID where it was interrupted.

Although we commit after each replicated chunk, to avoid transferring again
all the data from the beginning, it could still waste time to check that
the data are already replicated.

d4ea398d

21 Jun, 2018 1 commit
- storage: skip useless work when unlocking transactions · 745ee2b2
  Julien Muchembled authored Jun 21, 2018
  
  745ee2b2
19 Jun, 2018 4 commits
- qa: flush logs at the end of each test when -L is not used · 67df59ad
  Julien Muchembled authored Jun 15, 2018
```
Otherwise, they were either lost or flushed to a file of a next test.
```
  67df59ad
- qa: add a log in case that a mysterious bug happens again · 442bb43a
  Julien Muchembled authored Jun 19, 2018
```
The bug is likely to be in the test rather than in NEO.
```
  442bb43a
- storage: clarify log about data deletion of discarded cells · a0dd4a3b
  Julien Muchembled authored Jun 19, 2018
  
  a0dd4a3b
- debug: new example to run the profiler for 1 minute · d612fc84
  Julien Muchembled authored Jun 18, 2018
  
  d612fc84
04 Jun, 2018 1 commit
- mysql: fix replication of big oids (> 16M) · a992f21a
  Julien Muchembled authored Jun 04, 2018
  
  a992f21a
31 May, 2018 1 commit

tests/cluster: speedup waiting a bit · d08c83d4

Kirill Smelkov authored Apr 04, 2018

NEO functional tests use pdb.wait() in a few places, for example in
NEOCluster .run(), .start() and .expectCondition(). The wait
implementation uses polling with exponentially growing wait period.

With the following instrumentation

	--- a/neo/tests/cluster.py
	+++ b/neo/tests/cluster.py
	@@ -236,6 +236,7 @@ def wait(self, test, timeout):
	                         return False
	             finally:
	                 cluster_dict.release()
	+            print 'next_sleep:', next_sleep
	             sleep(next_sleep)
	         return True

during execution of functional tests it is not uncommon to see the
following sleep periods

	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.15
	next_sleep: 0.225
	next_sleep: 0.3375
	next_sleep: 0.50625
	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.15
	next_sleep: 0.225
	next_sleep: 0.3375
	next_sleep: 0.50625
	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.15
	next_sleep: 0.225
	next_sleep: 0.3375
	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.1
	next_sleep: 0.15
	next_sleep: 0.225
	next_sleep: 0.3375
	next_sleep: 0.50625


Without going into reworking the wait mechanism to use real
notifications instead of polling, it was observed that the exponential
progression tends to create too coarse sleeps. The initial 0.1s interval was
also found to be too long.

This patch removes the exponential period growth and reduces the period by one
order of magnitude. For functional tests timings on my computer it is thus:

before patch:

	Functional tests

	28 Tests, 0 Failed

	Title                     : TestRunner
	Date                      : 2018-04-04
	Node                      : deco
	Machine                   : x86_64
	System                    : Linux
	Python                    : 2.7.14

	Directory                 : /tmp/neo_tests/1522868674.115798
	Status                    : 100.000%
	NEO_TESTS_ADAPTER         : SQLite

	                               NEO TESTS REPORT

	              Test Module |  run  | unexpected | expected | skipped |  time
	--------------------------+-------+------------+----------+---------+----------
	                   Client |    6  |       .    |      .   |     .   |   8.51s
	                  Cluster |    7  |       .    |      .   |     .   |   9.84s
	                   Master |    4  |       .    |      .   |     .   |   9.68s
	                  Storage |   11  |       .    |      .   |     .   |  20.76s
	--------------------------+-------+------------+----------+---------+----------
	     neo.tests.functional |       |            |          |         |
	--------------------------+-------+------------+----------+---------+----------
	                  Summary |   28  |       .    |      .   |     .   |  48.79s
	--------------------------+-------+------------+----------+---------+----------

after patch:

	Functional tests

	28 Tests, 0 Failed

	Title                     : TestRunner
	Date                      : 2018-04-04
	Node                      : deco
	Machine                   : x86_64
	System                    : Linux
	Python                    : 2.7.14

	Directory                 : /tmp/neo_tests/1522868527.624376
	Status                    : 100.000%
	NEO_TESTS_ADAPTER         : SQLite

	                               NEO TESTS REPORT

	              Test Module |  run  | unexpected | expected | skipped |  time
	--------------------------+-------+------------+----------+---------+----------
	                   Client |    6  |       .    |      .   |     .   |   7.38s
	                  Cluster |    7  |       .    |      .   |     .   |   7.05s
	                   Master |    4  |       .    |      .   |     .   |   8.22s
	                  Storage |   11  |       .    |      .   |     .   |  19.22s
	--------------------------+-------+------------+----------+---------+----------
	     neo.tests.functional |       |            |          |         |
	--------------------------+-------+------------+----------+---------+----------
	                  Summary |   28  |       .    |      .   |     .   |  41.87s
	--------------------------+-------+------------+----------+---------+----------

in other words ~ 10% improvement for the whole time to run functional tests.
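
The change described above can be sketched as follows. This is not the code from neo/tests/cluster.py: the function names, the 0.01s period (one order of magnitude below the former 0.1s) and the 1.5 growth factor (inferred from the trace) are assumptions made for illustration.

```python
import time


def wait(test, timeout, period=0.01):
    """Poll test() every `period` seconds until it is true or timeout expires.

    period=0.01 is an assumption: one order of magnitude below 0.1s.
    """
    deadline = time.monotonic() + timeout
    while not test():
        if time.monotonic() >= deadline:
            return False
        time.sleep(period)
    return True


def exponential_periods(first=0.1, factor=1.5, count=5):
    """Sleep periods of the previous scheme; the factor is inferred, not quoted."""
    return [first * factor ** i for i in range(count)]
```

With a fixed short period, the worst-case delay between the condition becoming true and `wait` noticing it is bounded by `period`, whereas with exponential growth it could reach the largest sleep of the progression.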

/reviewed-by @vpelletier, @jm
/reviewed-on !10

d08c83d4

30 May, 2018 6 commits

protocol: update packet docstrings · 9f0f2afe
Julien Muchembled authored Jan 18, 2018
```
/reviewed-on !9
```
9f0f2afe
Bump protocol version · f62f9bc9
Julien Muchembled authored Jan 31, 2018

f62f9bc9
protocol: a single byte is more than enough to encode enums · 52db5607
Julien Muchembled authored Jan 31, 2018

52db5607

protocol: small cleanup in packet registration · a00ab78b

Julien Muchembled authored Jan 18, 2018

I made a mistake in commit 13a64cfe
("Simplify definition of packets by computing automatically their codes").
My intention was that the code of an answer packet continues to differ from
that of its request only by the highest bit, as implemented now by this commit.

Before:
  0x0001, 0x8002   Ask1, Answer1
  0x0003           Notify2
  0x0004, 0x8005   Ask3, Answer3
  0x0006, 0x8007   Ask4, Answer4

After:
  0x0001, 0x8001   Ask1, Answer1
  0x0002           Notify2
  0x0003, 0x8003   Ask3, Answer3
  0x0004, 0x8004   Ask4, Answer4

This makes the protocol easier to document.

And by not wasting the range of possible values, it seems we have enough
space to shrink to a single byte.

This also removes code that became meaningless now that codes are generated
automatically.
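
A minimal sketch of the numbering shown in the "After" table, assuming 16-bit codes. The names `RESPONSE_MASK`, `assign_codes`, `is_answer` and `request_code` are hypothetical and do not come from NEO's protocol module.

```python
RESPONSE_MASK = 0x8000  # highest bit of a 16-bit code, as in the table above


def assign_codes(packets):
    """Map packet names to codes.

    `packets` is a list of (request_name, answer_name_or_None) pairs.
    Every request or notification consumes exactly one sequential code;
    its answer, if any, reuses that code with the highest bit set.
    """
    codes = {}
    for code, (request, answer) in enumerate(packets, 1):
        codes[request] = code
        if answer is not None:
            codes[answer] = code | RESPONSE_MASK
    return codes


def is_answer(code):
    """Tell whether a code has the highest bit set."""
    return bool(code & RESPONSE_MASK)


def request_code(code):
    """Return the code of the request that an answer code belongs to."""
    return code & ~RESPONSE_MASK
```

Since no code is skipped any more, the number of distinct requests and notifications equals the highest code in use.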

a00ab78b

Optimize resumption of replication by starting from a greater TID · b3dd6973

Julien Muchembled authored May 17, 2018

Although data that are already transferred aren't transferred again, checking
that the data are there for a whole partition can still be a lot of work for
big databases. This commit is a major performance improvement in that a storage
node that gets disconnected for a short time now becomes fully operational
almost instantaneously because it only has to replicate the new data. Before,
the time to recover depended on the size of the DB.

For OUT_OF_DATE cells, the difficult part was that they are writable and
can then contain holes, so we can't just take the last TID in trans/obj
(we wrongly did that at the beginning, and then committed
6b1f198f as a workaround). We solve that
by storing up to where it was up-to-date: this value is initialized from
the last TIDs in trans/obj when the state switches from UP_TO_DATE/FEEDING.

There's actually one such OUT_OF_DATE TID per assigned cell (backends store
these values in the 'pt' table). Otherwise, a cell that still has a lot to
replicate would still cause all other cells to resume from a very small
TID, or even ZERO_TID; the worst case is when a new cell is assigned to a node
(as a result of tweak).

For UP_TO_DATE cells of a backup cluster, replication was resumed from the
maximum TID at which all assigned cells are known to be fully replicated.
Like for OUT_OF_DATE cells, the presence of a late cell could cause a lot of
extra work for others, the worst case being when setting up a backup cluster
(it always restarted from ZERO_TID as long as at least 1 cell was still empty).
Because UP_TO_DATE cells are guaranteed to have no holes, there's no need to
store extra information: we simply look at the last TIDs in trans/obj.
We even handle trans & obj independently, to minimize the work in 1 table
(i.e. trans since it's processed first) if the other is late (obj).

There's a small change in the protocol so that OUT_OF_DATE enum value equals 0.
This way, backends can store the OUT_OF_DATE TID (verbatim) in the same column
as the cell state.
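
One way a single column could hold both pieces of information is sketched below. The sign-based encoding, the function names and the enum values other than OUT_OF_DATE = 0 are assumptions for illustration; the commit message only states that the OUT_OF_DATE TID is stored verbatim in the same column as the cell state.

```python
OUT_OF_DATE = 0   # stated above
UP_TO_DATE = 1    # illustrative value, an assumption
FEEDING = 2       # illustrative value, an assumption


def encode_cell(state, tid=0):
    """Pack a cell state and its OUT_OF_DATE TID into one integer column."""
    if state == OUT_OF_DATE:
        return tid          # TID stored verbatim, always >= 0
    return -state           # other states stored negated (assumed encoding)


def decode_cell(value):
    """Return (state, tid); tid is None unless the cell is OUT_OF_DATE."""
    if value >= 0:
        return OUT_OF_DATE, value
    return -value, None


def resume_tid(value, last_tid):
    """TID from which replication of one cell could resume.

    last_tid stands for the last TID found in trans or obj for the partition;
    it is only trusted when the cell is not OUT_OF_DATE (no holes).
    """
    state, tid = decode_cell(value)
    return tid if state == OUT_OF_DATE else last_tid
```

Under this assumed encoding, a freshly assigned cell with value 0 decodes as OUT_OF_DATE with TID 0, i.e. replication from the very beginning.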

Note about MySQL changes in commit ca58ccd7:
what we did as a workaround is not one any more. Now, we do so much on Python
side that it's unlikely we could reduce the number of queries using GROUP BY.
We even stopped doing that for SQLite.

b3dd6973

importer: update comment about a workaround for ZODB3 · fa9664ee
Julien Muchembled authored May 29, 2018

fa9664ee

25 May, 2018 1 commit
- Micro-optimization of p64/u64 · ef387448
  Julien Muchembled authored May 25, 2018
  
  ef387448
24 May, 2018 9 commits
- qa: add a log in testBackupNodeLost for easier debugging · 365c4398
  Julien Muchembled authored May 24, 2018
  
  365c4398
- Document that the bug when checking replicas may also cause the master to crash · e7c2051f
  Julien Muchembled authored May 24, 2018
  
  e7c2051f
- storage: stop logging 'Abort TXN' for txn that have been locked · f7cf8f07
  Julien Muchembled authored May 24, 2018
```
It was confusing and there's already the 'Unlock TXN' log just before abort()
is called (in this case, it's more a cleanup than an abort).
```
  f7cf8f07
- storage: split _migrate2() for reusable _alterTable() · d9b98671
  Julien Muchembled authored May 17, 2018
```
Future migration steps are likely to alter tables, possibly with
transformation of data, and this is complicated for both supported backends.
```
  d9b98671
- qa: new testStorageUpgrade · e2dacd6a
  Julien Muchembled authored May 24, 2018
  
  e2dacd6a
- qa: update testStorageUpgrade data for what is not automatically upgraded · 477e0e44
  Julien Muchembled authored May 22, 2018
```
Some changes in the storage format are minor and applying them automatically
would cost too much for big databases.

Here, we apply them manually so that testStorageUpgrade will be able to
compare dumps.

We hope however that with improvements like
  https://jira.mariadb.org/browse/MDEV-12836
we'll be able to implement more migration steps
and revert parts of this commit.
```
  477e0e44
- qa: original data for the future testStorageUpgrade · 933579f5
  Julien Muchembled authored May 22, 2018
```
These dumps were generated with an old version of NEO, plus a backport of the
test that will use them.

In MySQL dumps, --hex-blob was used only for inserts in the 'data' table.
```
  933579f5
- sqlite: fix indexes of upgraded db · 791900c7
  Julien Muchembled authored May 22, 2018
  
  791900c7
- importer: fix NameError when recovering during tpc_finish · 6dcda4e6
  Julien Muchembled authored May 21, 2018
  
  6dcda4e6
17 May, 2018 1 commit

fixup! importer: fetch and process the data to import in a separate process · dc220d04

Julien Muchembled authored May 17, 2018

- for FileStorage DB, make sure a transaction index is built at most once
- for other DB types, reopen the DB in the subprocess

Now that we have specific code for FileStorage, the generic case is not tested
anymore. We should add a test using ZEO. Or better, and in some way crazy,
one with NEO, but one would need to fix a special case in getObject.

dc220d04

16 May, 2018 5 commits

Serialize empty transaction extension with an empty string · a6d4c4e9

Julien Muchembled authored May 15, 2018

The protocol version is increased to ensure that client nodes are able to
handle an empty 'extension' field in AnswerTransactionInformation.

It also means that once new transactions are written, going back to a previous
revision is not possible.
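
The idea can be sketched as below, assuming the extension is a dict serialized with pickle. The function names are hypothetical and the exact serialization used by NEO is not shown in this commit message.

```python
import pickle


def dump_extension(extension):
    """Serialize a transaction extension dict; an empty one becomes b''."""
    return pickle.dumps(extension) if extension else b""


def load_extension(data):
    """Inverse of dump_extension: an empty string means no extension."""
    return pickle.loads(data) if data else {}
```

A reader that always unpickles the field would fail on the empty string, which is consistent with the protocol version bump mentioned above.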

a6d4c4e9

client: fix partial import from a source storage · 346c9d00

Julien Muchembled authored May 15, 2018

The correct way to specify a start/stop tid is when constructing the 'source'
object, hence the removal of start/stop args. In fact, source.iterator()
does not always take such args.

On the other hand, when resuming import, Application.importFrom must cope
with an incomplete preindex.

346c9d00

qa: give a title to subprocesses of functional tests · b648904b
Julien Muchembled authored May 07, 2018
```
Same as previous commit: only cosmetics so optional.
```
b648904b

importer: give a title to the 'import' and 'writeback' subprocesses · 461df152

Julien Muchembled authored May 07, 2018

'title' means both process name and command line.

This is cosmetics so it won't fail if the 'setproctitle' module
is not available.
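
A minimal sketch of such an optional dependency, using the real `setproctitle.setproctitle()` function when the module is installed. The `set_title` helper and the title format are hypothetical.

```python
try:
    from setproctitle import setproctitle
except ImportError:
    # Cosmetic feature only: fall back to a no-op when the module is missing.
    def setproctitle(title):
        pass


def set_title(role, args=()):
    """Build a title from a role name and arguments, apply it and return it."""
    title = " ".join((role,) + tuple(args))
    setproctitle(title)
    return title
```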

461df152

importer: fetch and process the data to import in a separate process · 05bf48de

Julien Muchembled authored May 02, 2018

A new subprocess is used to:
- fetch data from the source DB
- repickle to change oids (when merging several DB)
- compress
- checksum

This is mostly useful for the second step, which is much slower than
any other step, while not releasing the GIL.

By using a second CPU core, it is also often possible to use a better
compression algorithm for free (e.g. zlib=9). Actually, smaller data can speed
up the writing process.

In addition to greatly speeding up the import by parallelizing fetch+process
with write, it also makes the main process more reactive to queries from client
nodes.
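
The producer/consumer split described above could look like the following sketch. It is not the importer's code: the function names, the bounded queue, the use of SHA-1 and the choice to checksum the compressed bytes are assumptions, and the repickling step is left out.

```python
import hashlib
import zlib
from multiprocessing import Process, Queue


def process_record(data, level=9):
    """Compress one record and checksum the compressed bytes (assumed order)."""
    compressed = zlib.compress(data, level)
    return hashlib.sha1(compressed).digest(), compressed


def producer(records, queue):
    """Runs in the subprocess: process each record and hand it to the writer."""
    for data in records:
        queue.put(process_record(data))
    queue.put(None)  # sentinel: nothing left to import


def import_all(records, write):
    """Overlap processing (subprocess) with writing (main process)."""
    queue = Queue(maxsize=16)  # bounded, so the producer cannot run far ahead
    worker = Process(target=producer, args=(records, queue))
    worker.start()
    for checksum, compressed in iter(queue.get, None):
        write(checksum, compressed)
    worker.join()
```

Calling `import_all(records, write)` from a script guarded by `if __name__ == "__main__":` keeps the main process free to write, and to serve other requests, while the CPU-bound steps run on a second core.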

05bf48de

15 May, 2018 1 commit

importer: new option to write back new transactions to the source database · 30a02bdc

Julien Muchembled authored Apr 19, 2018

By doing the work with secondary connections to the underlying databases,
asynchronously and in a separate process, this should have minimal impact on
the performance of the storage node. Extra complexity comes from backends that
may lose connection to the database (here MySQL): this commit fully implements
reconnection.

30a02bdc

11 May, 2018 3 commits
- importer: log when the transaction index for FileStorage DB is built · 2fae3e54
  Julien Muchembled authored Apr 19, 2018
  
  2fae3e54
- importer: open imported zodb in read-only whenever possible · db20bf37
  Julien Muchembled authored Apr 16, 2018
```
For FileStorage DB, this avoids:
- keeping a lock on the source DB during the whole import,
- saving the whole index when the import was resumed.
```
  db20bf37
- fixup! mysql: fix remaining places where a server disconnection was not catched · 26f898c1
  Julien Muchembled authored May 09, 2018
  
  26f898c1
07 May, 2018 4 commits
- fixup! storage: speed up replication by sending bigger network packets · 1a064725
  Julien Muchembled authored May 07, 2018
  
  1a064725
- mysql: do not full-scan for duplicates of big oids if deduplication is disabled · 156da51c
  Julien Muchembled authored Apr 19, 2018
  
  156da51c
- mysql: fix remaining places where a server disconnection was not catched · a63b45fe
  Julien Muchembled authored Apr 19, 2018
  
  a63b45fe
- fixup! Add support for custom compression levels · fec86e26
  Julien Muchembled authored May 04, 2018
  
  fec86e26