Commits · 3a14526f64862b7d3a1c6a0d558cb19c9e17f78c · Vincent Pelletier / neoppod

21 May, 2024 3 commits
- SQUASH Use MAX_TID as stand-in for "no known First TID" · 3a14526f
  Vincent Pelletier authored May 21, 2024
```
This simplifies the code a lot.
```
  3a14526f
- SQUASH Fix storage.database.importer getFirstTID · 08ce5273
  Vincent Pelletier authored May 21, 2024
  
  08ce5273
- SQUASH master.transaction: Avoid the overhead of min_tid on every _unlockPending · 59b0f182
  Vincent Pelletier authored May 17, 2024
```
Instead, when the transaction manager is reset, mask _unlockPending with a
method that calls setMinTID, then unmasks and calls the original
_unlockPending.
```
  59b0f182
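The masking trick described in 59b0f182 can be sketched as follows. This is a minimal illustration, not the actual neoppod code: the class body, the counters and the argument-less `setMinTID` are assumptions made for the example. An instance attribute shadows the class method until the first call, which removes the attribute again.

```python
class TransactionManager(object):
    """Toy stand-in; only the masking trick is illustrated."""

    def __init__(self):
        self.min_tid_calls = 0   # hypothetical counters, for demonstration
        self.unlocked = 0
        self.reset()

    def reset(self):
        # Shadow the class method with a one-shot wrapper on the instance.
        self._unlockPending = self._firstUnlockPending

    def setMinTID(self):
        self.min_tid_calls += 1

    def _firstUnlockPending(self):
        self.setMinTID()
        # Unmask: deleting the instance attribute exposes the class method.
        del self._unlockPending
        self._unlockPending()

    def _unlockPending(self):
        self.unlocked += 1
```

After the first call following a reset, `_unlockPending` resolves directly to the class method, so later calls pay no extra cost.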
17 May, 2024 1 commit

SQUASH Apply part of first review · 61fa401a

Vincent Pelletier authored May 17, 2024

Mark a line to be folded back once migrated to python 3.
Make storage.database.manager's getFirstTID return the tid packed.
Also, update docstring to stop saying the result is unpacked.
Also, fix None handling in the storage.database.manager and in
master.transaction.

61fa401a

16 May, 2024 2 commits

master: Forbid truncation before database's first transaction · 6dffb894

Vincent Pelletier authored May 16, 2024

This is intended as a sanity check, so simple typos in neoctl truncate
command do not easily lead to the entire database being wiped.

6dffb894
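A sanity check of this kind could look like the sketch below. It is an assumption-laden illustration: the function names, the exception type and the value of `MAX_TID` are chosen for the example, and TIDs are taken to be packed 8-byte big-endian strings, so that byte-wise comparison matches numeric order. Using `MAX_TID` as the "no known first TID" stand-in (as in 3a14526f) lets `min()` work without any `None` handling.

```python
import struct

p64 = struct.Struct('>Q').pack  # pack an integer into 8 big-endian bytes

# Assumed sentinel: compares greater than or equal to any real TID.
MAX_TID = b'\x7f' + b'\xff' * 7

def first_tid(known_first_tids):
    """Smallest known first TID, or MAX_TID when none is known."""
    return min(list(known_first_tids) + [MAX_TID])

def check_truncate(tid, db_first_tid):
    """Refuse to truncate before the first transaction of the database."""
    if tid < db_first_tid:
        raise ValueError("truncate TID is before the first transaction")
    return tid
```

With this convention, a cluster with no known first TID refuses every truncation, which may or may not match the real behaviour.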

neoctl: Change the expected tid-or-timestamp format · f70a688c

Vincent Pelletier authored May 16, 2024

Before this change, the only distinction between a timestamp and a TID was
the presence of the decimal separator, ".". As a result, a timestamp
mistakenly provided without a decimal separator would be interpreted as a
TID, which would fall somewhere in January 1900 (as TIDs are 64 bits wide,
with much finer resolution than timestamps). When used to truncate a
database, and in the absence of sanity checks, this would simply wipe the
database.

So, instead of just relying on a decimal separator, require a longer
string. Make it a prefix for readability. Also, since TIDs are more niche
than timestamps, require them to carry the mark, and do not require
anything of timestamps.

f70a688c
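To illustrate the idea in f70a688c, here is a small sketch. The `tid:` prefix is hypothetical (the commit message does not name the actual mark), and the decoder assumes the usual ZODB TID layout: the upper 32 bits count minutes since 1900 using 12-month years and 31-day months, the lower 32 bits hold a fraction of a minute.

```python
TID_PREFIX = "tid:"  # hypothetical mark, not taken from the commit

def parse_tid_or_timestamp(text):
    """Return ('tid', int) for marked values, ('timestamp', float) otherwise."""
    if text.startswith(TID_PREFIX):
        # int(..., 0) accepts decimal as well as 0x-prefixed hexadecimal.
        return 'tid', int(text[len(TID_PREFIX):], 0)
    return 'timestamp', float(text)

def tid_to_fields(tid):
    """Decode the upper 32 bits of a TID into (year, month, day, hour, minute)."""
    v = tid >> 32
    v, minute = divmod(v, 60)
    v, hour = divmod(v, 24)
    v, day = divmod(v, 31)
    year, month = divmod(v, 12)
    return year + 1900, month + 1, day + 1, hour, minute
```

Under these assumptions, a Unix timestamp such as 1715817600 read as a TID has all-zero upper bits and decodes to 1900-01-01 00:00, which is why an unmarked integer is safer treated as a timestamp.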

09 May, 2024 4 commits
- sqlite: fix performance issue in replication · c4443632
  Julien Muchembled authored May 09, 2024
  
  c4443632
- sqlite: add support for cksumvfs extension · 5923f8f5
  Julien Muchembled authored May 07, 2024
  
  5923f8f5
- debug: add snippet to ask a storage to commit · 9722d241
  Julien Muchembled authored May 07, 2024
  
  9722d241
- mysql: "disable" wait_timeout · a6edf36b
  Julien Muchembled authored Apr 21, 2024
  
  a6edf36b
16 Apr, 2024 1 commit
- sqlite: accept -d URI, where query string can configure the DB connection · 71564067
  Julien Muchembled authored Apr 10, 2024
  
  71564067
22 Mar, 2024 6 commits
- pack: some cleanup & better error handling · a0280bec
  Julien Muchembled authored Mar 12, 2024
  
  a0280bec
- simple: new --autostart option · 70277a73
  Julien Muchembled authored Mar 11, 2024
  
  70277a73
- mysql: minor optimization · cad01d2e
  Julien Muchembled authored Mar 04, 2024
  
  cad01d2e
- storage: reject transactions that affect too many OIDs (rather than crashing) · 0b414488
  Julien Muchembled authored Feb 26, 2024
  
  0b414488
- New API to iterate over non-deleted OIDs · 071c6bf5
  Julien Muchembled authored Feb 07, 2024
```
This will be used by an external GC.

To be pushed upstream.
```
  071c6bf5
- fixup! Fix use of several EpollEventManager within the same process · ad62c5c7
  Julien Muchembled authored Mar 20, 2024
```
set_wakeup_fd only works in main thread.

See commit f47dd646.
```
  ad62c5c7
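The constraint mentioned in ad62c5c7 comes from Python itself: `signal.set_wakeup_fd` raises `ValueError` when called outside the main thread. A minimal guard might look like this (the helper name is an assumption, and this is not necessarily how the fixup handles it):

```python
import signal
import threading

def install_wakeup_fd(fd):
    """Call signal.set_wakeup_fd(fd) only from the main thread.

    Returns the previous wakeup fd, or None when the call was skipped.
    """
    if threading.current_thread() is not threading.main_thread():
        return None
    return signal.set_wakeup_fd(fd)
```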
22 Feb, 2024 8 commits
- client: check type of 'oid' parameter when loading an object · ffecd4ae
  Julien Muchembled authored Feb 08, 2024
```
Since the switch to msgpack, there's no more type checking at protocol level
and for example passing an integer would cause storage nodes to crash.

But as shown here, the type checking of the old protocol was not always
enough, because data structures on the client side could still end up wrong.
```
  ffecd4ae
- client: new ignore-wrong-checksum option · 20791999
  Julien Muchembled authored Feb 22, 2024
  
  20791999
- doc: only 1 log file per process · 3531ee9e
  Julien Muchembled authored Jan 30, 2024
  
  3531ee9e
- master: prevent importing transaction with invalid TID · ee6e413e
  Julien Muchembled authored Feb 07, 2024
```
Else it leads to DB corruption and a crash of the master.
```
  ee6e413e
- client: raise NEOStorageError instead of POSException.StorageError on protocol error · 9f5e14f1
  Julien Muchembled authored Feb 07, 2024
  
  9f5e14f1
- client: report tid when logging records with wrong checksum · 4039f4da
  Julien Muchembled authored Jan 23, 2024
  
  4039f4da
- Fix use of several EpollEventManager within the same process · f47dd646
  Julien Muchembled authored Jan 22, 2024
```
This fixes commit 0e43dd1f
("Fix signals not always being processed as soon as possible").
```
  f47dd646
- fixup! neoctl: fix exit status code if not ready · 2f760fa5
  Julien Muchembled authored Feb 22, 2024
```
See commit b6f821a2.
```
  2f760fa5
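For the 'oid' type check introduced in ffecd4ae above, a client-side validation could be as simple as the following sketch. The function name and the extra length check are assumptions; the 8-byte size follows the usual ZODB convention for packed OIDs.

```python
def check_oid(oid):
    """Return `oid` unchanged if it looks like a packed 8-byte OID."""
    if not isinstance(oid, bytes):
        # e.g. an integer, which storage nodes cannot handle
        raise TypeError("oid must be bytes, not %s" % type(oid).__name__)
    if len(oid) != 8:
        raise ValueError("oid must be 8 bytes long, got %d" % len(oid))
    return oid
```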
18 Dec, 2023 4 commits

client: Don't allow oPtion_nAme in zurl · 798c9f25

Kirill Smelkov authored Dec 13, 2023

Julien notes this is very likely unneeded:
nexedi/neoppod!21 (diffs, comment 195929)

We had it like this since 01a01c8c (client: Add support for zodburi),
but I rechecked the zodburi codebase now and it does not do any similar
lowercasing anywhere.

So drop support for case normalization in zurl options.

/cc @levin.zimmermann
/reviewed-by @jm
/reviewed-on nexedi/neoppod!21

798c9f25

app: Remember SSL credentials so that it is possible to retrieve them · 17af7f27

Kirill Smelkov authored Dec 12, 2023

Unfortunately after creating SSL context it is not possible, or at least
I could not find how, to retrieve original credentials with which the
context was created. However wendelin.core needs to be able to take a
client storage, reconstruct zurl to refer to that particular storage,
and pass that zurl to wcfs, so that wcfs, in turn, could access the same
ZODB database.

Given a NEO client instance, it is already possible to retrieve
master_nodes, cluster name, and detect whether SSL is in use.
However, without being able to retrieve the original SSL credentials,
the reconstructed zurl will not be complete and wcfs won't be able to use
exactly the same secrets as the Python part does.

-> Help wendelin.core by remembering which ca/cert/key were used to
build SSL context.

This information is used by zstor_2zurl in wendelin.core here:

https://lab.nexedi.com/nexedi/wendelin.core/blob/885b3556/lib/zodb.py#L390-418

/cc @levin.zimmermann
/reviewed-by @jm
/reviewed-on nexedi/neoppod!21

17af7f27
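The approach of 17af7f27 can be pictured with the sketch below. The class and attribute names are hypothetical and the way the context is built is only one plausible option using the standard `ssl` module; the point is that the paths are stored next to the context because, per the commit message, they could not be read back from it.

```python
import ssl

class ClientApp(object):
    """Hypothetical application object; only credential handling is shown."""

    ssl_credentials = None  # (ca, cert, key), or None when TLS is not used

    def __init__(self, ca=None, cert=None, key=None):
        if ca and cert and key:
            # Remember the paths so that a zurl can be reconstructed later.
            self.ssl_credentials = ca, cert, key

    def make_context(self):
        if self.ssl_credentials is None:
            return None
        ca, cert, key = self.ssl_credentials
        context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH,
                                             cafile=ca)
        context.load_cert_chain(cert, key)
        return context
```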

client: Allow to force TLS via neos:// scheme · bc3e38ea

Kirill Smelkov authored Dec 11, 2023

Similarly to how it is done with e.g. http:// and https://: if neos://
is given, TLS usage is forced and ca/cert/key must be there either in the
URI itself, or in the $NEO_CA, $NEO_CERT and $NEO_KEY environment variables,
mimicking the way that e.g. for https:// TLS credentials are taken from
the host environment, not from the URI.

The latter might be a usability convenience, but is also useful for WCFS,
which needs to be able to remove secrets from the URI on zurl normalization.

Please see discussion at nexedi/neoppod!18 (comment 184439)
for details.

/cc @levin.zimmermann
/reviewed-by @jm
/reviewed-on nexedi/neoppod!21

bc3e38ea
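The lookup order described in bc3e38ea (URI first, then environment) might be sketched like this. The function name, the return shape and the error messages are assumptions; only the scheme names, the ca/cert/key options and the three environment variables come from the commit message.

```python
import os
from urllib.parse import urlsplit, parse_qsl

def tls_credentials(zurl, environ=os.environ):
    """Return {'ca': ..., 'cert': ..., 'key': ...} for a neo:// or neos:// URI."""
    u = urlsplit(zurl)
    if u.scheme not in ('neo', 'neos'):
        raise ValueError("unsupported scheme %r" % u.scheme)
    options = dict(parse_qsl(u.query))
    creds = {k: options.get(k) for k in ('ca', 'cert', 'key')}
    if u.scheme == 'neos':
        # TLS is forced: fall back to $NEO_CA, $NEO_CERT and $NEO_KEY.
        for k in creds:
            if creds[k] is None:
                creds[k] = environ.get('NEO_' + k.upper())
        missing = [k for k, v in creds.items() if not v]
        if missing:
            raise ValueError("neos:// requires %s" % ', '.join(missing))
    return creds
```

Taking credentials from the environment keeps secrets out of the URI, which is what zurl normalization in WCFS needs.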

client: Don't allow master_nodes and name to be present in options · 22ccebd6

Kirill Smelkov authored May 18, 2023

Because the list of masters and the cluster name must already be present in
netloc and path. Previously e.g.

	neo://db@α,β,γ?master_nodes=a,b,c

would mean to use master nodes {a,b,c} not {α,β,γ}. Now it is treated as
an invalid URI to remove ambiguity. Same for cluster name.

/cc @levin.zimmermann
/reviewed-by @jm
/reviewed-on nexedi/neoppod!21

22ccebd6
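The rule from 22ccebd6 amounts to rejecting two option names while parsing the query string. A minimal sketch, with a hypothetical function name and error message:

```python
from urllib.parse import urlsplit, parse_qsl

# Already carried by other parts of the URI, so not allowed as options.
FORBIDDEN_OPTIONS = ('master_nodes', 'name')

def parse_options(zurl):
    """Return the query-string options of a zurl as a dict."""
    options = dict(parse_qsl(urlsplit(zurl).query))
    for key in FORBIDDEN_OPTIONS:
        if key in options:
            raise ValueError("invalid URI: %r not allowed in options" % key)
    return options
```

Option names are compared as given, without case normalization, in line with 798c9f25.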

08 Nov, 2023 1 commit

master: fix crash when aborting early e.g. when failing to open listening socket · 9a3898e4

Julien Muchembled authored Nov 08, 2023

Pre-mortem data:
Traceback (most recent call last):
  File "neo/master/app.py", line 172, in run
    self._run()
  File "neo/master/app.py", line 180, in _run
    self.listening_conn = ListeningConnection(self, None, self.server)
  File "neo/lib/connection.py", line 298, in __init__
    connector.makeListeningConnection()
  File "neo/lib/connector.py", line 133, in makeListeningConnection
    self._error('listen', e)
  File "neo/lib/connector.py", line 93, in _error
    raise ConnectorException
ConnectorException
Traceback (most recent call last):
  File "neomaster", line 50, in <module>
    sys.exit(neo.scripts.neomaster.main())
  File "neo/scripts/neomaster.py", line 31, in main
    app.run()
  File "neo/master/app.py", line 175, in run
    self.log()
  File "neo/master/app.py", line 167, in log
    if self.pt is not None:
AttributeError: 'Application' object has no attribute 'pt'

9a3898e4
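The second traceback is a classic case of an error handler reading an attribute that is only set later during startup. One common way to avoid it is a class-level default, as in this toy reproduction. It is not necessarily how 9a3898e4 fixes the problem, and everything except the `pt` attribute and the `run`/`_run`/`log` names is made up for the example.

```python
class Application(object):
    pt = None  # class-level default: safe to read before _run() sets it

    def __init__(self, fail_early):
        self.fail_early = fail_early
        self.logged = []

    def run(self):
        try:
            self._run()
        except Exception:
            self.log()  # must not fail, or it hides the original error
            raise

    def _run(self):
        if self.fail_early:
            raise RuntimeError("cannot open listening socket")
        self.pt = object()  # stand-in for the partition table

    def log(self):
        if self.pt is not None:
            self.logged.append("partition table")
```

With the default in place, an early failure propagates the original exception instead of an `AttributeError` raised from `log()`.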

16 Oct, 2023 5 commits

neoctl: fix exit status code if not ready · b6f821a2

Julien Muchembled authored Oct 10, 2023

b6f821a2

neoctl: do not wait forever if master disconnects · d112bfbd

Julien Muchembled authored Oct 10, 2023

d112bfbd

master: if upstream unset, reject request to backup rather than crashing · 57956ec9

Julien Muchembled authored Oct 10, 2023

57956ec9

Bump protocol version · 0fc95175

Julien Muchembled authored Oct 13, 2023

0fc95175

Reimplement pack in a scalable way, partial pack & approval/reject of pack orders · 4c3b6c4d

Julien Muchembled authored Sep 03, 2020

This is still pack without garbage collection, and without deleting
any transaction metadata ('trans' table).

Partial pack means that the client can take a list of oids: only these
oids will be packed. No API is defined yet at IStorage level.

Storage nodes pack in background, independently from other storage
nodes, partition by partition, and calling IStorage.pack() returns
immediately (though internally, NEO does have a mechanism to wait
until it's done, which can be required for some ZODB unit tests).

This new implementation also introduces the concept of signing pack
orders. The idea is that calling IStorage.pack() only records a pack
order in the database, which can be reviewed/approved/rejected using
a UI that remains to be done. For the moment, pack orders are
automatically approved (by the master).

Internally, pack orders are stored as extra metadata of a transaction.
IOW, IStorage.pack() implies the commit of an (empty) transaction.

IStorage.pack() can be called without waiting for the previous one
to be completed. Pack orders are processed in the same order as they are
requested:
- an unsigned pack order blocks the processing of any newer pack order;
- rejected pack orders are ignored.
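
The two ordering rules above can be expressed in a few lines. This is only a sketch: the representation of a pack order as a `(tid, approved)` pair, with `None` meaning "not signed yet", is an assumption made for the example.

```python
def processable_pack_orders(orders):
    """Return the TIDs of pack orders that may be processed, in order.

    `orders` is a list of (tid, approved) pairs sorted by tid, where
    `approved` is True (approved), False (rejected) or None (unsigned).
    """
    ready = []
    for tid, approved in orders:
        if approved is None:
            break  # an unsigned order blocks every newer one
        if approved:
            ready.append(tid)
        # rejected orders are simply skipped
    return ready
```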

Approving a pack order also triggers pack on backup clusters.
That's the simplest way to have everything consistent.
Maybe later we could identify scenarios where it would be ok
to unsign pack orders during asynchronous replication.

The feature to check replicas is marked as experimental because it is
not aware of differences that can happen during pack operations.
_______________________________________________________________________

About concurrency within the storage node, a first implementation
extended what was done to delete partitions in background (see
previous commit). But here, the job can't be easily split into slices
that are never too big:
- it's simpler to never split the processing of an oid but this can
  freeze the application for a long time when packing an oid that was
  modified many times (e.g. 30 min for an oid with 20 million
  historical records);
- then an attempt to process an oid in several passes was
  inefficient, maybe due to a limit in RocksDB (packing the oid in the
  above example would take days during which NEO is significantly
  slower).

So background database jobs were moved to a separate thread, using a
separate connection to the underlying database. This is obviously
only useful for the MySQL backend. In order to share as much code as
possible between backends, SQLite also does the work in a separate
thread but sharing the main connection instead of opening a separate
one (so this backend would not be suited to the above example).

But deleting raw data with a secondary connection is not possible
without fsyncing too often (or transaction isolation issues...): these
deletions are deferred by recording them in a new table, which is
processed later with the main connection. This is not so bad because
the actual deletion of raw data is usually more efficient this way
(more sequential IO).

Here are a few numbers:
- without load: 10h45 (12h for the first reimplementation)
- with a load that normally takes 6h58:
  - load: 7h33 (so 8.4% slower)
  - pack: 15h36 (+4h51)
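
The derived figures in this list can be rechecked from the raw durations; the helper name below is only for the example.

```python
def minutes(hours, mins):
    """Convert an XhYY duration to minutes."""
    return hours * 60 + mins

load_alone = minutes(6, 58)        # load without pack
load_with_pack = minutes(7, 33)    # same load while packing
pack_alone = minutes(10, 45)       # pack without load
pack_with_load = minutes(15, 36)   # pack under load

# Relative slowdown of the load, in percent, rounded to one decimal.
slowdown = round(100.0 * (load_with_pack - load_alone) / load_alone, 1)

# Extra pack time under load, as (hours, minutes).
extra = divmod(pack_with_load - pack_alone, 60)
```

Here `slowdown` evaluates to 8.4 and `extra` to `(4, 51)`, matching the 8.4% and +4h51 quoted above.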

As explained above, the pack of a partition is split in 2 steps:
- the longest one (here 78% without load) should have negligible
  performance impact on the application because the work is done in a
  separate thread with a secondary connection, and also with something
  to minimize GIL impact by prioritizing the main thread;
- the shortest one (22%) to process the deferred deletions,
  with even lower priority than replication: it tries to split
  the work in tasks that take ~10ms.
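
One way to split deferred deletions into tasks of roughly 10ms is a generator with a time budget per slice, as sketched here. This is an illustration only: the function name, the `pending` list and the `delete` callback are hypothetical, and the real scheduling relative to replication is not shown.

```python
import time

def process_deferred(pending, delete, budget=0.010):
    """Delete items from `pending` in slices of about `budget` seconds.

    Yields the number of remaining items after each slice, so that the
    caller can serve higher-priority work between two slices.
    """
    while pending:
        deadline = time.monotonic() + budget
        while True:
            delete(pending.pop())  # at least one item per slice
            if not pending or time.monotonic() >= deadline:
                break
        yield len(pending)
```

A caller would typically advance the generator by one step whenever the event loop has nothing more urgent to do.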

4c3b6c4d

11 Oct, 2023 1 commit

storage: delete partitions in a scalable way · 3204a4c6

Julien Muchembled authored May 09, 2017

This is implemented using the same concurrency mechanism as for the
replication: the work is split in slices that should be small enough
to avoid slowing down network requests significantly.

3204a4c6

04 Apr, 2023 4 commits
- undo: code clean-up · fd95a217
  Julien Muchembled authored Mar 16, 2021
```
undone_data_tid can't be equal to a TTID.
```
  fd95a217
- mysql: drop support for horizontal partitioning of trans/obj · 8535b9cc
  Julien Muchembled authored May 09, 2017
```
It has never been enabled and the code to drop partitions will be
changed in a way that only 'trans' may still benefit from partitioning.
We'll see in the future if we have cases where 'trans' is too big to
delete all rows (of a given partition) in a single query.
```
  8535b9cc
- debug: add an example to profile with yappi · fd87e153
  Julien Muchembled authored Mar 27, 2023
  
  fd87e153
- Fix signals not always being processed as soon as possible · 0e43dd1f
  Julien Muchembled authored Mar 21, 2023
  
  0e43dd1f