neo/storage/database/sqlite.py · f4dd4bab839a5f5bf69f74fcf5a34e21f337a1fc · nexedi / neoppod

storage: optimize storage layout of raw data for replication · f4dd4bab
Julien Muchembled authored Nov 23, 2017
# Previous status

The issue was that we had extreme storage fragmentation from the point of view
of the replication algorithm, which processes one partition at a time.

By using an autoincrement for the 'data' table, rows were ordered by the time
at which they were added:
- parts may be the result of replication -> ordered by partition, tid, oid
- other rows are globally sorted by tid

Which means that when scanning a given partition, many rows were skipped all
the time:
- if readahead is big enough, the efficiency is 1/N for a node with N
  partitions assigned
- else, it is worse because it seeks all the time
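
The 1/N figure can be illustrated with a toy model. This is only a sketch under simplifying assumptions that are not in the commit: rows are stored strictly in tid order, tids are spread round-robin over the partitions, and readahead is large enough that every row between the first and last wanted row is read. The function name `scan_efficiency` is hypothetical.

```python
def scan_efficiency(num_partitions, rows_per_partition):
    """Fraction of scanned rows that belong to the replicated partition.

    Toy model: row i (in storage order) belongs to partition
    i % num_partitions, which mimics rows globally sorted by tid.
    """
    total_rows = num_partitions * rows_per_partition
    layout = [i % num_partitions for i in range(total_rows)]
    wanted = 0  # replicate partition 0, any other gives the same ratio
    positions = [pos for pos, p in enumerate(layout) if p == wanted]
    # With big readahead, everything between the first and the last
    # wanted row is read from disk, wanted or not.
    scanned = positions[-1] - positions[0] + 1
    return len(positions) / scanned
```

With 12 partitions, only about 1 row out of 12 read from disk is useful to the replication of a given partition.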

For huge databases, the replication was horribly slow, in particular from HDD.

# Chosen solution

This commit changes how ids are generated so that 'data' is effectively split
per partition. The backend tracks 1 last id per assigned partition, where the
16 higher bits contain the partition. Keep in mind that the value of id has no
meaning and is only chosen for performance reasons. IOW, a row can be
referred to by an oid of a partition different from the 16 higher bits of id:
- there's no migration needed and the 16 higher bits of all existing rows are 0
- in case of deduplication, a row can still be shared by different partitions
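
The id scheme can be sketched as follows. This is an illustration, not the code of this commit: it assumes ids are 64-bit integers (so the partition is shifted by 48 bits), and `PARTITION_SHIFT`, `DataIdAllocator` and `partition_of_id` are hypothetical names.

```python
PARTITION_SHIFT = 48  # assumption: 64-bit ids, 16 higher bits = partition


class DataIdAllocator:
    """Tracks 1 last id per assigned partition."""

    def __init__(self, assigned_partitions):
        # Start just below the first id of each partition's range.
        # A real backend would load the actual last ids from the database.
        self.last_ids = {p: (p << PARTITION_SHIFT) - 1
                         for p in assigned_partitions}

    def next_id(self, partition):
        # Ids of a given partition are consecutive, so its rows end up
        # stored next to each other.
        self.last_ids[partition] += 1
        return self.last_ids[partition]


def partition_of_id(data_id):
    # Only a storage hint: oids of other partitions may refer to this row.
    return data_id >> PARTITION_SHIFT
```

Rows that existed before the change keep their ids, which all fall in the range of partition 0 under this scheme.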

Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement
on existing databases.

## Downsides

Insertion now slows down significantly as the number of partitions increases:
for 2 nodes using TokuDB, 4% for 180 partitions, 40% for 2000. For 12
partitions, the difference remains negligible. The solution for this issue will
be to make it possible to increase the number of partitions efficiently, so
that nodes can keep a small number of them, even for DBs that are expected to
grow so much that many nodes are added over time: such a feature was already
considered so that users no longer have to worry about this obscure setting at
database creation.

Read performance is only slowed down for applications that read a lot of data
that were written contiguously, but split in small blocks. A solution is to
extend ZODB so that the application tells it to choose new oids that will end up
in the same partition. Like for insertion, there should not be too many
partitions.

With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to
collect all last ids at startup when there are many partitions.
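
Collecting the last ids at startup amounts to one range query per assigned partition, which is why the cost grows with the number of partitions. The sketch below uses SQLite only because it is self-contained. It is not the code of any backend, and it assumes a 'data' table with an integer 'id' column and the 48-bit shift of 64-bit ids; `collect_last_ids` is a hypothetical name.

```python
import sqlite3

PARTITION_SHIFT = 48  # assumption: 64-bit ids, 16 higher bits = partition


def collect_last_ids(conn, assigned_partitions):
    """Return {partition: last id} using 1 query per assigned partition."""
    last_ids = {}
    for p in assigned_partitions:
        low = p << PARTITION_SHIFT
        high = ((p + 1) << PARTITION_SHIFT) - 1  # inclusive upper bound
        (last,) = conn.execute(
            "SELECT MAX(id) FROM data WHERE id BETWEEN ? AND ?",
            (low, high)).fetchone()
        # No row yet for this partition: start just below its range.
        last_ids[p] = low - 1 if last is None else last
    return last_ids
```

With an index on id (here the primary key), each query only has to look at the end of a range, but there are still as many queries as assigned partitions.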

## Other advantages

- The storage layout of data is now always the same and does not depend on
  whether rows came from replication or commits.
- Efficient deletion of a partition to free space in-place will be possible.

# Considered alternative

The only serious alternative was to replicate as many partitions as possible at
the same time, ideally all assigned partitions, but it's not always possible.
For best performance, it would often require synchronizing new nodes, or even
all of them, so that the source nodes don't have to scan 'data' several times.

If existing nodes are kept, all data that aren't copied to the newly added
nodes have to be skipped. If the number of nodes is multiplied by N, the
efficiency is 1-1/N at best (synchronized nodes), else it's even worse
because partitions are somehow shuffled.

Checking/replacing a single node would remain slow when there are several
source nodes.

Lastly, such an algorithm would be much more complex and we would not have the
other advantages listed above.