• In the Importer storage backend, the repickler code never really worked with
    ZODB 5 (which uses pickle protocol > 1), and the test no longer passes.
    
    The other issues caused by ZODB commit 12ee41c47310156027a674932df34b60de86ba36
    are fixed:
    
      TypeError: list indices must be integers, not binary
    
      ValueError: unsupported pickle protocol: 3
    
    Although not necessary as long as we don't support Python 3,
    this commit also replaces `str` with `bytes` in a few places.
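
    For illustration only (this is not NEO code), the incompatibility boils
    down to the PROTO opcode that protocol >= 2 pickles start with, which a
    repickler written for protocol <= 1 streams does not expect and which
    Python 2's unpickler rejects for protocol 3:

      import pickle

      p1 = pickle.dumps((b'oid', None), protocol=1)  # what the repickler expects
      p3 = pickle.dumps((b'oid', None), protocol=3)  # what ZODB 5 may produce
      # A protocol >= 2 stream starts with the PROTO opcode and the version:
      assert p3[:2] == b'\x80\x03'
      # On Python 2, pickle.loads(p3) raises:
      #   ValueError: unsupported pickle protocol: 3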
    by Julien Muchembled
  • When importing a FileStorage DB without interruption and without having to
    serve client nodes, the index built by speedupFileStorageTxnLookup is useless.
    This happens, for example, in simulation tests; on a DB with many oids,
    building the index can take a lot of time and memory for nothing.
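
    A minimal sketch of the resulting decision, with assumed names (the actual
    condition in the Importer backend may differ):

      def need_txn_lookup_index(resumable, serves_clients):
          # The index only pays off if the import may be interrupted or if
          # client nodes must be served while importing; otherwise building
          # it is wasted time and memory.
          return resumable or serves_clients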
    by Julien Muchembled
  • This is a follow-up of commit 2ca7c335,
    which changed 'tweak' not to discard readable cells too quickly.
    
    The scenario of a storage node being lost while it still has feeding cells
    was overlooked. Such cells must be discarded immediately, otherwise we end
    up with more up-to-date cells than wanted. Without the change in outdate()
    (sketched below), testSafeTweak would end with: UU.|U.U|UUU
    
    Once replication is optimized not to always restart checking cells from the
    beginning:
    - Remembering that an out-of-date cell was feeding could be a safer
      option, but it may not be worth the extra complexity.
    - Another possibility may be to replace the FEEDING state with an automatic
      partial tweak that only discards excess up-to-date cells whenever a cell
      becomes up-to-date.
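
    A hypothetical sketch of the change in outdate(), over a simplified model
    of the partition table (the real API differs):

      OUT_OF_DATE, UP_TO_DATE, FEEDING = range(3)  # simplified cell states

      def outdate_cells(cells, lost_node):
          # cells: list of (node, state) pairs for one partition.
          result = []
          for node, state in cells:
              if node is lost_node:
                  if state == FEEDING:
                      continue  # feeding cells of a lost node: discard at once
                  state = OUT_OF_DATE  # readable cells are merely outdated
              result.append((node, state))
          return result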
    by Julien Muchembled
  • For records that undo object creation, None values are used at the backend
    level, whereas the protocol is not designed to serialize None for any field.

    Therefore, a dance is done in many places around packet serialization,
    using the specific 0/ZERO_HASH/'' triplet to represent a deleted oid. For
    replication, it was missing at the sender side, leading to the following
    crash (a sketch of the fix follows the traceback):
    
      Traceback (most recent call last):
        File "neo/storage/app.py", line 147, in run
          self._run()
        File "neo/storage/app.py", line 178, in _run
          self.doOperation()
        File "neo/storage/app.py", line 257, in doOperation
          next(task_queue[-1]) or task_queue.rotate()
        File "neo/storage/handlers/storage.py", line 271, in push
          conn.send(Packets.AddObject(oid, *object), msg_id)
        File "neo/lib/protocol.py", line 234, in __init__
          self._fmt.encode(buf.write, args)
        File "neo/lib/protocol.py", line 345, in encode
          return self._trace(self._encode, writer, items)
        File "neo/lib/protocol.py", line 334, in _trace
          return method(*args)
        File "neo/lib/protocol.py", line 367, in _encode
          item.encode(writer, value)
        File "neo/lib/protocol.py", line 345, in encode
          return self._trace(self._encode, writer, items)
        File "neo/lib/protocol.py", line 342, in _trace
          raise ParseError(self, trace)
      ParseError: at add_object/checksum:
        File "neo/lib/protocol.py", line 553, in _encode
          assert len(checksum) == 20, (len(checksum), checksum)
      TypeError: object of type 'NoneType' has no len()
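
    A minimal sketch of the sender-side fix, assuming hypothetical names around
    the actual serialization code:

      ZERO_HASH = '\0' * 20  # matches the 20-byte checksum assertion above

      def encode_undone_object(serial, compression, checksum, data, data_serial):
          # The backend stores None for records that undo object creation, but
          # the protocol cannot serialize None: map it to the 0/ZERO_HASH/''
          # triplet before sending, as already done on other code paths.
          if data is None:
              compression, checksum, data = 0, ZERO_HASH, ''
          return serial, compression, checksum, data, data_serial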
    by Julien Muchembled
  • Before, it waited for upstream activity until all partitions were touched.
    However, when upstream is idle, the backup cluster could remain stuck
    forever if it was interrupted while some cells were still late.
    by Julien Muchembled
  • The 'min_tid < new_tid' assertion failed when jumping to the past.
    by Julien Muchembled
  • Given that:
    - read locks are only taken by transactions (not replication)
    - in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions
      are synchronized up to different tids
    
    there was a race condition where the master node could reply to
    LastTransaction with a TID not yet replicated by all replicas, potentially
    causing such replicas to reply OidDoesNotExist or OidNotFound if a client
    asked them for data too early.
    
    IOW, even if the cluster does contain the data up to `getBackupTid(max)`,
    it is only readable by NEO clients up to `getBackupTid(min)` as long as the
    cluster is in BACKINGUP state.
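
    A sketch of the resulting master behavior, over a simplified stand-in for
    the partition table's getBackupTid (illustrative values):

      def getBackupTid(cell_tids, f=max):
          # cell_tids: per-cell TIDs up to which replication is done
          return f(cell_tids)

      cell_tids = [280, 300, 295]
      # While in BACKINGUP state, LastTransaction must be answered with
      # getBackupTid(min): only that TID is replicated everywhere.
      assert getBackupTid(cell_tids, min) == 280  # readable by all replicas
      assert getBackupTid(cell_tids) == 300       # exists, but not everywhere yet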
    by Julien Muchembled
  • # Previous status
    
    The issue was that we had extreme storage fragmentation from the point of view
    of the replication algorithm, which processes one partition at a time.
    
    By using an autoincrement for the 'data' table, rows were ordered by the time
    at which they were added:
    - parts may be the result of replication -> ordered by partition, tid, oid
    - other rows are globally sorted by tid
    
    This means that when scanning a given partition, many rows were skipped all
    the time:
    - if readahead is big enough, the efficiency is 1/N for a node with N
      partitions assigned
    - otherwise, it is even worse because it seeks all the time
    
    For huge databases, replication was horribly slow, in particular on HDDs.
    
    # Chosen solution
    
    This commit changes how ids are generated to split 'data' per partition,
    as sketched below. The backend tracks one last id per assigned partition,
    storing the partition number in the 16 highest bits of the id. Keep in mind
    that the value of an id has no meaning and is only chosen for performance
    reasons. IOW, a row can be referred to by an oid of a partition different
    from the one in the 16 highest bits of its id:
    - there's no migration needed and the 16 highest bits of all existing rows
      are 0
    - in case of deduplication, a row can still be shared by different partitions
    
    Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement
    on existing databases.
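
    A sketch of the allocation scheme, assuming 64-bit ids with the partition
    number in the 16 highest bits (names are illustrative):

      PARTITION_BITS = 16
      ID_SHIFT = 64 - PARTITION_BITS  # 48

      def new_data_id(last_ids, partition):
          # last_ids: dict mapping each assigned partition to its last id.
          # The partition only seeds the high bits; the id value itself
          # carries no meaning, so existing rows (high bits 0) need no
          # migration.
          i = last_ids.get(partition, partition << ID_SHIFT) + 1
          last_ids[partition] = i
          return i

      last_ids = {}
      assert new_data_id(last_ids, 1) == (1 << 48) + 1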
    
    ## Downsides
    
    On insertion, increasing the number of partitions now slows things down
    significantly: for 2 nodes using TokuDB, by 4% with 180 partitions and by
    40% with 2000. With 12 partitions, the difference remains negligible. The
    solution to this issue will be the ability to increase the number of
    partitions efficiently, so that nodes can keep a small number of them, even
    for DBs that are expected to grow so much that many nodes are added over
    time: such a feature was already considered, so that users no longer have
    to worry about this obscure setting at database creation.
    
    Read performance is only degraded for applications that read a lot of data
    that was written contiguously but split into small blocks. A solution is to
    extend ZODB so that the application can tell it to choose new oids that
    will end up in the same partition (see the sketch after this paragraph).
    As with insertion, there should not be too many partitions.
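
    For reference, an oid maps to a partition by its integer value, so choosing
    oids that are equal modulo the number of partitions keeps related objects
    together (a sketch, assuming NEO's usual u64 mapping):

      from struct import unpack

      def partition_of(oid, num_partitions):
          # oid is an 8-byte string; it belongs to u64(oid) % num_partitions
          return unpack('!Q', oid)[0] % num_partitions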
    
    With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to
    collect all last ids at startup when there are many partitions.
    
    ## Other advantages
    
    - The storage layout of data is now always the same and does not depend on
      whether rows came from replication or commits.
    - Efficient deletion of a partition to free space in place will be possible.
    
    # Considered alternative
    
    The only serious alternative was to replicate as many partitions as
    possible at the same time, ideally all assigned partitions, but that is not
    always possible. For best performance, it would often require synchronizing
    new nodes, or even all of them, so that the source nodes don't have to scan
    'data' several times.
    
    If existing nodes are kept, all data that are not copied to the newly added
    nodes have to be skipped. If the number of nodes is multiplied by N, the
    efficiency is 1-1/N at best (synchronized nodes); otherwise it's even worse
    because partitions are somehow shuffled.
    
    Checking/replacing a single node would remain slow when there are several
    source nodes.
    
    Lastly, such an algorithm would be much more complex, and we would not get
    the other advantages listed above.
    by Julien Muchembled