1. 16 May, 2018 5 commits
    • Serialize empty transaction extension with an empty string · a6d4c4e9
      Julien Muchembled authored
      The protocol version is increased to ensure that client nodes are able to
      handle an empty 'extension' field in AnswerTransactionInformation.
      
      It also means that once new transactions are written, going back to a previous
      revision is not possible.
    • client: fix partial import from a source storage · 346c9d00
      Julien Muchembled authored
      The correct way to specify a start/stop tid is when constructing the 'source'
      object, hence the removal of the start/stop args. In fact, source.iterator()
      does not always take such args.
      
      On the other hand, when resuming an import, Application.importFrom must cope
      with an incomplete preindex.
    • qa: give a title to subprocesses of functional tests · b648904b
      Julien Muchembled authored
      Same as the previous commit: this is only cosmetic, so it is optional.
    • importer: give a title to the 'import' and 'writeback' subprocesses · 461df152
      Julien Muchembled authored
      'title' means both process name and command line.
      
      This is only cosmetic, so it won't fail if the 'setproctitle' module
      is not available.
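
      A minimal sketch of such an optional dependency, assuming the usual
      try/except import pattern; the `set_title` wrapper is a hypothetical name
      used only for illustration, not the helper used in the commit.

```python
try:
    # Third-party module; it may legitimately be missing.
    from setproctitle import setproctitle
except ImportError:
    def setproctitle(title):
        """Fallback: silently do nothing when 'setproctitle' is unavailable."""


def set_title(title):
    # Hypothetical wrapper: changes both the process name and the command
    # line when the module is installed, and is a no-op otherwise.
    setproctitle(title)
```

      A subprocess would then call `set_title(...)` once at startup.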
    • importer: fetch and process the data to import in a separate process · 05bf48de
      Julien Muchembled authored
      A new subprocess is used to:
      - fetch data from the source DB
      - repickle to change oids (when merging several DB)
      - compress
      - checksum
      
      This is mostly useful for the second step, which is much slower than any
      other step and does not release the GIL.
      
      By using a second CPU core, it is also often possible to use a better
      compression algorithm for free (e.g. zlib=9). In fact, smaller data can speed
      up the writing process.
      
      In addition to greatly speeding up the import by parallelizing fetch+process
      with write, it also makes the main process more responsive to queries from
      client nodes.
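
      The split between a processing subprocess and a writing main process can be
      sketched as follows. This is an illustration only, not the importer's code:
      `process_record`, `_worker` and `import_all` are hypothetical names, the
      repickling step is omitted, and checksumming the stored payload with SHA-1
      is an assumption.

```python
import hashlib
import multiprocessing
import zlib


def process_record(oid, data, level=9):
    """CPU-bound part: compress and checksum a single record."""
    compressed = zlib.compress(data, level)
    # Keep the compressed form only when it is actually smaller.
    if len(compressed) < len(data):
        compression, payload = 1, compressed
    else:
        compression, payload = 0, data
    checksum = hashlib.sha1(payload).digest()  # 20 bytes
    return oid, compression, checksum, payload


def _worker(records, queue):
    # Runs in the subprocess, so it does not hold the GIL of the main process.
    for oid, data in records:
        queue.put(process_record(oid, data))
    queue.put(None)  # sentinel: no more records


def import_all(records, write):
    """Main process: only writes, while the subprocess prepares the data."""
    queue = multiprocessing.Queue(maxsize=64)  # bounded, to limit memory use
    proc = multiprocessing.Process(target=_worker, args=(records, queue))
    proc.start()
    for item in iter(queue.get, None):
        write(*item)
    proc.join()
```

      With such a layout, `import_all(list_of_oid_data_pairs, write_callback)`
      overlaps the processing of later records with the writing of earlier ones.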
  2. 15 May, 2018 1 commit
    • importer: new option to write back new transactions to the source database · 30a02bdc
      Julien Muchembled authored
      By doing the work with secondary connections to the underlying databases,
      asynchronously and in a separate process, this should have minimal impact on
      the performance of the storage node. Extra complexity comes from backends that
      may lose connection to the database (here MySQL): this commit fully implements
      reconnection.
  3. 11 May, 2018 3 commits
  4. 07 May, 2018 4 commits
  5. 18 Apr, 2018 3 commits
  6. 16 Apr, 2018 3 commits
    • Fix a few issues with ZODB5 · 1316c225
      Julien Muchembled authored
      In the Importer storage backend, the repickler code never really worked with
      ZODB 5 (use of protocol > 1), and now the test does not pass anymore.
      
      The other issues caused by ZODB commit 12ee41c47310156027a674932df34b60de86ba36
      are fixed:
      
        TypeError: list indices must be integers, not binary
      
        ValueError: unsupported pickle protocol: 3
      
      Although not necessary as long as we don't support Python 3,
      this commit also replaces `str` by `bytes` in a few places.
    • importer: do not trigger speedupFileStorageTxnLookup uselessly · 3bcac6d3
      Julien Muchembled authored
      When importing a FileStorage DB without interruption and without having to
      serve client nodes, the index built by speedupFileStorageTxnLookup is useless.
      This happens when doing simulation tests, and on a DB with many oids,
      building it can take a lot of time and memory for nothing.
  7. 13 Apr, 2018 2 commits
  8. 12 Apr, 2018 2 commits
  9. 10 Apr, 2018 1 commit
  10. 29 Mar, 2018 2 commits
    • master: automatically discard feeding cells that get out-of-date · 3efbbfe3
      Julien Muchembled authored
      This is a follow-up of commit 2ca7c335,
      which changed 'tweak' not to discard readable cells too quickly.
      
      The scenario of a storage being lost while it has feeding cells was forgotten.
      These must be discarded immediately, otherwise we end up with more up-to-date
      cells than wanted. Without the change in outdate(), testSafeTweak would end
      with: UU.|U.U|UUU
      
      Once replication is optimized not to always restart checking cells from the
      beginning:
      - Remembering that an out-of-date cell was feeding could be a safer
        option, but it may not be worth the extra complexity.
      - Another possibility may be to replace the FEEDING state by an automatic
        partial tweak that only discards surplus up-to-date cells whenever a cell
        becomes up-to-date.
    • 3443d483
  11. 20 Mar, 2018 2 commits
  12. 14 Mar, 2018 1 commit
    • storage: fix replication of creation undone · c3343279
      Julien Muchembled authored
      For records that undo object creation, None values are used at the backend
      level whereas the protocol is not designed to serialize None for any field.
      
      Therefore, a dance is done in many places around packet serialization, using
      the specific 0/ZERO_HASH/'' triplet to represent a deleted oid. For replication,
      it was missing on the sender side, leading to the following crash:
      
        Traceback (most recent call last):
          File "neo/storage/app.py", line 147, in run
            self._run()
          File "neo/storage/app.py", line 178, in _run
            self.doOperation()
          File "neo/storage/app.py", line 257, in doOperation
            next(task_queue[-1]) or task_queue.rotate()
          File "neo/storage/handlers/storage.py", line 271, in push
            conn.send(Packets.AddObject(oid, *object), msg_id)
          File "neo/lib/protocol.py", line 234, in __init__
            self._fmt.encode(buf.write, args)
          File "neo/lib/protocol.py", line 345, in encode
            return self._trace(self._encode, writer, items)
          File "neo/lib/protocol.py", line 334, in _trace
            return method(*args)
          File "neo/lib/protocol.py", line 367, in _encode
            item.encode(writer, value)
          File "neo/lib/protocol.py", line 345, in encode
            return self._trace(self._encode, writer, items)
          File "neo/lib/protocol.py", line 342, in _trace
            raise ParseError(self, trace)
        ParseError: at add_object/checksum:
          File "neo/lib/protocol.py", line 553, in _encode
            assert len(checksum) == 20, (len(checksum), checksum)
        TypeError: object of type 'NoneType' has no len()
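
      The conversion described above can be sketched as follows. This is a
      hypothetical illustration rather than the code of the fix: `to_wire` and
      `from_wire` are made-up names, and the 20-byte ZERO_HASH is inferred from
      the checksum length asserted in the traceback.

```python
ZERO_HASH = b"\0" * 20  # assumption: all-zero 20-byte checksum


def to_wire(compression, checksum, data):
    """Sender side: never pass None to the packet encoder."""
    if data is None:  # backend record that undoes an object creation
        return 0, ZERO_HASH, b""
    return compression, checksum, data


def from_wire(compression, checksum, data):
    """Receiver side: map the special triplet back to None values."""
    # A real record with empty data has a non-zero checksum, so the
    # triplet below cannot be confused with actual content.
    if (compression, checksum, data) == (0, ZERO_HASH, b""):
        return None, None, None
    return compression, checksum, data
```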
  13. 13 Mar, 2018 1 commit
  14. 02 Mar, 2018 3 commits
    • master: fix resumption of backup replication (internal or not) · 27229793
      Julien Muchembled authored
      Before, it waited for upstream activity until all partitions were touched.
      However, when upstream is idle, the backup cluster could remain stuck forever
      if it was interrupted while some cells were still late.
    • master: fix/simplify generation of TID · 7b2e6752
      Julien Muchembled authored
      The 'min_tid < new_tid' assertion failed when jumping to the past.
    • master: fix possible failure when reading data in a backup cluster with replicas · ca2f7061
      Julien Muchembled authored
      Given that:
      - read locks are only taken by transactions (not replication)
      - in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions
        are synchronized up to different tids
      
      there was a race condition with the master node replying to LastTransaction
      with a TID that may not be replicated yet by all replicas, potentially causing
      such replicas to reply OidDoesNotExist or OidNotFound if a client asks them
      for data too early.
      
      IOW, even if the cluster does contain the data up to `getBackupTid(max)`,
      it is only readable by NEO clients up to `getBackupTid(min)` as long as the
      cluster is in BACKINGUP state.
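
      As a small illustration of the min/max distinction, assuming a plain list of
      per-replica backup tids (the function names are hypothetical and this is not
      the master's actual data structure):

```python
def last_stored_tid(backup_tids):
    """Highest tid present somewhere in the backup cluster."""
    return max(backup_tids)


def last_readable_tid(backup_tids):
    """Highest tid that every replica is guaranteed to have replicated."""
    return min(backup_tids)
```

      With tids `[30, 10, 20]`, data exists up to 30 but clients should only be
      told about 10, since any replica may be asked.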
  15. 17 Jan, 2018 1 commit
  16. 11 Jan, 2018 1 commit
  17. 08 Jan, 2018 1 commit
    • storage: optimize storage layout of raw data for replication · f4dd4bab
      Julien Muchembled authored
      # Previous status
      
      The issue was that we had extreme storage fragmentation from the point of view
      of the replication algorithm, which processes one partition at a time.
      
      By using an autoincrement for the 'data' table, rows were ordered by the time
      at which they were added:
      - parts may be the result of replication -> ordered by partition, tid, oid
      - other rows are globally sorted by tid
      
      Which means that when scanning a given partition, many rows were skipped all
      the time:
      - if readahead is big enough, the efficiency is 1/N for a node with N
        partitions assigned
      - else, it is worse because it seeks all the time
      
      For huge databases, the replication was horribly slow, in particular from HDD.
      
      # Chosen solution
      
      This commit changes how ids are generated to somehow split 'data'
      per partition. The backend tracks 1 last id per assigned partition, where the
      16 highest bits contain the partition. Keep in mind that the value of id has no
      meaning and it's only chosen for performance reasons. IOW, a row can be
      referred to by an oid of a partition different from the 16 highest bits of id:
      - there's no migration needed and the 16 higher bits of all existing rows are 0
      - in case of deduplication, a row can still be shared by different partitions
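
      A minimal sketch of this id scheme, assuming a 64-bit id whose 16 highest
      bits hold the partition (hence a shift of 48); the class and function names
      are hypothetical and the real backend does this in SQL:

```python
PARTITION_SHIFT = 48  # assumption: 64-bit id, partition in the 16 highest bits


class DataIdAllocator:
    """Tracks one last id per assigned partition."""

    def __init__(self, assigned_partitions):
        # Each partition gets its own id range, starting at partition << 48.
        self._last = {p: p << PARTITION_SHIFT for p in assigned_partitions}

    def next_id(self, partition):
        self._last[partition] += 1
        return self._last[partition]


def partition_hint(data_id):
    """Partition encoded in an id: a locality hint only, with no meaning."""
    return data_id >> PARTITION_SHIFT
```

      Rows created before the change keep ids whose 16 highest bits are 0, which
      is why the encoded value can only be treated as a hint.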
      
      Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement
      on existing databases.
      
      ## Downsides
      
      On insertion, increasing the number of partitions now slows things down
      significantly: for 2 nodes using TokuDB, 4% for 180 partitions, 40% for 2000.
      For 12 partitions, the difference remains negligible. The solution for this
      issue will be to make it possible to increase the number of partitions
      efficiently, so that nodes can keep a small number of them, even for DBs that
      are expected to grow so much that many nodes are added over time: such a
      feature was already considered so that users don't have to worry anymore about
      this obscure setting at database creation.
      
      Read performance is only slowed down for applications that read a lot of data
      that were written contiguously, but split in small blocks. A solution is to
      extend ZODB so that the application tells it to choose new oids that will end up
      in the same partition. Like for insertion, there should not be too many
      partitions.
      
      With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to
      collect all last ids at startup when there are many partitions.
      
      ## Other advantages
      
      - The storage layout of data is now always the same and does not depend on
        whether rows came from replication or commits.
      - Efficient deletion of partition to free space in-place will be possible.
      
      # Considered alternative
      
      The only serious alternative was to replicate as many partitions as possible at
      the same time, ideally all assigned partitions, but it's not always possible.
      For best performance, it would often require synchronizing new nodes, or even
      all of them, so that the source nodes don't have to scan 'data' several times.
      
      If existing nodes are kept, all data that aren't copied to the newly added
      nodes have to be skipped. If the number of nodes is multiplied by N, the
      efficiency is 1-1/N at best (synchronized nodes), else it's even worse
      because partitions are somehow shuffled.
      
      Checking/replacing a single node would remain slow when there are several
      source nodes.
      
      Lastly, such an algorithm would be much more complex and we would not have the
      other advantages listed above.
  18. 05 Jan, 2018 4 commits