1. 02 Mar, 2018 1 commit
    • master: fix possible failure when reading data in a backup cluster with replicas · ca2f7061
      Julien Muchembled authored
      Given that:
      - read locks are only taken by transactions (not replication)
      - in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions
        are synchronized up to different tids
      
      there was a race condition: the master node could reply to LastTransaction
      with a TID that had not yet been replicated by all replicas, potentially
      causing such replicas to reply OidDoesNotExist or OidNotFound if a client
      asked them for data too early.
      
      IOW, even if the cluster does contain the data up to `getBackupTid(max)`,
      it is only readable by NEO clients up to `getBackupTid(min)` as long as the
      cluster is in BACKINGUP state.
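
The distinction between what the cluster contains and what clients may read can be sketched as follows. This is a minimal illustration, not the NEO implementation: `contained_up_to`, `readable_up_to` and the `cell_tids` list are hypothetical names standing in for the values behind `getBackupTid(max)` and `getBackupTid(min)`.

```python
def contained_up_to(cell_tids):
    """Highest TID held by at least one replica (hypothetical helper)."""
    return max(cell_tids)


def readable_up_to(cell_tids):
    """Highest TID that every replica can serve (hypothetical helper).

    While backing up, a client may be routed to the least advanced
    replica, so only this TID is safe to announce.
    """
    return min(cell_tids)


# Assumed example: three replicas synchronized up to different TIDs.
cell_tids = [105, 98, 105]
safe_tid = readable_up_to(cell_tids)
latest_tid = contained_up_to(cell_tids)
```

Announcing `latest_tid` instead of `safe_tid` is the race described above: the replica that is still at the lower TID cannot serve the newer data.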
  2. 17 Jan, 2018 1 commit
  3. 11 Jan, 2018 1 commit
  4. 08 Jan, 2018 1 commit
    • storage: optimize storage layout of raw data for replication · f4dd4bab
      Julien Muchembled authored
      # Previous status
      
      The issue was that we had extreme storage fragmentation from the point of view
      of the replication algorithm, which processes one partition at a time.
      
      By using an autoincrement for the 'data' table, rows were ordered by the time
      at which they were added:
      - parts may be the result of replication -> ordered by partition, tid, oid
      - other rows are globally sorted by tid
      
      This means that when scanning a given partition, many rows were skipped all
      the time:
      - if readahead is big enough, the efficiency is 1/N for a node with N
        partitions assigned
      - else, it is worse because it seeks all the time
      
      For huge databases, the replication was horribly slow, in particular from HDD.
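
The 1/N figure can be illustrated with a toy model. This is only a sketch under simplifying assumptions: rows are represented by their partition number in on-disk order, rows are spread round-robin over the partitions, and `scan_efficiency` is a hypothetical helper, not part of NEO.

```python
def scan_efficiency(layout, partition):
    """Fraction of the rows read that belong to `partition`.

    `layout` lists the partition of each row in on-disk order. The scan
    is assumed to read everything between the first and the last row of
    the requested partition.
    """
    positions = [i for i, p in enumerate(layout) if p == partition]
    if not positions:
        return 0.0
    span = positions[-1] - positions[0] + 1
    return len(positions) / span


N = 4
# Rows added by commits: globally sorted by tid, partitions interleaved.
interleaved = [i % N for i in range(400)]
# Rows grouped by partition, as a per-partition layout would store them.
clustered = sorted(interleaved)

old_efficiency = scan_efficiency(interleaved, 0)  # close to 1/N
new_efficiency = scan_efficiency(clustered, 0)    # every row read is useful
```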
      
      # Chosen solution
      
      This commit changes how ids are generated to somehow split 'data'
      per partition. The backend tracks 1 last id per assigned partition, where the
      16 highest bits contain the partition. Keep in mind that the value of id has
      no meaning and it's only chosen for performance reasons. IOW, a row can be
      referred to by an oid whose partition differs from the 16 highest bits of id:
      - there's no migration needed and the 16 highest bits of all existing rows are 0
      - in case of deduplication, a row can still be shared by different partitions
      
      Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement
      on existing databases.
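
The id scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the backend code: it assumes a 64-bit id (so the 16 highest bits start at bit 48), and `DataIdAllocator` and `partition_hint` are hypothetical names.

```python
PARTITION_SHIFT = 48  # assumption: 64-bit id, 16 highest bits = partition


class DataIdAllocator:
    """Track one last id per assigned partition (hypothetical sketch)."""

    def __init__(self, assigned_partitions, last_ids=None):
        last_ids = last_ids or {}
        self._last = {}
        for partition in assigned_partitions:
            assert 0 <= partition < 1 << 16, "partition must fit in 16 bits"
            # Resume after the last known id, else start at the partition base.
            self._last[partition] = last_ids.get(
                partition, partition << PARTITION_SHIFT)

    def next_id(self, partition):
        self._last[partition] += 1
        return self._last[partition]


def partition_hint(data_id):
    """Partition encoded in the id. It is only a placement hint: rows
    created before this change all report 0, and deduplicated rows may be
    shared by other partitions."""
    return data_id >> PARTITION_SHIFT


allocator = DataIdAllocator([0, 3], last_ids={0: 1000})
new_id = allocator.next_id(3)     # first id in the range of partition 3
legacy_id = allocator.next_id(0)  # continues after the existing rows
```

Because the partition is in the most significant bits, rows of one partition are contiguous in primary-key order, which is what lets replication scan a single partition without skipping the others.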
      
      ## Downsides
      
      Insertion now slows down significantly as the number of partitions increases:
      for 2 nodes using TokuDB, by 4% for 180 partitions and 40% for 2000. For 12
      partitions, the difference remains negligible. The solution to this issue will
      be to make it possible to increase the number of partitions efficiently, so
      that nodes can keep a small number of them, even for DBs that are expected to
      grow so much that many nodes are added over time: such a feature was already
      considered so that users no longer have to worry about this obscure setting at
      database creation.
      
      Read performance is only slowed down for applications that read a lot of data
      that were written contiguously, but split in small blocks. A solution is to
      extend ZODB so that the application can tell it to choose new oids that will
      end up in the same partition. As with insertion, there should not be too many
      partitions.
      
      With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to
      collect all last ids at startup when there are many partitions.
      
      ## Other advantages
      
      - The storage layout of data is now always the same and does not depend on
        whether rows came from replication or commits.
      - Efficient deletion of a partition, to free space in place, will be possible.
      
      # Considered alternative
      
      The only serious alternative was to replicate as many partitions as possible at
      the same time, ideally all assigned partitions, but it's not always possible.
      For best performance, it would often require synchronizing new nodes, or even
      all of them, so that the source nodes don't have to scan 'data' several times.
      
      If existing nodes are kept, all data that aren't copied to the newly added
      nodes have to be skipped. If the number of nodes is multiplied by N, the
      efficiency is 1-1/N at best (synchronized nodes), else it's even worse
      because partitions are somehow shuffled.
      
      Checking/replacing a single node would remain slow when there are several
      source nodes.
      
      Lastly, such an algorithm would be much more complex and we would not have the
      other advantages listed above.
  5. 05 Jan, 2018 6 commits
  6. 21 Dec, 2017 1 commit
  7. 15 Dec, 2017 3 commits
  8. 13 Dec, 2017 1 commit
  9. 11 Dec, 2017 3 commits
  10. 05 Dec, 2017 2 commits
  11. 04 Dec, 2017 1 commit
  12. 21 Nov, 2017 1 commit
    • client: bug found, add log to collect more information · a1082cbc
      Julien Muchembled authored
      INFO Z2 Log files reopened successfully
      INFO SignalHandler Caught signal SIGTERM
      INFO Z2 Shutting down fast
      INFO ZServer closing HTTP to new connections
      ERROR ZODB.Connection Couldn't load state for BTrees.LOBTree.LOBucket 0xc12e29
      Traceback (most recent call last):
        File "ZODB/Connection.py", line 909, in setstate
          self._setstate(obj, oid)
        File "ZODB/Connection.py", line 953, in _setstate
          p, serial = self._storage.load(oid, '')
        File "neo/client/Storage.py", line 81, in load
          return self.app.load(oid)[:2]
        File "neo/client/app.py", line 355, in load
          data, tid, next_tid, _ = self._loadFromStorage(oid, tid, before_tid)
        File "neo/client/app.py", line 387, in _loadFromStorage
          askStorage)
        File "neo/client/app.py", line 297, in _askStorageForRead
          self.sync()
        File "neo/client/app.py", line 898, in sync
          self._askPrimary(Packets.Ping())
        File "neo/client/app.py", line 163, in _askPrimary
          return self._ask(self._getMasterConnection(), packet,
        File "neo/client/app.py", line 177, in _getMasterConnection
          result = self.master_conn = self._connectToPrimaryNode()
        File "neo/client/app.py", line 202, in _connectToPrimaryNode
          index = (index + 1) % len(master_list)
      ZeroDivisionError: integer division or modulo by zero
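
The last frame of the traceback computes `(index + 1) % len(master_list)`, which raises `ZeroDivisionError` whenever `master_list` is empty. The sketch below only reproduces that failure mode with an explicit guard; `next_master_index` is a hypothetical helper, not a function of `neo/client/app.py`, and it is not the change made by this commit, which only adds logging.

```python
def next_master_index(index, master_list):
    """Return the index of the next master to try, wrapping around.

    Hypothetical helper: an empty list is reported explicitly instead of
    surfacing as a ZeroDivisionError from the modulo.
    """
    if not master_list:
        raise RuntimeError("no known master node to connect to")
    return (index + 1) % len(master_list)


masters = ["master1:2051", "master2:2051"]  # made-up addresses
wrapped = next_master_index(1, masters)     # wraps back to the first master
```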
  13. 19 Nov, 2017 1 commit
  14. 17 Nov, 2017 4 commits
  15. 15 Nov, 2017 1 commit
  16. 07 Nov, 2017 2 commits
  17. 27 Oct, 2017 1 commit
  18. 29 Sep, 2017 3 commits
  19. 11 Sep, 2017 1 commit
  20. 05 Sep, 2017 1 commit
  21. 28 Aug, 2017 1 commit
  22. 11 Jul, 2017 1 commit
  23. 04 Jul, 2017 2 commits