1. 30 May, 2018 1 commit
    • Julien Muchembled's avatar
      Optimize resumption of replication by starting from a greater TID · b3dd6973
      Julien Muchembled authored
      Although data that are already transferred aren't transferred again, checking
      that the data are there for a whole partition can still be a lot of work for
      big databases. This commit is a major performance improvement in that a storage
      node that gets disconnected for a short time now gets fully operational quite
      instantaneously because it only has to replicate the new data. Before, the time
      to recover depended on the size of the DB.
      
      For OUT_OF_DATE cells, the difficult part was that they are writable and
      can then contain holes, so we can't just take the last TID in trans/obj
      (we wrongly did that at the beginning, and then committed
      6b1f198f as a workaround). We solve that
      by storing up to where it was up-to-date: this value is initialized from
      the last TIDs in trans/obj when the state switches from UP_TO_DATE/FEEDING.
      
      There's actually one such OUT_OF_DATE TID per assigned cell (backends store
      these values in the 'pt' table). Otherwise, a cell that still has a lot to
      replicate would still cause all other cells to resume from the a very small
      TID, or even ZERO_TID; the worse case is when a new cell is assigned to a node
      (as a result of tweak).
      
      For UP_TO_DATE cells of a backup cluster, replication was resumed from the
      maximum TID at which all assigned cells are known to be fully replicated.
      Like for OUT_OF_DATE cells, the presence of a late cell could cause a lot of
      extra work for others, the worst case being when setting up a backup cluster
      (it always restarted from ZERO_TID as long as at least 1 cell was still empty).
      Because UP_TO_DATE cells are guaranteed to have no holes, there's no need to
      store extra information: we simply look at the last TIDs in trans/obj.
      We even handle trans & obj independently, to minimize the work in 1 table
      (i.e. trans since it's processed first) if the other is late (obj).
      
      There's a small change in the protocol so that OUT_OF_DATE enum value equals 0.
      This way, backends can store the OUT_OF_DATE TID (verbatim) in the same column
      as the cell state.
      
      Note about MySQL changes in commit ca58ccd7:
      what we did as a workaround is not one any more. Now, we do so much on Python
      side that it's unlikely we could reduce the number of queries using GROUP BY.
      We even stopped doing that for SQLite.
      b3dd6973
  2. 24 May, 2018 3 commits
  3. 08 Jan, 2018 1 commit
    • Julien Muchembled's avatar
      storage: optimize storage layout of raw data for replication · f4dd4bab
      Julien Muchembled authored
      # Previous status
      
      The issue was that we had extreme storage fragmentation from the point of view
      of the replication algorithm, which processes one partition at a time.
      
      By using an autoincrement for the 'data' table, rows were ordered by the time
      at which they were added:
      - parts may be the result of replication -> ordered by partition, tid, oid
      - other rows are globally sorted by tid
      
      Which means that when scanning a given partition, many rows were skipped all
      the time:
      - if readahead is bigger enough, the efficiency is 1/N for a node with N
        partitions assigned
      - else, it is worse because it seeks all the time
      
      For huge databases, the replication was horribly slow, in particular from HDD.
      
      # Chosen solution
      
      This commit changes how ids are generated to somehow split 'data'
      per partition. The backend tracks 1 last id per assigned partition, where the
      16 higher bits contains the partition. Keep in mind that the value of id has no
      meaning and it's only chosen for performance reasons. IOW, a row can be
      referred by an oid of a partition different than the 16 higher bits of id:
      - there's no migration needed and the 16 higher bits of all existing rows are 0
      - in case of deduplication, a row can still be shared by different partitions
      
      Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement
      on existing databases.
      
      ## Downsides
      
      On insertion, increasing the number of partitions now slows down significantly:
      for 2 nodes using TokuDB, 4% for 180 partitions, 40% for 2000. For 12
      partitions, the difference remains negligible. The solution for this issue will
      be to enable to increase the number of partitions efficiently, so that nodes
      can keep a small number of them, even for DB that are expected to grow so much
      that many nodes are added over time: such feature was already considered so
      that users don't have to worry anymore about this obscure setting at database
      creation.
      
      Read performance is only slowed down for applications that read a lot of data
      that were written contiguously, but split in small blocks. A solution is to
      extend ZODB so that the application tells it to chose new oids that will end up
      in the same partition. Like for insertion, there should not be too many
      partitions.
      
      With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to
      collect all last ids at startup when there are many partitions.
      
      ## Other advantages
      
      - The storage layout of data is now always the same and does not depend on
        whether rows came from replication or commits.
      - Efficient deletion of partition to free space in-place will be possible.
      
      # Considered alternative
      
      The only serious alternative was to replicate as many partitions as possible at
      the same time, ideally all assigned partitions, but it's not always possible.
      For best performance, it would often require to synchronize new nodes, or even
      all of them, so that thesource nodes don't have to scan 'data' several times.
      
      If existing nodes are kept, all data that aren't copied to the newly added
      nodes have to be skipped. If the number of nodes is multiplied by N, the
      efficiency is 1-1/N at best (synchronized nodes), else it's even worse
      because partitions are somehow shuffled.
      
      Checking/replacing a single node would remain slow when there are several
      source nodes.
      
      At last, such an algorithm would be much more complex and we would not have the
      other advantages listed above.
      f4dd4bab
  4. 05 Jan, 2018 4 commits
  5. 05 Dec, 2017 1 commit
  6. 17 Nov, 2017 1 commit
  7. 07 Nov, 2017 1 commit
  8. 12 Jun, 2017 1 commit
  9. 12 May, 2017 1 commit
  10. 04 May, 2017 1 commit
  11. 31 Mar, 2017 2 commits
  12. 21 Feb, 2017 1 commit
    • Julien Muchembled's avatar
      Implement deadlock avoidance · 092992db
      Julien Muchembled authored
      This is a first version with several optimizations possible:
      - improve EventQueue (or implement a specific queue) to minimize deadlocks
      - turn the RebaseObject packet into a notification
      
      Sorting oids could also be useful to reduce the probability of deadlocks,
      but that would never be enough to avoid them completely, even if there's a
      single storage. For example:
      
      1. C1 does a first store (x or y)
      2. C2 stores x and y; one is delayed
      3. C1 stores the other -> deadlock
         When solving the deadlock, the data of the first store may only
         exist on the storage.
      
      2 functional tests are removed because they're redundant,
      either with ZODB tests or with the new threaded tests.
      092992db
  13. 18 Jan, 2017 1 commit
  14. 26 Dec, 2016 3 commits
  15. 01 Dec, 2016 1 commit
  16. 04 Mar, 2016 2 commits
  17. 25 Jan, 2016 1 commit
  18. 13 Dec, 2015 1 commit
  19. 30 Nov, 2015 2 commits
    • Julien Muchembled's avatar
      Minimize the amount of work during tpc_finish · 7eb7cf1b
      Julien Muchembled authored
      NEO did not ensure that all data and metadata are written on disk before
      tpc_finish, and it was for example vulnerable to ENOSPC errors.
      In other words, some work had to be moved to tpc_vote:
      
      - In tpc_vote, all involved storage nodes are now asked to write all metadata
        to ttrans/tobj and _commit_. Because the final tid is not known yet, the tid
        column of ttrans and tobj now contains NULL and the ttid respectively.
      
      - In tpc_finish, AskLockInformation is still required for read locking,
        ttrans.tid is updated with the final value and this change is _committed_.
      
      - The verification phase is greatly simplified, more reliable and faster. For
        all voted transactions, we can know if a tpc_finish was started by getting
        the final tid from the ttid, either from ttrans or from trans. And we know
        that such transactions can't be partial so we don't need to check oids.
      
      So in addition to minimizing the risk of failures during tpc_finish, we also
      fix a bug causing the verification phase to discard transactions with objects
      for which readCurrent was called.
      
      On performance side:
      
      - Although tpc_vote now asks all involved storages, instead of only those
        storing the transaction metadata, the client has been improved to do this
        in parallel. The additional commits are also all done in parallel.
      
      - A possible improvement to compensate the additional commits is to delay the
        commit done by the unlock.
      
      - By minimizing the time to lock transactions, objects are read-locked for a
        much shorter period. This is even more important that locked transactions
        must be unlocked in the same order.
      
      Transactions with too many modified objects will now timeout inside tpc_vote
      instead of tpc_finish. Of course, such transactions may still cause other
      transaction to timeout in tpc_finish.
      7eb7cf1b
    • Julien Muchembled's avatar
      fixup! storage: fix pruning of data when deleting partial transactions during verification · cff279af
      Julien Muchembled authored
      This fixes a regression in commit 83fe64bf
      when ttrans has several rows to the same data_id.
      cff279af
  20. 25 Nov, 2015 1 commit
    • Julien Muchembled's avatar
      Fix 2 'except' statements that will bug when moving to Python 3 · 0d36de7b
      Julien Muchembled authored
      Previous code relied on the fact that the exception target is kept past
      the end of the except clause. 2to3 is not smart enough to detect that.
      
      Without this change, a different OperationalError exception would be
      ignored because there's already a local variable of the same name.
      0d36de7b
  21. 03 Nov, 2015 1 commit
  22. 26 Oct, 2015 1 commit
  23. 02 Oct, 2015 1 commit
  24. 24 Jun, 2015 1 commit
    • Julien Muchembled's avatar
      Fix read of empty transactions · f8ce322b
      Julien Muchembled authored
      Since transactions have metadata like a description, it may not be useless
      to allow them. But the behaviour of FileStorage is to silently drop them,
      so we may have to do the same in the future.
      
      An application that is not supposed to commit empty transactions should write
      its own unit test to prevent this.
      f8ce322b
  25. 09 Jun, 2015 2 commits
  26. 21 May, 2015 1 commit
  27. 30 Jul, 2014 1 commit
  28. 22 Jul, 2014 1 commit
  29. 08 Jul, 2014 1 commit