1. 28 Nov, 2016 3 commits
  2. 27 Nov, 2016 11 commits
    • Julien Muchembled's avatar
      Bump protocol version · 8eb14b01
      Julien Muchembled authored
    • Julien Muchembled's avatar
      Fix identification issues, including a race condition causing id conflicts · 9385706f
      Julien Muchembled authored
      The added test describes how the new id timestamps fix the race condition.
      These timestamps could be any unique opaque values, and the protocol is
      extended to exchange them along with node ids.
      
      Internally, nodes also reuse timestamps as a marker to identify the first
      NotifyNodeInformation packet from the master: since this packet is a complete
      list of nodes in the cluster, any other node in the node manager has
      definitively left the cluster and is removed.
      
      The secondary masters didn't receive updates about master nodes.
      It's also useless to send them information about non-master nodes.
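The pruning rule described above (the first NotifyNodeInformation is a complete list, so unlisted nodes are dropped) could be sketched roughly as follows. `ToyNodeManager` and its methods are hypothetical names, not NEO's actual `NodeManager` API, and a plain boolean replaces the timestamp-based marker mentioned in the message.

```python
class ToyNodeManager(object):
    """Hypothetical stand-in for a node manager; not NEO's real class."""

    def __init__(self):
        self._nodes = {}      # node id -> opaque id timestamp
        self._synced = False  # True once the first full list was applied

    def add(self, node_id, id_timestamp=None):
        self._nodes[node_id] = id_timestamp

    def update(self, node_list):
        """Apply a notification given as [(node_id, id_timestamp), ...]."""
        if not self._synced:
            # First notification: it is a complete list, so any node we
            # know about that is not listed has left the cluster for good.
            listed = set(node_id for node_id, _ in node_list)
            for node_id in list(self._nodes):
                if node_id not in listed:
                    del self._nodes[node_id]
            self._synced = True
        for node_id, id_timestamp in node_list:
            self._nodes[node_id] = id_timestamp

    def ids(self):
        return sorted(self._nodes)
```

Later notifications are treated as incremental updates and never remove nodes in this sketch.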
    • Julien Muchembled's avatar
      54e819ff
    • Julien Muchembled's avatar
      Remove AskNodeInformation packet · d048a52d
      Julien Muchembled authored
      When client (including backup master) and admin nodes are identified,
      the primary master now automatically sends them all nodes with
      NotifyNodeInformation, as with storage nodes.
    • Julien Muchembled's avatar
      master: fix crashes in identification due to buggy nodes · 35664759
      Julien Muchembled authored
      - check address conflicts
      - on invalid values, reject peer instead of dying
    • Julien Muchembled's avatar
      lib.node: fix NodeManager accessors returning identified nodes · e7cccf01
      Julien Muchembled authored
      Listing connected/connecting nodes with a UUID is used:
      - in one place by storage nodes: here, it does not matter if we skip nodes that
        aren't really identified
      - in many places by the master, only for server connections, in which case we
        have equivalence with real identification
      
      So in practice, NodeManager is only simplified to reuse the 'identified'
      property of nodes.
    • Julien Muchembled's avatar
      lib.node: code refactoring · 5941b27d
      Julien Muchembled authored
    • Julien Muchembled's avatar
      storage: only accept clients that are known by the master · c17f5f91
      Julien Muchembled authored
      Therefore, a client node in the node manager is always RUNNING.
    • Julien Muchembled's avatar
      Give new ids to clients whose ids were already reallocated · d752aadb
      Julien Muchembled authored
      Although the change applies to any node with a temporary id (all but storage),
      only clients don't have addresses and are therefore not recognizable.
      
      After a client is disconnected from the master and before it reconnects, another
      client may join the cluster and "steal" the id of the first client. This issue
      leads to stuck clients, failing in a loop with exceptions like the following one:
      
          ERROR ZODB.Connection Couldn't load state for 0x0251
          Traceback (most recent call last):
            File "ZODB/Connection.py", line 860, in setstate
              self._setstate(obj)
            File "ZODB/Connection.py", line 901, in _setstate
              p, serial = self._storage.load(obj._p_oid, '')
            File "neo/client/Storage.py", line 82, in load
              return self.app.load(oid)[:2]
            File "neo/client/app.py", line 353, in load
              data, tid, next_tid, _ = self._loadFromStorage(oid, tid, before_tid)
            File "neo/client/app.py", line 373, in _loadFromStorage
              for node, conn in self.cp.iterateForObject(oid, readable=True):
            File "neo/client/pool.py", line 91, in iterateForObject
              pt = self.app.pt
            File "neo/client/app.py", line 145, in __getattr__
              self._getMasterConnection()
            File "neo/client/app.py", line 214, in _getMasterConnection
              result = self.master_conn = self._connectToPrimaryNode()
            File "neo/client/app.py", line 246, in _connectToPrimaryNode
              handler=handler)
            File "neo/lib/threaded_app.py", line 154, in _ask
              _handlePacket(qconn, qpacket, kw, handler)
            File "neo/lib/threaded_app.py", line 135, in _handlePacket
              handler.dispatch(conn, packet, kw)
            File "neo/lib/handler.py", line 66, in dispatch
              method(conn, *args, **kw)
            File "neo/lib/handler.py", line 188, in error
              getattr(self, Errors[code])(conn, message)
            File "neo/client/handlers/__init__.py", line 23, in protocolError
              raise StorageError("protocol error: %s" % message)
          StorageError: protocol error: already connected
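The id-stealing scenario and the fix (hand out a new id instead of rejecting the peer) might be modelled as below. This is a minimal sketch: `ToyMaster`, its methods and the "smallest free id" allocation policy are assumptions made for illustration, not NEO's master implementation.

```python
class ToyMaster(object):
    """Hypothetical id allocator; the real master logic is more involved."""

    def __init__(self):
        self._owner = {}  # node id -> peer currently holding it

    def _new_id(self):
        node_id = 1
        while node_id in self._owner:
            node_id += 1
        return node_id  # smallest free id, so freed ids get reused

    def identify(self, peer, requested_id=None):
        holder = self._owner.get(requested_id)
        if requested_id is None or holder not in (None, peer):
            # No id yet, or the id was reallocated to another peer in the
            # meantime: give a fresh id rather than fail the identification.
            requested_id = self._new_id()
        self._owner[requested_id] = peer
        return requested_id

    def disconnect(self, peer):
        for node_id, holder in list(self._owner.items()):
            if holder == peer:
                del self._owner[node_id]
```

With this model, a client that reconnects after another client took its old id simply continues under a new id.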
    • Julien Muchembled's avatar
      spelling: oudated -> outdated · b62b8dc3
      Julien Muchembled authored
    • Julien Muchembled's avatar
      Fix spelling mistakes · 6e32ebb7
      Julien Muchembled authored
  3. 25 Nov, 2016 2 commits
  4. 21 Nov, 2016 2 commits
    • Julien Muchembled's avatar
      client: fix item eviction from cache, which could break loading from storage · 4ef05b9e
      Julien Muchembled authored
      `ClientCache._oid_dict` shall not have empty values. For a given oid, when the
      last item is removed from the cache, the oid must be removed as well to free
      memory. In some cases, this was not done.
      
      A consequence of this bug is the following exception:
      
          ERROR ZODB.Connection Couldn't load state for 0x02d1e1e4
          Traceback (most recent call last):
            File "ZODB/Connection.py", line 860, in setstate
              self._setstate(obj)
            File "ZODB/Connection.py", line 901, in _setstate
              p, serial = self._storage.load(obj._p_oid, '')
            File "neo/client/Storage.py", line 82, in load
              return self.app.load(oid)[:2]
            File "neo/client/app.py", line 358, in load
              self._cache.store(oid, data, tid, next_tid)
            File "neo/client/cache.py", line 228, in store
              prev = item_list[-1]
          IndexError: list index out of range
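The invariant stated above (no empty value in `_oid_dict`) can be illustrated with a toy cache. Only the `_oid_dict` name comes from the commit message; `ToyCache`, `store` and `evict` here are simplified, hypothetical stand-ins for the real `ClientCache`.

```python
class ToyCache(object):
    """Hypothetical, much simplified cache keyed by oid."""

    def __init__(self):
        self._oid_dict = {}  # oid -> non-empty list of (tid, data)

    def store(self, oid, tid, data):
        item_list = self._oid_dict.setdefault(oid, [])
        item_list.append((tid, data))
        item_list.sort(key=lambda item: item[0])

    def evict(self, oid, tid):
        item_list = self._oid_dict[oid]
        item_list[:] = [item for item in item_list if item[0] != tid]
        if not item_list:
            # Keep the invariant: drop the oid with its last item, so that
            # later code can rely on item_list[-1] existing.
            del self._oid_dict[oid]
```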
  5. 15 Nov, 2016 2 commits
    • Kirill Smelkov's avatar
      backup: Teach cluster in BACKINGUP state to also serve regular ZODB clients in read-only mode · d4944062
      Kirill Smelkov authored
      A backup cluster for tids <= backup_tid has all data to provide regular
      read-only ZODB service. Having regular ZODB access to the data can be
      handy e.g. for externally verifying data for consistency between
      main and backup clusters. Peeking around without disturbing the main
      cluster might also be useful sometimes.
      
      In this patch:
      
      - master & storage nodes are taught:
      
          * to instantiate read-only or regular client service handler depending on cluster state:
            RUNNING   -> regular
            BACKINGUP -> read-only
      
          * in read-only client handler:
            + to reject write-related operations
            + to provide read operations, but with semantics adjusted as if last_tid in
              the database were equal to backup_tid
      
      - new READ_ONLY_ACCESS protocol error code is introduced so that client can
        raise POSException.ReadOnlyError upon receiving it.
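The handler selection listed above might look like this in miniature. All class and function names are hypothetical, the cluster states are plain strings, and `ReadOnlyAccess` only stands in for the new READ_ONLY_ACCESS protocol error; this is a sketch of the idea, not the patch itself.

```python
class ReadOnlyAccess(Exception):
    """Stands in for the READ_ONLY_ACCESS protocol error."""

class ToyClientHandler(object):
    def __init__(self, db):
        self.db = db  # dict: tid -> data

    def last_tid(self):
        return max(self.db)

    def store(self, tid, data):
        self.db[tid] = data

class ToyReadOnlyClientHandler(ToyClientHandler):
    def __init__(self, db, backup_tid):
        ToyClientHandler.__init__(self, db)
        self.backup_tid = backup_tid

    def last_tid(self):
        # Answer as if the database ended at backup_tid.
        return self.backup_tid

    def store(self, tid, data):
        # Write-related operations are rejected.
        raise ReadOnlyAccess("read-only access")

def handler_for(cluster_state, db, backup_tid=None):
    """Pick a handler depending on the cluster state."""
    if cluster_state == "RUNNING":
        return ToyClientHandler(db)
    if cluster_state == "BACKINGUP":
        return ToyReadOnlyClientHandler(db, backup_tid)
    raise ValueError(cluster_state)
```

A client receiving the error would then translate it into `POSException.ReadOnlyError`, as the message says.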
      
      I have not implemented back-channel for invalidations in read-only mode (yet ?).
      This way once a client connects to cluster in backup state, it won't see
      new data fetched by backup cluster from upstream after client connected.
      
      The reason invalidations are not implemented is that for now (imho)
      there is no readily available infrastructure to get updates from
      replicating node on transaction-by-transaction basis (it currently only
      notifies when whole batch is done). For consistency verification (main
      reason for this patch) we also don't need invalidations to work, as in
      that task we always connect afresh to backup. So I simply only put
      relevant TODOs about invalidations for now.
      
      The patch is not very polished but should work.
      
      /reviewed-on !4
  6. 27 Oct, 2016 1 commit
    • Iliya Manolov's avatar
      neoctl: make 'print ids' command display time of TIDs · d9dd39f0
      Iliya Manolov authored
      Currently, the command "neoctl [arguments] print ids" has the following output:
      
          last_oid = 0x...
          last_tid = 0x...
          last_ptid = ...
      
      or
      
          backup_tid = 0x...
          last_tid = 0x...
          last_ptid = ...
      
      depending on whether the cluster is in normal or backup mode.
      
      This is hard to read, since the admin is often interested in the time that corresponds to each tid. Now the output is:
      
          last_oid = 0x...
          last_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
          last_ptid = ...
      
      or
      
          backup_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
          last_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
          last_ptid = ...
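For illustration, a TID can be turned into such a time string as sketched below, assuming TIDs follow the ZODB `TimeStamp` layout (high 32 bits: minutes counted with 31-day months from 1900; low 32 bits: fraction of a minute). The helper names and the 16-digit hex width are assumptions; this is not the neoctl code.

```python
import struct

def make_tid(year, month, day, hour=0, minute=0, second=0.0):
    """Build an 8-byte TID; only used here to get sample input."""
    high = ((((year - 1900) * 12 + month - 1) * 31 + day - 1) * 24
            + hour) * 60 + minute
    low = int(second * 2 ** 32 / 60)
    return struct.pack(">II", high, low)

def tid_to_time(tid):
    """Decode an 8-byte TID into 'yyyy-mm-dd hh:mm:ss.ssssss'."""
    high, low = struct.unpack(">II", tid)
    high, minute = divmod(high, 60)
    high, hour = divmod(high, 24)
    high, day = divmod(high, 31)
    year, month = divmod(high, 12)
    second = low * 60.0 / 2 ** 32
    return "%04d-%02d-%02d %02d:%02d:%09.6f" % (
        year + 1900, month + 1, day + 1, hour, minute, second)

def format_id(name, tid):
    """Render one output line, e.g. for last_tid or backup_tid."""
    value = struct.unpack(">Q", tid)[0]
    return "%s = 0x%016x (%s)" % (name, value, tid_to_time(tid))
```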
      
      /reviewed-on !2
  7. 17 Oct, 2016 1 commit
    • Kirill Smelkov's avatar
      mysql: force _getNextTID() to use appropriate/whole index · eaa00a88
      Kirill Smelkov authored
      Similarly to 13911ca3, on the same instance, after MariaDB was upgraded to
      10.1.17, the following query started to execute very slowly, even after
      `OPTIMIZE TABLE obj`:
      
          MariaDB [(none)]> SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
          +--------------------+
          | tid                |
          +--------------------+
          | 268707072758797063 |
          +--------------------+
          1 row in set (4.82 sec)
      
      Both explain and analyze say the query will use / is using the `partition` key, but only partially (note key_len is only 10, not 18):
      
          MariaDB [(none)]> SHOW INDEX FROM neo1.obj;
          +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
          | Table | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
          +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
          | obj   |          0 | PRIMARY   |            1 | partition   | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | PRIMARY   |            2 | tid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | PRIMARY   |            3 | oid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | partition |            1 | partition   | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | partition |            2 | oid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | partition |            3 | tid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          1 | data_id   |            1 | data_id     | A         |    28755928 |     NULL | NULL   | YES  | BTREE      |         |               |
          +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
          7 rows in set (0.00 sec)
      
          MariaDB [(none)]> explain SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
          | id   | select_type | table | type | possible_keys     | key       | key_len | ref         | rows | Extra                    |
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
          |    1 | SIMPLE      | obj   | ref  | PRIMARY,partition | partition | 10      | const,const |    2 | Using where; Using index |
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
          1 row in set (0.00 sec)
      
          MariaDB [(none)]> analyze SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
          | id   | select_type | table | type | possible_keys     | key       | key_len | ref         | rows | r_rows     | filtered | r_filtered | Extra                    |
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
          |    1 | SIMPLE      | obj   | ref  | PRIMARY,partition | partition | 10      | const,const |    2 | 9741121.00 |   100.00 |       0.00 | Using where; Using index |
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
          1 row in set (4.93 sec)
      
      Explicitly forcing use of the (partition, oid, tid) index, which is precisely designed to serve this and similar queries, keeps the query from being slow:
      
          MariaDB [(none)]> analyze SELECT tid FROM neo1.obj FORCE INDEX(`partition`) WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
          +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
          | id   | select_type | table | type  | possible_keys | key       | key_len | ref  | rows | r_rows | filtered | r_filtered | Extra                    |
          +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
          |    1 | SIMPLE      | obj   | range | partition     | partition | 18      | NULL |    2 |   1.00 |   100.00 |     100.00 | Using where; Using index |
          +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
          1 row in set (0.00 sec)
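On the Python side, the index hint amounts to one extra clause in the query text. The helper below is a hypothetical sketch that only builds the SQL string shown above (without the `neo1.` schema prefix); it is not the actual `_getNextTID()` implementation.

```python
def next_tid_query(partition, oid, tid):
    """Build the 'next tid' query with an explicit index hint."""
    # FORCE INDEX makes the optimizer use the (partition, oid, tid) key.
    return ("SELECT tid FROM obj FORCE INDEX(`partition`)"
            " WHERE `partition`=%d AND oid=%d AND tid>%d"
            " ORDER BY tid LIMIT 1" % (partition, oid, tid))
```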
      
      /cc @jm, @vpelltier, @Tyagov
      
      /reviewed-on nexedi/neoppod!1
  8. 12 Sep, 2016 1 commit
  9. 29 Aug, 2016 2 commits
    • Julien Muchembled's avatar
      mysql: fix use of wrong SQL index when checking for dropped partitions · 13911ca3
      Julien Muchembled authored
      After partitions were dropped with TokuDB, we had a case where MariaDB 10.1.14
      stopped using the most appropriate index.
      
          MariaDB [neo0]> explain SELECT DISTINCT data_id FROM obj WHERE `partition`=5;
          +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
          | id   | select_type | table | type  | possible_keys     | key     | key_len | ref  | rows | Extra                                 |
          +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
          |    1 | SIMPLE      | obj   | range | PRIMARY,partition | data_id | 11      | NULL |   10 | Using where; Using index for group-by |
          +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
          MariaDB [neo0]> SELECT SQL_NO_CACHE DISTINCT data_id FROM obj WHERE `partition`=5;
          Empty set (1 min 51.47 sec)
      
      Expected:
      
          MariaDB [neo1]> explain SELECT DISTINCT data_id FROM obj WHERE `partition`=4;
          +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
          | id   | select_type | table | type | possible_keys     | key     | key_len | ref   | rows | Extra                        |
          +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
          |    1 | SIMPLE      | obj   | ref  | PRIMARY,partition | PRIMARY | 2       | const |    1 | Using where; Using temporary |
          +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
          1 row in set (0.00 sec)
          MariaDB [neo1]> SELECT SQL_NO_CACHE DISTINCT data_id FROM obj WHERE `partition`=4;
          Empty set (0.00 sec)
      
      Restarting the server or 'OPTIMIZE TABLE obj;' does not help.
      
      Such an issue could prevent the cluster from starting due to timeouts, by always
      going back to RECOVERING state.
    • Julien Muchembled's avatar
      Update TODO · 00ffb1ef
      Julien Muchembled authored
  10. 11 Aug, 2016 2 commits
    • Julien Muchembled's avatar
      Add test to check that a moved cell doesn't cause POSKeyError · df990a05
      Julien Muchembled authored
      Freeing disk space when a cell is dropped will have to be implemented with care,
      not only for performance reasons.
    • Julien Muchembled's avatar
      mysql: do not use unsafe TRUNCATE statement · c3c2ffe2
      Julien Muchembled authored
      TRUNCATE was chosen for performance reasons, but it's usually done on small
      tables, and not for performance-critical operations. TRUNCATE commits
      implicitly, so for pt/ttrans in particular, it's certainly slower due to extra
      fsyncs to disk.
      
      On the other hand, committing too early can corrupt the database if the storage
      node is stopped just after. For example, a failure in changePartitionTable()
      can cause 'pt' to remain empty.
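The benefit of a transactional DELETE over an implicitly committing TRUNCATE can be shown with a small stand-in. The sketch uses Python's `sqlite3` module rather than MySQL (SQLite has no TRUNCATE), and the table layout and function names are made up for the example: a failure while rewriting `pt` is rolled back, so the table does not stay empty.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a small 'pt' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pt (rid INTEGER, nid INTEGER)")
    conn.executemany("INSERT INTO pt VALUES (?, ?)", [(0, 1), (1, 1)])
    conn.commit()
    return conn

def replace_rows(conn, rows, fail=False):
    """Replace the content of 'pt' atomically."""
    try:
        # DELETE stays inside the current transaction, unlike TRUNCATE in
        # MySQL/MariaDB, which would commit right here.
        conn.execute("DELETE FROM pt")
        if fail:
            raise RuntimeError("simulated failure before the new rows exist")
        conn.executemany("INSERT INTO pt VALUES (?, ?)", rows)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
```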
  11. 01 Aug, 2016 2 commits
  12. 31 Jul, 2016 1 commit
    • Julien Muchembled's avatar
      storage: review TransactionManager.abortFor · 2d388048
      Julien Muchembled authored
      This reverts commit 7aecdada partially.
      There seems to be no bug here, because:
      - abortFor() is only called upon a notification from the master that a client
        is disconnected,
      - and from the same TCP connection, we only receive a LockInformation packet
        if there's still such a transaction on the master side.
      
      The code removed in abortFor() was redundant with abort().
  13. 27 Jul, 2016 6 commits
    • Julien Muchembled's avatar
      cb144fdb
    • Julien Muchembled's avatar
      38583af9
    • Julien Muchembled's avatar
      client: do not limit the number of open connections to storage nodes · 77132157
      Julien Muchembled authored
      There was a bug: connections were not maintained during a TPC,
      which caused transactions to be aborted when the limit was reached.
      
      Given that oids are spread evenly over all partitions, and that clients always
      write to all cells of each involved partition, clients would spend their time
      reconnecting to storage nodes as soon as the limit is reached. So such a feature
      really looks counter-productive.
    • Julien Muchembled's avatar
      client: fix conflict of node id by never reading from storage without being connected to the master · 11d83ad9
      Julien Muchembled authored
      Client nodes ignored the state of the connection to the master node when reading
      data from storage, as long as their partition tables were recent enough. This
      way, they were able to finish read-only transactions even if they couldn't reach
      the master, which could be useful for high availability. The downside is that
      the master node ignored that their node ids were still used, which causes "uuid"
      conflicts when reallocating them.
      
      Rejected solutions:
      - An unused NEO Storage should not insist on staying connected to the master node.
      - Reverting to big random node identifiers is a lot of work and it would make
        debugging annoying (see commit 23fad3af).
      - Always increasing node ids could have been a simple solution if we accepted
        that the cluster dies once all 2^24 possible ids have been allocated.
      
      Given that reading from storage without being connected to the master can only
      be useful to finish the current transaction (because we always ping the master
      at the beginning of every transaction), keeping such a feature is not worth the
      effort.
      
      This commit fixes id conflicts in a very simple way, by clearing the partition
      table upon primary node failure, which forces reconnection to the master before
      querying any storage node. In such a case, we raise a special exception that will
      cause the transaction to be restarted, so that the user does not get errors for
      temporary connection failures.
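One possible reading of this fix, as a toy model: the partition table is dropped when the primary is lost, the next read reconnects first, and a retryable exception makes the caller restart the transaction. `ToyClient`, `TransientMasterLoss` and `run` are hypothetical names, and the exact point where the real client raises its special exception may differ.

```python
class TransientMasterLoss(Exception):
    """Hypothetical retryable error standing in for the special exception."""

class ToyClient(object):
    def __init__(self, connect_to_master):
        # connect_to_master: callable returning a partition table (a dict).
        self._connect = connect_to_master
        self.pt = connect_to_master()

    def on_primary_failure(self):
        # Forget the partition table, so that no storage node can be
        # queried before we are connected to the master again.
        self.pt = None

    def load(self, oid):
        if self.pt is None:
            self.pt = self._connect()
            # The current transaction started before the failure: ask the
            # caller to restart it instead of reporting an error.
            raise TransientMasterLoss
        return self.pt[oid]

def run(client, oid, retries=3):
    """Retry loop playing the role of the transaction restart."""
    for _ in range(retries):
        try:
            return client.load(oid)
        except TransientMasterLoss:
            continue
    raise TransientMasterLoss
```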
    • Julien Muchembled's avatar
      storage: add comment about the idea to lock an oid before reporting a resolvable conflict · 4e17456b
      Julien Muchembled authored
      Currently, another argument not to lock is that we would not be able to test
      incremental resolution anymore. We can think about this again when deadlock
      resolution is implemented.
  14. 24 Jul, 2016 4 commits
    • Julien Muchembled's avatar
      Fix race conditions in EventManager between _poll/connection_dict and (un)registration · 8b91706a
      Julien Muchembled authored
      The following error was reported on a client node:
      
          #0x0000 Error                   < None (2001:...:2051)
          1 (Retry Later)
          connection closed for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, closed, client) at 7f1ea7c42f90>
          Event Manager:
          connection started for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10>
          #0x0000 RequestIdentification          > None (2001:...:2051)
            Readers: []
            Writers: []
            Connections:
              13: <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10> (pending=False)
          Node manager : 1 nodes
          * None |   MASTER | 2001:...:2051 | UNKNOWN
          <ClientCache history_size=0 oid_count=0 size=0 time=0 queue_length=[0] (life_time=10000 max_history_size=100000 max_size=20971520)>
          poll raised, retrying
          Traceback (most recent call last):
            File "neo/lib/threaded_app.py", line 93, in _run
              poll(1)
            File "neo/lib/event.py", line 134, in poll
              self._poll(0)
            File "neo/lib/event.py", line 164, in _poll
              conn = self.connection_dict[fd]
          KeyError: 13
      
      which means that:
      - while the poll thread is getting a (13, EPOLLIN) event because it is
        closed (aborted by the master)
      - another thread processes the error packet, by closing it in
        PrimaryBootstrapHandler.notReady
      - next, the poll thread resumes the execution of EpollEventManager._poll
        and fails to find fd=13 in self.connection_dict
      
      So here, we have a race condition between epoll_wait and any further use
      of connection_dict to map returned fds.
      
      However, what commit a4731a0c does to handle
      the case of fd reallocation only works for mono-threaded applications.
      In EPOLLIN, wrapping 'self.connection_dict[fd]' the same way as for other
      events is not enough. For example:
      - case 1:
        - thread 1: epoll returns fd=13
        - thread 2: close(13)
        - thread 2: open(13)
        - thread 1: self.connection_dict[13] does not match
                    but this would be handled by the 'unregistered' list
      - case 2:
        - thread 1: reset 'unregistered'
        - thread 2: close(13)
        - thread 2: open(13)
        - thread 1: epoll returns fd=13
        - thread 1: self.connection_dict[13] matches
                    but it would be wrongly ignored by 'unregistered'
      - case 3:
        - thread 1: about to call readable/writable/onTimeout on a connection
        - thread 2: this connection is closed
        - thread 1: readable/writable/onTimeout wrongly called on a closed connection
      
      We could protect _poll() with a lock, and make unregister() use wakeup() so
      that it gets a chance to acquire it, but that causes threaded tests to deadlock
      (continuing in this direction seems too complicated).
      
      So we have to deal with the fact that there can be race conditions at any time
      and there's no way to make 'connection_dict' match exactly what epoll returns.
      We solve this by preventing fd reallocation inside _poll(), which is fortunately
      possible with sockets, using 'shutdown': the closing of fds is delayed.
      
      For above case 3, readable/writable/onTimeout for MTClientConnection are also
      changed to test whether the connection is still open while it has the lock.
      Just for safety, we do the same for 'process'.
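The "shutdown now, close later" idea can be sketched as follows. `ToyEventManager` and `after_poll` are hypothetical names and locking is left out entirely; the point is only that `shutdown()` ends the connection while the fd number stays allocated, so it cannot be handed to a new socket while the poll thread may still be looking at it.

```python
import socket

class ToyEventManager(object):
    """Hypothetical sketch of delayed fd closing; no locking shown."""

    def __init__(self):
        self.connection_dict = {}  # fd -> socket
        self._closing = []         # sockets shut down but not closed yet

    def register(self, sock):
        self.connection_dict[sock.fileno()] = sock

    def unregister(self, sock):
        del self.connection_dict[sock.fileno()]
        # shutdown() terminates the connection but keeps the fd allocated,
        # so the same number cannot be reused by another socket yet.
        sock.shutdown(socket.SHUT_RDWR)
        self._closing.append(sock)

    def after_poll(self):
        # Meant to run in the poll thread once it no longer uses the fds
        # returned by the last poll: only now are the fds released.
        closing, self._closing = self._closing, []
        for sock in closing:
            sock.close()
```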
      
      Lastly, another kind of race condition that this commit also fixes concerns
      the use of itervalues() on EventManager.connection_dict.
    • Julien Muchembled's avatar
      Indent many lines before any real change · 4a0b936f
      Julien Muchembled authored
      This is a preliminary commit, without any functional change,
      just to make the next one easier to review.
    • Julien Muchembled's avatar
      client: remove redundant check of new connections to the master · 9f4dd15e
      Julien Muchembled authored
      We already have logs when a connection fails,
      and ask() raises ConnectionClosed if the connection is closed.
    • Vincent Pelletier's avatar
      e791dc3f