neoppod commits at f22d5c4ebf19cb6df33f03b7ac3cea52dc945cbb
https://lab.nexedi.com/nexedi/neoppod/-/commits/f22d5c4ebf19cb6df33f03b7ac3cea52dc945cbb

https://lab.nexedi.com/nexedi/neoppod/-/commit/f22d5c4ebf19cb6df33f03b7ac3cea52dc945cbb
Enable branch coverage measurement by default (2016-11-28T15:40:11+01:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/bec29220aa541df23b5ad4a47805547a2348f0d3
coverage: add support for functional tests (2016-11-28T01:50:32+01:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/8eb14b0184cc5cde5df16fa52367e27e34bad404
Bump protocol version (2016-11-27T02:20:09+01:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/9385706ff5279d0205d61034023cb6053112561b
Fix identification issues, including a race condition causing id conflicts (2016-11-27T02:18:59+01:00, Julien Muchembled <jm@nexedi.com>)
The added test describes how the new id timestamps fix the race condition.
These timestamps could be any unique opaque values, and the protocol is
extended to exchange them along with node ids.
Internally, nodes also reuse timestamps as a marker to identify the first
NotifyNodeInformation packet from the master: since this packet is a complete
list of nodes in the cluster, any other node in the node manager has definitely
left the cluster and is removed.
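The pairing of node ids with timestamps can be sketched as follows. This is a hypothetical illustration of the idea, not NEO's actual API: `NodeManager`, `identify` and `matches` are made-up names, and any generator of unique opaque values would do for the timestamps.

```python
from itertools import count

_id_timestamp = count(1)  # any source of unique opaque values works

class NodeManager:
    """Illustrative sketch: node ids are stamped so a reallocated id
    can be told apart from the original owner's id."""

    def __init__(self):
        self.nodes = {}  # node_id -> (id_timestamp, address)

    def identify(self, node_id, address):
        # Register a node, stamping its id with a fresh timestamp.
        ts = next(_id_timestamp)
        self.nodes[node_id] = (ts, address)
        return ts

    def matches(self, node_id, ts):
        # True only if this (id, timestamp) pair is still current.
        entry = self.nodes.get(node_id)
        return entry is not None and entry[0] == ts
```

With this, a second client that takes over id 32 gets a new timestamp, so the first client's stale (id, timestamp) pair no longer matches and its entry can be discarded.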
The secondary masters didn't receive updates about master nodes.
It's also useless to send them information about non-master nodes.

https://lab.nexedi.com/nexedi/neoppod/-/commit/54e819ff803ffa072d03788dfb7db849780853ea
protocol: simplify definition of Struct-based items (2016-11-27T02:18:51+01:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/d048a52d2ef88e1791370f422e9d29ce64ba729b
Remove AskNodeInformation packet (2016-11-27T02:18:51+01:00, Julien Muchembled <jm@nexedi.com>)
When Client (including backup master) and admin nodes are identified,
the primary master now sends them automatically all nodes with
NotifyNodeInformation, as with storage nodes.

https://lab.nexedi.com/nexedi/neoppod/-/commit/35664759887780b460ce53a2e080d6160dc9b808
master: fix crashes in identification due to buggy nodes (2016-11-27T02:18:51+01:00, Julien Muchembled <jm@nexedi.com>)
- check address conflicts
- on invalid values, reject the peer instead of dying

https://lab.nexedi.com/nexedi/neoppod/-/commit/e7cccf012fe6c456649a0dfe14a4a333f5cc79d7
lib.node: fix NodeManager accessors returning identified nodes (2016-11-27T02:18:50+01:00, Julien Muchembled <jm@nexedi.com>)
Listing connected/connecting nodes with a UUID is used:
- in one place by storage nodes: here, it does not matter if we skip nodes that
aren't really identified
- in many places by the master, only for server connections, in which case we
have equivalence with real identification
So in practice, NodeManager is only simplified to reuse the 'identified'
property of nodes.

https://lab.nexedi.com/nexedi/neoppod/-/commit/5941b27daa19c6ae7d8f19f10f26d55641f3ca42
lib.node: code refactoring (2016-11-27T02:18:50+01:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/c17f5f919824b89f5982d47e6dd27864e0ca9d60
storage: only accept clients that are known by the master (2016-11-27T02:18:50+01:00, Julien Muchembled <jm@nexedi.com>)
Therefore, a client node in the node manager is always RUNNING.

https://lab.nexedi.com/nexedi/neoppod/-/commit/d752aadb7ec87d18cd4f8d1ac299ced688f26985
Give new ids to clients whose ids were already reallocated (2016-11-27T02:18:50+01:00, Julien Muchembled <jm@nexedi.com>)
Although the change applies to any node with a temporary id (all but storage),
only clients lack addresses and are therefore not recognizable.
After a client is disconnected from the master and before it reconnects, another
client may join the cluster and "steal" the id of the first client. This issue
leads to stuck clients, failing in a loop with exceptions like the following:
ERROR ZODB.Connection Couldn't load state for 0x0251
Traceback (most recent call last):
File "ZODB/Connection.py", line 860, in setstate
self._setstate(obj)
File "ZODB/Connection.py", line 901, in _setstate
p, serial = self._storage.load(obj._p_oid, '')
File "neo/client/Storage.py", line 82, in load
return self.app.load(oid)[:2]
File "neo/client/app.py", line 353, in load
data, tid, next_tid, _ = self._loadFromStorage(oid, tid, before_tid)
File "neo/client/app.py", line 373, in _loadFromStorage
for node, conn in self.cp.iterateForObject(oid, readable=True):
File "neo/client/pool.py", line 91, in iterateForObject
pt = self.app.pt
File "neo/client/app.py", line 145, in __getattr__
self._getMasterConnection()
File "neo/client/app.py", line 214, in _getMasterConnection
result = self.master_conn = self._connectToPrimaryNode()
File "neo/client/app.py", line 246, in _connectToPrimaryNode
handler=handler)
File "neo/lib/threaded_app.py", line 154, in _ask
_handlePacket(qconn, qpacket, kw, handler)
File "neo/lib/threaded_app.py", line 135, in _handlePacket
handler.dispatch(conn, packet, kw)
File "neo/lib/handler.py", line 66, in dispatch
method(conn, *args, **kw)
File "neo/lib/handler.py", line 188, in error
getattr(self, Errors[code])(conn, message)
File "neo/client/handlers/__init__.py", line 23, in protocolError
raise StorageError("protocol error: %s" % message)
StorageError: protocol error: already connected

https://lab.nexedi.com/nexedi/neoppod/-/commit/b62b8dc327901a002a311a064c820a745625fe6f
spelling: oudated -> outdated (2016-11-27T02:18:50+01:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/6e32ebb78148a8a7426f868b67f7c91ddfefc97a
Fix spelling mistakes (2016-11-27T02:17:56+01:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/b61f8745cc63f8096615f1f88255af17f6a23a44
coverage: CacheItem.__repr__ (client) (2016-11-25T14:45:10+01:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/5de0ff3a6ed403dde0038fcda02ec98ca89d834e
New neotestrunner option for code coverage testing (2016-11-25T14:45:10+01:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/4ef05b9e0fa16e08b4fd7608f2445faaf146f0cf
client: fix item eviction from cache, which could break loading from storage (2016-11-21T17:23:06+01:00, Julien Muchembled <jm@nexedi.com>)
`ClientCache._oid_dict` shall not have empty values. For a given oid, when the
last item is removed from the cache, the oid must be removed as well to free
memory. In some cases, this was not done.
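The invariant can be sketched as below. This is a minimal illustration, not NEO's actual `ClientCache` (whose items form linked lists and carry tids); only the cleanup rule is shown.

```python
class OidCache:
    """Sketch of the invariant: _oid_dict values must never be empty."""

    def __init__(self):
        self._oid_dict = {}  # oid -> non-empty list of cached items

    def store(self, oid, item):
        self._oid_dict.setdefault(oid, []).append(item)

    def evict(self, oid, item):
        item_list = self._oid_dict[oid]
        item_list.remove(item)
        if not item_list:
            # The fix: when the last item is removed, drop the oid key
            # too, both to free memory and so that later code may safely
            # do item_list[-1] on any value it finds.
            del self._oid_dict[oid]
```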
A consequence of this bug is the following exception:
ERROR ZODB.Connection Couldn't load state for 0x02d1e1e4
Traceback (most recent call last):
File "ZODB/Connection.py", line 860, in setstate
self._setstate(obj)
File "ZODB/Connection.py", line 901, in _setstate
p, serial = self._storage.load(obj._p_oid, '')
File "neo/client/Storage.py", line 82, in load
return self.app.load(oid)[:2]
File "neo/client/app.py", line 358, in load
self._cache.store(oid, data, tid, next_tid)
File "neo/client/cache.py", line 228, in store
prev = item_list[-1]
IndexError: list index out of range

https://lab.nexedi.com/nexedi/neoppod/-/commit/2b3993f167da16ae68ddbbf383b61443a4e6a79c
Bump protocol version for new read-only mode in BACKUPING state (2016-11-21T17:23:06+01:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/d49440623932d8ff9eb36e9da7b1ee03a1f22948
backup: Teach cluster in BACKUPING state to also serve regular ZODB clients in read-only mode (2016-11-15T11:57:22+01:00, Kirill Smelkov <kirr@nexedi.com>)
A backup cluster for tids <= backup_tid has all data to provide regular
read-only ZODB service. Having regular ZODB access to the data can be
handy e.g. for externally verifying data for consistency between
main and backup clusters. Peeking around without disturbing the main
cluster might also be useful sometimes.
In this patch:
- master & storage nodes are taught:
* to instantiate read-only or regular client service handler depending on cluster state:
RUNNING -> regular
BACKINGUP -> read-only
* in read-only client handler:
+ to reject write-related operations
+ to provide read operations but adjust semantic as last_tid in the database
would be = backup_tid
- new READ_ONLY_ACCESS protocol error code is introduced so that client can
raise POSException.ReadOnlyError upon receiving it.
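The handler selection described above can be sketched as follows. All names here are illustrative stand-ins, not NEO's real classes or handler methods; the point is only the state-to-handler dispatch and the adjusted read semantics.

```python
class ReadOnlyError(Exception):
    """Stand-in for POSException.ReadOnlyError."""

class ReadOnlyHandler:
    """Sketch of the read-only client service handler."""

    def __init__(self, backup_tid):
        self.backup_tid = backup_tid

    def askStoreObject(self, *args):
        # Write-related operations are rejected while backing up.
        raise ReadOnlyError("cluster is in BACKINGUP state")

    def askLastTransaction(self):
        # Read semantics are adjusted: last_tid answers as backup_tid.
        return self.backup_tid

class RegularHandler:
    """Sketch of the normal client service handler."""

def client_handler(cluster_state, backup_tid=None):
    # Instantiate the handler according to cluster state:
    # RUNNING -> regular, BACKINGUP -> read-only.
    if cluster_state == "BACKINGUP":
        return ReadOnlyHandler(backup_tid)
    return RegularHandler()
```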
I have not implemented a back-channel for invalidations in read-only mode (yet?).
This means that once a client connects to a cluster in backup state, it won't
see new data fetched by the backup cluster from upstream after the client
connected.
The reason invalidations are not implemented is that for now (imho)
there is no off-hand ready infrastructure to get updates from the
replicating node on a transaction-by-transaction basis (it currently only
notifies when a whole batch is done). For consistency verification (the main
reason for this patch) we also don't need invalidations to work, as in
that task we always connect afresh to the backup. So for now I simply put
relevant TODOs about invalidations.
The patch is not very polished but should work.
/reviewed-on nexedi/neoppod!4 (https://lab.nexedi.com/nexedi/neoppod/merge_requests/4)

https://lab.nexedi.com/nexedi/neoppod/-/commit/ab552d87356766ce600d16b703fdfc79ffdac0f5
tests/threaded: Add handy shortcuts to NEOCluster to concisely check cluster ... (2016-11-15T11:56:39+01:00, Kirill Smelkov <kirr@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/d9dd39f083783e2e351237d31efcbcc892819194
neoctl: make 'print ids' command display time of TIDs (2016-10-27T17:10:30+02:00, Iliya Manolov <ilmanfordinner@gmail.com>)
Currently, the command "neoctl [arguments] print ids" has the following output:
last_oid = 0x...
last_tid = 0x...
last_ptid = ...
or
backup_tid = 0x...
last_tid = 0x...
last_ptid = ...
depending on whether the cluster is in normal or backup mode.
This is extremely unreadable since the admin is often interested in the time that corresponds to each tid. Now the output is:
last_oid = 0x...
last_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
last_ptid = ...
or
backup_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
last_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
last_ptid = ...
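The tid-to-time conversion rests on TIDs following the standard ZODB TimeStamp layout: the high 32 bits pack year/month/day/hour/minute as minutes since 1900-01-01 (months padded to 31 days), and the low 32 bits hold the second within the minute as a fraction of 60 scaled to 2**32. A decoding sketch under that assumption (`tid_to_time` is an illustrative helper, not neoctl's actual code):

```python
def tid_to_time(tid):
    """Decode a 64-bit TID into (year, month, day, hour, minute, second),
    assuming the ZODB TimeStamp encoding."""
    minutes, frac = tid >> 32, tid & 0xffffffff
    second = frac * 60.0 / (1 << 32)   # low word: second as fraction of 60
    minute = minutes % 60
    hours = minutes // 60
    hour = hours % 24
    days = hours // 24
    day = days % 31 + 1                # days are packed with 31-day months
    months = days // 31
    month = months % 12 + 1
    year = months // 12 + 1900
    return (year, month, day, hour, minute, second)
```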
/reviewed-on nexedi/neoppod!2 (https://lab.nexedi.com/nexedi/neoppod/merge_requests/2)

https://lab.nexedi.com/nexedi/neoppod/-/commit/eaa00a88416c10928f82ed37356756a64298d848
mysql: force _getNextTID() to use appropriate/whole index (2016-10-17T20:08:04+02:00, Kirill Smelkov <kirr@nexedi.com>)
Similarly to commit 13911ca3 ("mysql: fix use of wrong SQL index when checking
for dropped partitions"): on the same instance, after MariaDB was upgraded to
10.1.17, the following query started to execute very slowly, even after
`OPTIMIZE TABLE obj`:
MariaDB [(none)]> SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
+--------------------+
| tid |
+--------------------+
| 268707072758797063 |
+--------------------+
1 row in set (4.82 sec)
Both explain and analyze say the query will use the `partition` key, but only partially (note that key_len is only 10, not 18):
MariaDB [(none)]> SHOW INDEX FROM neo1.obj;
+-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| obj | 0 | PRIMARY | 1 | partition | A | 28755928 | NULL | NULL | | BTREE | | |
| obj | 0 | PRIMARY | 2 | tid | A | 28755928 | NULL | NULL | | BTREE | | |
| obj | 0 | PRIMARY | 3 | oid | A | 28755928 | NULL | NULL | | BTREE | | |
| obj | 0 | partition | 1 | partition | A | 28755928 | NULL | NULL | | BTREE | | |
| obj | 0 | partition | 2 | oid | A | 28755928 | NULL | NULL | | BTREE | | |
| obj | 0 | partition | 3 | tid | A | 28755928 | NULL | NULL | | BTREE | | |
| obj | 1 | data_id | 1 | data_id | A | 28755928 | NULL | NULL | YES | BTREE | | |
+-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
7 rows in set (0.00 sec)
MariaDB [(none)]> explain SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
+------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
| 1 | SIMPLE | obj | ref | PRIMARY,partition | partition | 10 | const,const | 2 | Using where; Using index |
+------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
1 row in set (0.00 sec)
MariaDB [(none)]> analyze SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
+------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra |
+------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
| 1 | SIMPLE | obj | ref | PRIMARY,partition | partition | 10 | const,const | 2 | 9741121.00 | 100.00 | 0.00 | Using where; Using index |
+------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
1 row in set (4.93 sec)
Explicitly forcing use of the (partition, oid, tid) index, which is designed precisely to serve this and similar queries, keeps the query fast:
MariaDB [(none)]> analyze SELECT tid FROM neo1.obj FORCE INDEX(`partition`) WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
+------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra |
+------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
| 1 | SIMPLE | obj | range | partition | partition | 18 | NULL | 2 | 1.00 | 100.00 | 100.00 | Using where; Using index |
+------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
1 row in set (0.00 sec)
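In code, the fix amounts to adding the index hint to the generated SQL. A sketch of building such a query (`get_next_tid_query` is a hypothetical helper, not NEO's actual `_getNextTID` implementation, and real code would use parameterized queries):

```python
def get_next_tid_query(partition, oid, tid):
    # FORCE INDEX makes MariaDB use the whole (partition, oid, tid)
    # index, instead of the 10-byte prefix the optimizer was picking.
    return ("SELECT tid FROM obj FORCE INDEX(`partition`)"
            " WHERE `partition`=%d AND oid=%d AND tid>%d"
            " ORDER BY tid LIMIT 1" % (partition, oid, tid))
```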
/cc @jm, @vpelltier, @Tyagov
/reviewed-on nexedi/neoppod!1 (https://lab.nexedi.com/nexedi/neoppod/merge_requests/1)

https://lab.nexedi.com/nexedi/neoppod/-/commit/c39d5c670e55f4f739d4dd633fe862c37ecde9ab
Add support for latest versions of ZODB (4.4.3 & 5.0.1) (2016-09-12T23:51:10+02:00, Julien Muchembled <jm@nexedi.com>)
Many patches have been merged upstream :)
A notable change is that lastTransaction() does not ping the master anymore
(but it still causes a connection to the master if the client is disconnected).

https://lab.nexedi.com/nexedi/neoppod/-/commit/13911ca3d871a427fdf098c00558023ab76683a8
mysql: fix use of wrong SQL index when checking for dropped partitions (2016-08-29T15:57:30+02:00, Julien Muchembled <jm@nexedi.com>)
After partitions were dropped with TokuDB, we had a case where MariaDB 10.1.14
stopped using the most appropriate index.
MariaDB [neo0]> explain SELECT DISTINCT data_id FROM obj WHERE `partition`=5;
+------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
| 1 | SIMPLE | obj | range | PRIMARY,partition | data_id | 11 | NULL | 10 | Using where; Using index for group-by |
+------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
MariaDB [neo0]> SELECT SQL_NO_CACHE DISTINCT data_id FROM obj WHERE `partition`=5;
Empty set (1 min 51.47 sec)
Expected:
MariaDB [neo1]> explain SELECT DISTINCT data_id FROM obj WHERE `partition`=4;
+------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
| 1 | SIMPLE | obj | ref | PRIMARY,partition | PRIMARY | 2 | const | 1 | Using where; Using temporary |
+------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
1 row in set (0.00 sec)
MariaDB [neo1]> SELECT SQL_NO_CACHE DISTINCT data_id FROM obj WHERE `partition`=4;
Empty set (0.00 sec)
Restarting the server or 'OPTIMIZE TABLE obj;' does not help.
Such an issue could prevent the cluster from starting, due to timeouts always
sending it back to the RECOVERING state.

https://lab.nexedi.com/nexedi/neoppod/-/commit/00ffb1eff8f4e6e58e3bb14bbf32d5346f8638b2
Update TODO (2016-08-29T15:30:29+02:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/df990a05696cf880096d2f4180ed8d7573b975cd
Add test to check that a moved cell doesn't cause POSKeyError (2016-08-11T19:26:05+02:00, Julien Muchembled <jm@nexedi.com>)
Freeing disk space when a cell is dropped will have to be implemented with care,
not only for performance reasons.

https://lab.nexedi.com/nexedi/neoppod/-/commit/c3c2ffe2267146ebaaa1d1171dbfe8e9db36c7f3
mysql: do not use unsafe TRUNCATE statement (2016-08-11T11:50:26+02:00, Julien Muchembled <jm@nexedi.com>)
TRUNCATE was chosen for performance reasons, but it's usually done on small
tables, and not for performance-critical operations. TRUNCATE commits
implicitly, so for pt/ttrans in particular, it's certainly slower due to extra
fsyncs to disk.
On the other hand, committing too early can corrupt the database if the storage
node is stopped just after. For example, a failure in changePartitionTable()
can cause 'pt' to remain empty.

https://lab.nexedi.com/nexedi/neoppod/-/commit/e25fa5d994ed5b1a41ad7293bec13166a99f2edb
storage: speed up transaction registration (2016-08-01T19:02:11+02:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/c3d3dabdee2485aa83b843b4613e3199a2000015
storage: remove uuid index in TransactionManager (2016-08-01T18:31:20+02:00, Julien Muchembled <jm@nexedi.com>)
It slowed down everything but abortFor(), which is not performance-critical.

https://lab.nexedi.com/nexedi/neoppod/-/commit/2d3880482794659d68ca5f54294e794d78965f7b
storage: review TransactionManager.abortFor (2016-07-31T23:38:56+02:00, Julien Muchembled <jm@nexedi.com>)
This partially reverts commit 7aecdada ("storage: fix crash when a client disconnects just after it requested to finish a transaction").
There seems to be no bug here, because:
- abortFor() is only called upon a notification from the master that a client
is disconnected,
- and from the same TCP connection, we only receive a LockInformation packet
if there's still such a transaction on the master side.
The code removed in abortFor() was redundant with abort().

https://lab.nexedi.com/nexedi/neoppod/-/commit/cb144fdb2a36b20a032602c727c8ccc5dd1e876a
Reenable checkTransactionalUndoIterator (2016-07-27T21:27:15+02:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/38583af937e2ca0c568f960345c6426daa720ed9
client: better exception handling in tpc_abort (2016-07-27T19:15:43+02:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/7713215702fbf296458ff1001a3210f4deeeabad
client: do not limit the number of open connections to storage nodes (2016-07-27T19:15:43+02:00, Julien Muchembled <jm@nexedi.com>)
There was a bug where connections were not maintained during a TPC,
which caused transactions to be aborted when the limit was reached.
Given that oids are spread evenly over all partitions, and that clients always
write to all cells of each involved partition, clients would spend their time
reconnecting to storage nodes as soon as the limit was reached. So such a
feature really looks counter-productive.

https://lab.nexedi.com/nexedi/neoppod/-/commit/cfe1b5ca0bd956af6f02cdfeb0e2db5d437dedaa
client: small optimization when iterating over storage connections (2016-07-27T19:15:43+02:00, Julien Muchembled <jm@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/11d83ad999579306398a83172947521609b9b992
client: fix conflict of node id by never reading from storage without being c... (2016-07-27T19:15:43+02:00, Julien Muchembled <jm@nexedi.com>)
Client nodes ignored the state of the connection to the master node when reading
data from storage, as long as their partition tables were recent enough. This
way, they were able to finish read-only transactions even if they couldn't reach
the master, which could be useful for high availability. The downside is that
the master node ignored that their node ids were still in use, which caused
"uuid" conflicts when reallocating them.
Rejected solutions:
- An unused NEO Storage should not insist on staying connected to the master
  node.
- Reverting to big random node identifiers is a lot of work and it would make
  debugging annoying (see commit 23fad3af, "Reduce size of UUIDs to 4-bytes").
- Always increasing node ids could have been a simple solution if we accepted
  that the cluster dies once all 2^24 possible ids have been allocated.
Given that reading from storage without being connected to the master can only
be useful to finish the current transaction (because we always ping the master
at the beginning of every transaction), keeping such feature is not worth the
effort.
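The approach this commit settles on, described below, can be sketched as a retry loop around a dedicated exception. All names here (`NEOPrimaryMasterLost`, `run_transaction`, `connect_to_primary`) are hypothetical stand-ins for illustration, not NEO's real API:

```python
class NEOPrimaryMasterLost(Exception):
    """Raised when the partition table was cleared after a primary
    master failure; the transaction should be restarted."""

def run_transaction(app, body, max_retries=3):
    for attempt in range(max_retries):
        try:
            if app.pt is None:
                # A cleared partition table forces reconnection to the
                # master before any storage node can be queried.
                app.connect_to_primary()
            return body()
        except NEOPrimaryMasterLost:
            app.pt = None  # force re-identification, then retry
    raise RuntimeError("primary master unreachable")
```

The user-visible effect is that a temporary connection failure restarts the transaction transparently instead of surfacing an error.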
This commit fixes id conflicts in a very simple way, by clearing the partition
table upon primary node failure, which forces reconnection to the master before
querying any storage node. In such case, we raise a special exception that will
cause the transaction to be restarted, so that the user does not get errors for
temporary connection failures.

https://lab.nexedi.com/nexedi/neoppod/-/commit/4e17456bee893653b5ae71dbe09b8dbeda27b62b
storage: add comment about the idea to lock an oid before reporting a resolva... (2016-07-27T19:15:38+02:00, Julien Muchembled <jm@nexedi.com>)
Currently, another argument not to lock is that we would not be able to test
incremental resolution anymore. We can think about this again when deadlock
resolution is implemented.

https://lab.nexedi.com/nexedi/neoppod/-/commit/8b91706a958df09d475a5ea18ee72c4c7678dab0
Fix race conditions in EventManager between _poll/connection_dict and (un)reg... (2016-07-24T20:53:14+02:00, Julien Muchembled <jm@nexedi.com>)
The following error was reported on a client node:
#0x0000 Error < None (2001:...:2051)
1 (Retry Later)
connection closed for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, closed, client) at 7f1ea7c42f90>
Event Manager:
connection started for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10>
#0x0000 RequestIdentification > None (2001:...:2051)
Readers: []
Writers: []
Connections:
13: <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10> (pending=False)
Node manager : 1 nodes
* None | MASTER | 2001:...:2051 | UNKNOWN
<ClientCache history_size=0 oid_count=0 size=0 time=0 queue_length=[0] (life_time=10000 max_history_size=100000 max_size=20971520)>
poll raised, retrying
Traceback (most recent call last):
File "neo/lib/threaded_app.py", line 93, in _run
poll(1)
File "neo/lib/event.py", line 134, in poll
self._poll(0)
File "neo/lib/event.py", line 164, in _poll
conn = self.connection_dict[fd]
KeyError: 13
which means that:
- while the poll thread is getting a (13, EPOLLIN) event because it is
closed (aborted by the master)
- another thread processes the error packet, by closing it in
PrimaryBootstrapHandler.notReady
- next, the poll thread resumes the execution of EpollEventManager._poll
and fails to find fd=13 in self.connection_dict
So here, we have a race condition between epoll_wait and any further use
of connection_dict to map returned fds.
However, what commit a4731a0c ("Fix invalid processing of unregistered
connections") does to handle the case of fd reallocation only works for
mono-threaded applications.
In EPOLLIN, wrapping 'self.connection_dict[fd]' the same way as for other
events is not enough. For example:
- case 1:
- thread 1: epoll returns fd=13
- thread 2: close(13)
- thread 2: open(13)
- thread 1: self.connection_dict[13] does not match
but this would be handled by the 'unregistered' list
- case 2:
- thread 1: reset 'unregistered'
- thread 2: close(13)
- thread 2: open(13)
- thread 1: epoll returns fd=13
- thread 1: self.connection_dict[13] matches
but it would be wrongly ignored by 'unregistered'
- case 3:
- thread 1: about to call readable/writable/onTimeout on a connection
- thread 2: this connection is closed
- thread 1: readable/writable/onTimeout wrongly called on a closed connection
We could protect _poll() with a lock, and make unregister() use wakeup() so
that it gets a chance to acquire it, but that causes threaded tests to deadlock
(continuing in this direction seems too complicated).
So we have to deal with the fact that there can be race conditions at any time
and there's no way to make 'connection_dict' match exactly what epoll returns.
We solve this by preventing fd reallocation inside _poll(), which is fortunately
possible with sockets, using 'shutdown': the closing of fds is delayed.
For above case 3, readable/writable/onTimeout for MTClientConnection are also
changed to test whether the connection is still open while it has the lock.
Just for safety, we do the same for 'process'.
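The shutdown trick can be sketched as below. This is an illustrative stand-in, not NEO's EventManager: `DeferredCloser`, `unregister` and `flush` are made-up names. The key property is that `shutdown()` stops all I/O but keeps the file descriptor number allocated, so the kernel cannot hand the same fd to a new connection while the polling thread still holds the old mapping; the actual `close()` is deferred to a safe point.

```python
import socket

class DeferredCloser:
    """Sketch: disable sockets immediately, free their fds later."""

    def __init__(self):
        self._pending_close = []

    def unregister(self, sock):
        # Callable from any thread: the socket stops being readable or
        # writable, but its fd number stays reserved.
        sock.shutdown(socket.SHUT_RDWR)
        self._pending_close.append(sock)

    def flush(self):
        # Called from the polling thread, outside _poll(): only now are
        # the fds actually released for reallocation.
        for sock in self._pending_close:
            sock.close()
        del self._pending_close[:]
```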
At last, another kind of race condition that this commit also fixes concerns
the use of itervalues() on EventManager.connection_dict.

https://lab.nexedi.com/nexedi/neoppod/-/commit/4a0b936f347d03989eb9367d84801b6b74394e32
Indent many lines before any real change (2016-07-24T20:53:14+02:00, Julien Muchembled <jm@nexedi.com>)
This is a preliminary commit, without any functional change,
just to make the next one easier to review.

https://lab.nexedi.com/nexedi/neoppod/-/commit/9f4dd15e13b3667ec8ac787347f0198a390b179d
client: remove redundant check of new connections to the master (2016-07-24T20:53:14+02:00, Julien Muchembled <jm@nexedi.com>)
We already have logs when a connection fails,
and ask() raises ConnectionClosed if the connection is closed.

https://lab.nexedi.com/nexedi/neoppod/-/commit/e791dc3f7aca4b2d69c776baf67c7eae11faf582
Control verbose locking via an environment variable (2016-07-24T16:55:04+02:00, Vincent Pelletier <vincent@nexedi.com>)

https://lab.nexedi.com/nexedi/neoppod/-/commit/b7e0ec7f880619003d5eac3a676133de16dca90a
client: avoid (harmless) variable shadowing (2016-07-24T16:47:07+02:00, Vincent Pelletier <vincent@nexedi.com>)