RC = Release Critical (for next release)
Documentation
- Clarify the meaning of node states, and consider renaming them in the code.
Ideas:
TEMPORARILY_DOWN becomes UNAVAILABLE
BROKEN is removed?
- Clarify the use of each error code:
- NOT_READY removed (connection kept open until ready)
- Split PROTOCOL_ERROR (BAD IDENTIFICATION, ...)
- Add docstrings (think of doctests)
Code
Code changes often impact more than just one node. They are categorised by
the node where the most important changes are needed.
General
RC - Review XXX in the code (CODE)
RC - Review TODO in the code (CODE)
RC - Review output of pylint (CODE)
- tpc_finish might raise even though the transaction was successfully committed.
This can happen if the client gets disconnected from the primary master while
waiting for AnswerFinishTransaction, after the primary has received the
request: the master will then commit the transaction regardless of the
client's presence. The client could legitimately think the transaction is not
committed, and might decide to retry. To solve this, the client can find out
whether its TTID was successfully committed by looking at the currently
unused '(t)trans.ttid' column.
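A minimal sketch of the lookup this implies on the storage's MySQL backend
(hypothetical helper; table and column names follow the '(t)trans.ttid'
description above, and the way the database is reached is an assumption):

    import MySQLdb

    def is_ttid_committed(conn, ttid):
        # Return True if a transaction carrying this TTID was committed,
        # i.e. it appears in the 'trans' table.
        cur = conn.cursor()
        cur.execute("SELECT 1 FROM trans WHERE ttid = %s LIMIT 1", (ttid,))
        return cur.fetchone() is not None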
- Keep-alive (HIGH AVAILABILITY) (implemented, to be reviewed and tested)
Consider the need to implement a keep-alive system (packets sent
automatically when there is no activity on the connection for a period
of time).
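A generic sketch of what such a mechanism looks like (illustrative only, not
the existing implementation):

    import time

    KEEP_ALIVE_TIMEOUT = 60  # seconds of silence before pinging (assumed value)

    class KeepAlive(object):
        # Track the last activity on a connection and ping when it goes idle.
        def __init__(self, send_ping):
            self._send_ping = send_ping
            self._last_activity = time.time()

        def on_traffic(self):
            # Call whenever a packet is sent or received on the connection.
            self._last_activity = time.time()

        def poll(self):
            # Call periodically from the event loop.
            if time.time() - self._last_activity > KEEP_ALIVE_TIMEOUT:
                self._send_ping()  # e.g. a Ping packet expecting a Pong back
                self._last_activity = time.time()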
- Consider using multicast for cluster-wide notifications. (BANDWIDTH)
Currently, multi-receiver notifications are sent by unicast to each
receiver. Multicast should be used.
- Remove sleeps (LATENCY, CPU WASTE)
Code still contains many delays (explicit sleeps or polling timeouts).
They must be replaced by waits that are either infinite (sleep until some
condition becomes true, without waking up needlessly in the meantime) or
null (don't wait at all).
- Implement delayed connection acceptance.
Currently, any node that connects too early to another node that is busy for
some reason is immediately rejected with the 'not ready' error code. This
should be replaced by a queue on the listening node that keeps a pool of
nodes to be accepted later, when the conditions are satisfied.
This is mainly the case for:
- Clients rejected before the cluster is operational
- Empty storages rejected during the recovery process
Masters involved in the election process should still reject any connection
as long as the primary master is unknown.
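A possible shape for such a queue on the listening node (hypothetical class,
not existing NEO code):

    class PendingAcceptQueue(object):
        # Hold connections that arrived too early instead of rejecting them.
        def __init__(self):
            self._pending = []

        def defer(self, conn):
            # Instead of answering NOT_READY, park the connection here.
            self._pending.append(conn)

        def acceptAll(self, identify):
            # Called when the node becomes ready (e.g. the cluster became
            # operational): resume normal identification of parked connections.
            pending, self._pending = self._pending, []
            for conn in pending:
                identify(conn)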
- Connections must support 2 simultaneous handlers (CODE)
Connections currently define only one handler, which is enough for
single-threaded code. But when using multithreaded code, there are 2
possible handlers involved in a packet reception:
- The first one handles notifications only (nothing special to do
regarding multithreading)
- The second one handles expected messages (such messages must be
directed to the right thread)
It must be possible to set the second handler on the connection when that
connection is thread-safe (MT version of the connection classes).
Also, the code that detects whether a response is expected or not must be
made generic and moved out of handlers.
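A sketch of how expected answers could be routed to the asking thread in the
MT case (illustrative; the class, names and the way message ids reach the
dispatcher are assumptions):

    import threading
    try:
        import queue            # Python 3
    except ImportError:
        import Queue as queue   # Python 2

    class MTDispatcher(object):
        # Route expected answers to the thread that asked for them; everything
        # else goes through the usual notification handler.
        def __init__(self, notification_handler):
            self._notification_handler = notification_handler
            self._expected = {}              # msg_id -> queue of the asking thread
            self._lock = threading.Lock()

        def expect(self, msg_id):
            # Called by the asking thread before sending; it then blocks on q.get().
            q = queue.Queue(1)
            with self._lock:
                self._expected[msg_id] = q
            return q

        def dispatch(self, conn, msg_id, packet):
            with self._lock:
                q = self._expected.pop(msg_id, None)
            if q is None:
                # Not an expected answer: plain notification.
                self._notification_handler.dispatch(conn, packet)
            else:
                q.put(packet)                # wake up the asking thread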
- Implement transaction garbage collection API (FEATURE)
NEO's packing implementation does not update transaction metadata when
deleting object revisions. It must be possible to clean up this
inconsistency from a client application, much in the same way the garbage
collection part of packing is done.
- Factorise node initialisation for admin, client and storage (CODE)
The same code to ask/receive node list and partition table exists in too
many places.
- Clarify which handler methods to call when a connection is accepted from a
listening connection and when the remote node is identified
(cf. neo/lib/bootstrap.py).
- Choose how to handle storage integrity verification when a storage comes back.
Run the replication process or the verification stage, with or without
unfinished transactions? Do cells have to be set as outdated, and if so,
should the partition table changes be broadcast? (BANDWIDTH, SPEED)
- Make SIGINT on the primary master switch the cluster to the STOPPING state.
- Review PENDING/HIDDEN/SHUTDOWN states, don't use notifyNodeInformation()
to do a state switch; use an exception-based mechanism? (CODE)
- Split protocol.py into a 'protocol' module?
- Review handler split (CODE)
The current handler split is the result of small incremental changes. A
global review is required to make it consistent.
- Make handler instances become singletons (SPEED, MEMORY)
In some places handlers are instantiated outside of App.__init__ . As a
handler is completely re-entrant (no modifiable properties) it can and
should be made a singleton (this saves the CPU time needed to instantiate
all the copies - often when a connection is established - and the memory
used by each copy).
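For instance (sketch; the handler and connection names are illustrative, and
the pattern only holds as long as handlers really keep no per-connection
state):

    class ExampleHandler(object):
        # A handler with no mutable state can safely be shared.
        def __init__(self, app):
            self.app = app                  # read-only after construction

        def packetReceived(self, conn, packet):
            pass                            # dispatch as usual

    class App(object):
        def __init__(self):
            # One instance per handler class, created once and reused for
            # every connection, instead of instantiating on each connect.
            self.example_handler = ExampleHandler(self)

        def onNewConnection(self, conn):
            conn.setHandler(self.example_handler)   # shared instance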
- Consider replacing the setNodeState admin packet with one packet per action,
like dropNode, to reduce packet processing complexity and prevent bad
actions such as setting a node to the TEMPORARILY_DOWN state.
- Review node notifications. E.g. a storage does not have to be notified of
new clients, but only when one is lost.
- Review transactional isolation of various methods
Some methods might not implement proper transaction isolation when they
should. An example is object history (undoLog), which can see data
committed by future transactions.
Storage
- Use HailDB instead of a stand-alone MySQL server.
- Notify master when storage becomes available for clients (LATENCY)
Currently, storage presence is broadcast to client nodes too early, as
the storage node will refuse them until it has fully up-to-date data (not
only up-to-date cells, but also a partition table and node states).
- Create a specialized PartitionTable that knows the database and replicator,
to remove duplication and move logic out of handlers (CODE)
- Consider inserting multiple objects at a time in the database, taking care
of the maximum allowed SQL request size. (SPEED)
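A sketch of such batching with MySQLdb's executemany, keeping batches under a
size limit (the column names, row layout and threshold are illustrative, not
the actual schema):

    MAX_BATCH_BYTES = 4 * 1024 * 1024  # stay well below MySQL's max_allowed_packet

    def store_objects(cursor, rows):
        # rows: iterable of (oid, tid, compression, checksum, value) tuples.
        def flush(batch):
            cursor.executemany(
                "INSERT INTO obj (oid, tid, compression, checksum, value)"
                " VALUES (%s, %s, %s, %s, %s)", batch)
        batch, batch_bytes = [], 0
        for row in rows:
            batch.append(row)
            batch_bytes += len(row[-1]) + 64   # rough per-row overhead estimate
            if batch_bytes >= MAX_BATCH_BYTES:
                flush(batch)
                batch, batch_bytes = [], 0
        if batch:
            flush(batch)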
- Prevent SQL injection: escape() from the MySQLdb API is not sufficient;
consider using query(request, args) instead of query(request % args).
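I.e. pass the arguments separately so that the driver escapes and quotes them
(sketch; the table and column names are only illustrative):

    import MySQLdb

    def load_object(conn, oid):
        cur = conn.cursor()
        # Unsafe pattern: "SELECT ... WHERE oid = %s" % oid lets a crafted
        # value change the query text. Parameterized form instead:
        cur.execute(
            "SELECT compression, checksum, value FROM obj WHERE oid = %s",
            (oid,))
        return cur.fetchone()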
- Make listening address and port optional, and if they are not provided
listen on all interfaces on any available port.
- Make replication speed configurable (HIGH AVAILABILITY)
In its current implementation, replication runs at the lowest priority, so
as not to degrade performance for client nodes. But when there's only 1
storage left for a partition, it may be desirable to guarantee a minimum
speed to avoid complete data loss if another failure happens too early.
- Pack segmentation & throttling (HIGH AVAILABILITY)
In its current implementation, pack runs in one call on all storage nodes
at the same time, which locks down the whole cluster. This task should
be split into chunks and processed in "background" on storage nodes.
Pack throttling should probably be at the lowest possible priority
(below interactive use and below replication).
- Propagate tpc_finish failures to the master (FUNCTIONALITY)
When asked to lock transaction data, the master node must be informed if
something goes wrong.
- Verify data checksum on reception (FUNCTIONALITY)
In the current implementation, the client generates a checksum before
storing, which is only checked upon load. This doesn't prevent storing
altered data, which misses the point of having a checksum, and creates
weird decisions (ex: if checksum verification fails on load, what should
be done? hope to find a storage with a valid checksum? assume that data
is correct in storage but was altered while it travelled through the
network as we loaded it?).
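A sketch of the storage-side check on reception (the hash algorithm and the
way the checksum reaches the storage are assumptions here, not NEO's actual
protocol):

    import hashlib

    class ChecksumError(Exception):
        pass

    def verify_on_store(data, checksum):
        # Reject a store whose payload does not match the checksum sent by
        # the client, instead of silently writing corrupted data.
        if hashlib.sha1(data).digest() != checksum:
            raise ChecksumError("refusing to store corrupted data")
        return data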
- Check replicas: (HIGH AVAILABILITY)
- Automatically tell corrupted cells to fix their data when a good source
is known.
- Add an option to also check all rows of trans/obj/data, instead of only
keys (trans.tid & obj.{tid,oid}).
Master
- Master node data redundancy (HIGH AVAILABILITY)
Secondary master nodes should replicate primary master data (i.e. the
primary master should inform them of such changes).
This data takes too long to extract from storage nodes, and losing it
increases the risk of starting from underestimated values.
This risk is (currently) unavoidable when all nodes stop running, but this
case must be avoided.
- Differential partition table updates (BANDWIDTH)
When a storage asks for the current partition table (when it connects to a
cluster in the service state), it must update its knowledge of the partition
table. Currently this is done by fetching the entire table. If the master
kept a history of the last few changes to the partition table, it would be
able to send only a differential update (via the incremental update
mechanism).
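A sketch of the bounded history the master could keep (a hypothetical
structure, assuming partition table IDs increase by 1 per change):

    from collections import deque

    class PartitionTableHistory(object):
        # Keep the last few partition table changes so a storage knowing
        # partition table ID `ptid` can be brought up to date with a diff.
        def __init__(self, max_entries=100):
            self._changes = deque(maxlen=max_entries)  # [(ptid, cell_list), ...]

        def record(self, ptid, cell_list):
            self._changes.append((ptid, cell_list))

        def diff_since(self, ptid, current_ptid):
            # Return the changes needed to go from `ptid` to `current_ptid`,
            # or None when the history doesn't reach back far enough (the
            # caller then falls back to sending the full table).
            if ptid == current_ptid:
                return []
            changes = [(p, c) for p, c in self._changes if p > ptid]
            if not changes or changes[0][0] != ptid + 1:
                return None
            return changes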
- During the recovery phase, store multiple partition tables (ADMINISTRATION)
When storage nodes know different versions of the partition table, the
master should be able to present them to the admin so that one can be
chosen when moving on to the next phase.
- Optimize operational status check by recording which rows are ready
instead of parsing the whole partition table. (SPEED)
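For example, the master could maintain that set incrementally (sketch with
illustrative names):

    class OperationalStatus(object):
        # Track which partition table rows have at least one readable cell,
        # so checking whether the cluster is operational is O(1).
        def __init__(self, num_partitions):
            self._num_partitions = num_partitions
            self._ready_rows = set()

        def updateRow(self, offset, has_readable_cell):
            # Call whenever the cells of row `offset` change state.
            if has_readable_cell:
                self._ready_rows.add(offset)
            else:
                self._ready_rows.discard(offset)

        def isOperational(self):
            return len(self._ready_rows) == self._num_partitions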
- Improve partition table tweaking algorithm to reduce differences between
frequently and rarely used nodes (SCALABILITY)
- Propagate tpc_finish failures to the client (FUNCTIONALITY)
When a storage node notifies a problem during the lock/unlock phase, an
error must be propagated to the client.
Client
- Implement C version of mq.py (LOAD LATENCY)
- Use generic bootstrap module (CODE)
- Find a way to call ask() from the poll thread, to allow sending the
initial packet (requestNodeIdentification) from the connectionCompleted()
event instead of from the app. This requires knowing which thread will wait
for the answer.
- Discuss dead storage notification. If a client fails to connect to a
storage node that is supposed to be in the running state, it should notify
the master so that it can check whether this node is really up or not.
- Implement restore() ZODB API method to bypass consistency checks during
imports.
- tpc_finish failures (FUNCTIONALITY)
New failure cases during tpc_finish must be handled.
- Fix and reenable deadlock avoidance (SPEED). This is required for
neo.threaded.test.Test.testDeadlockAvoidance
Admin
- Make admin node able to monitor multiple clusters simultaneously
- Send notifications (e.g. mail) when a storage node is lost
Tests
- Use another mock library that is eggified and maintained.
See http://garybernhardt.github.com/python-mock-comparison/
for a comparison of available mocking libraries/frameworks.
- Fix epoll descriptor leak.
Later
- Consider auto-generating the cluster name upon initial startup (it might
actually be a partition property).
- Consider ways to centralise the configuration file, or make the
configuration updatable automatically on all nodes.
- Consider storing some metadata on master nodes (partition table [version],
...). This data should be treated non-authoritatively, as a way to lower
the probability of using an outdated partition table.
- Decentralize primary master tasks as much as possible (consider
distributed lock mechanisms, ...)
- Choose how to compute the storage size
- Make storage check whether the OID matches its partitions during a store
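E.g. (sketch, assuming the usual mapping of an OID to a partition by modulo):

    import struct

    def check_oid_assignment(oid, num_partitions, assigned_partitions):
        # Reject a store whose 8-byte OID does not map to a partition
        # assigned to this storage node.
        partition = struct.unpack('!Q', oid)[0] % num_partitions
        if partition not in assigned_partitions:
            raise ValueError('OID %r does not belong to this storage' % (oid,))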
- Investigate delta compression for stored data
The idea would be to store the few most recent revisions fully, and older
revisions delta-compressed, in order to save space.