RC = Release Critical (for next release)

  Documentation
    - Clarify the meaning of node states, and consider renaming them in the code.
      Ideas:
        TEMPORARILY_DOWN becomes UNAVAILABLE
        BROKEN is removed ?
    - Clarify the use of each error code:
      - NOT_READY removed (connection kept open until ready)
      - Split PROTOCOL_ERROR (BAD IDENTIFICATION, ...)
RC  - Clarify the meaning of cell states
    - Add docstrings (think of doctests)

  Code

    Code changes often impact more than just one node. They are categorised by
    node where the most important changes are needed.

    General
RC  - Review XXX in the code (CODE)
RC  - Review TODO in the code (CODE)
RC  - Review output of pylint (CODE)
    - Keep-alive (HIGH AVAILABILITY) (implemented, to be reviewed and tested)
      Consider the need to implement a keep-alive system (packets sent
      automatically when there is no activity on the connection for a period
      of time).
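      A minimal sketch of the idea, assuming a hypothetical connection object
      exposing a last_activity timestamp (names are illustrative, not NEO's
      actual API):

        import time

        KEEP_ALIVE_TIMEOUT = 60  # arbitrary number of seconds of silence

        def check_keep_alive(connection_list, send_ping):
            # Send a ping on every connection that has been idle for too
            # long; send_ping stands for whatever emits the keep-alive
            # packet, and a peer that does not answer is eventually
            # considered down.
            now = time.time()
            for conn in connection_list:
                if now - conn.last_activity > KEEP_ALIVE_TIMEOUT:
                    send_ping(conn)
                    conn.last_activity = now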
    - Factorise packet data when sending partition table cells (BANDWIDTH)
      Currently, each cell in a partition table update contains the UUIDs of all
      involved nodes.
      This must be changed to a correspondence table using shorter keys (sent
      in the packet) to avoid repeating the same UUIDs many times.
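      A sketch of what the correspondence table could look like, with
      hypothetical (offset, uuid, state) cell tuples; the actual packet
      encoding is out of scope here:

        def compact_cell_list(cell_list):
            # Replace repeated UUIDs by short integer keys. Returns
            # (uuid_table, compact_cells): cells reference a UUID by its
            # index in uuid_table, so each UUID is sent only once per packet.
            uuid_table = []
            key_of = {}
            compact_cells = []
            for offset, uuid, state in cell_list:
                key = key_of.get(uuid)
                if key is None:
                    key = key_of[uuid] = len(uuid_table)
                    uuid_table.append(uuid)
                compact_cells.append((offset, key, state))
            return uuid_table, compact_cells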
    - Consider using multicast for cluster-wide notifications. (BANDWIDTH)
      Currently, multi-receiver notifications are sent in unicast to each
      receiver. Multicast should be used instead.
    - Remove sleeps (LATENCY, CPU WASTE)
      The code still contains many delays (explicit sleeps or polling timeouts).
      They must be replaced by waits that are either infinite (sleep until some
      condition becomes true, without waking up needlessly in the meantime) or
      null (don't wait at all).
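      For the 'infinite' case, a condition variable is the usual replacement
      for polling; a minimal sketch, independent of NEO's actual event loop:

        import threading

        class OneShotEvent(object):
            # Wait until a condition becomes true, without periodic wake-ups.
            def __init__(self):
                self._cond = threading.Condition()
                self._ready = False

            def wait(self):
                with self._cond:
                    while not self._ready:
                        self._cond.wait()  # sleeps until set(), no polling

            def set(self):
                with self._cond:
                    self._ready = True
                    self._cond.notify_all()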
    - Implement delayed connection acceptance.
      Currently, any node that connects too early to another that is busy for
      some reason is immediately rejected with the 'not ready' error code. This
      should be replaced by a queue in the listening node that keeps a pool of
      nodes to be accepted later, when the conditions are satisfied.
      This is mainly the case for:
        - Clients rejected before the cluster is operational
        - Empty storages rejected during the recovery process
      Masters involved in the election process should still reject any
      connection, as the primary master is still unknown.
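      A sketch of such a queue; is_ready and accept are placeholders for the
      real identification logic:

        from collections import deque

        class DelayedAcceptance(object):
            # Keep identification requests pending (connection stays open)
            # until the listening node is ready to serve them.
            def __init__(self, is_ready, accept):
                self._pending = deque()
                self._is_ready = is_ready
                self._accept = accept

            def requestIdentification(self, conn, request):
                if self._is_ready(request):
                    self._accept(conn, request)
                else:
                    self._pending.append((conn, request))

            def retryPending(self):
                # To be called when local conditions change (e.g. the
                # cluster becomes operational): re-examine queued requests.
                pending, self._pending = self._pending, deque()
                for conn, request in pending:
                    self.requestIdentification(conn, request)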
    - Connections must support 2 simultaneous handlers (CODE)
      Connections currently define only one handler, which is enough for
      monothreaded code. But when using multithreaded code, there are 2
      possible handlers involved in a packet reception:
      - The first one handles notifications only (nothing special to do
        regarding multithreading)
      - The second one handles expected messages (such messages must be
        directed to the right thread)
      It must be possible to set the second handler on the connection when that
      connection is thread-safe (MT version of connection classes).
      Also, the code that detects whether a response is expected or not must be
      made generic and moved out of handlers.
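      A sketch of how the two handlers could coexist on a thread-safe
      connection; all names are illustrative, and the dispatcher that wakes
      the requesting thread is assumed to exist elsewhere:

        class MTConnection(object):
            # Thread-safe connection with a notification handler plus a
            # dispatcher for answers to pending requests.
            def __init__(self, notification_handler, answer_dispatcher):
                self._notification_handler = notification_handler
                self._answer_dispatcher = answer_dispatcher

            def packetReceived(self, packet):
                # Generic "is this an answer ?" check, moved out of handlers.
                if packet.isResponse():
                    # Route the answer to the thread that sent the request.
                    self._answer_dispatcher.dispatch(self, packet)
                else:
                    self._notification_handler.handle(self, packet)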
    - Implement transaction garbage collection API (FEATURE)
      NEO's packing implementation does not update transaction metadata when
      deleting object revisions. It must be possible to clean up this
      inconsistency from a client application, much in the same way the garbage
      collection part of packing is done.
    - Factorise node initialisation for admin, client and storage (CODE)
      The same code to ask/receive node list and partition table exists in too
      many places.
    - Clarify which handler methods to call when a connection is accepted from a
      listening connection and when the remote node is identified
      (cf. neo/lib/bootstrap.py).
    - Choose how to handle storage integrity verification when a storage comes
      back: run the replication process or the verification stage, with or
      without unfinished transactions ? Do cells have to be set as outdated,
      and if so, should the partition table changes be broadcast ?
      (BANDWIDTH, SPEED)
    - Implement proper shutdown (ClusterStates.STOPPING)
    - Review PENDING/HIDDEN/SHUTDOWN states, don't use notifyNodeInformation()
      to do a state switch, use an exception-based mechanism ? (CODE)
    - Split protocol.py into a 'protocol' module ?
    - Review handler split (CODE)
      The current handler split is the result of small incremental changes. A
      global review is required to make the split consistent.
    - Make handler instances become singletons (SPEED, MEMORY)
      In some places handlers are instantiated outside of App.__init__ . As a
      handler is completely re-entrant (no modifiable properties), it can and
      should be made a singleton (this saves the CPU time needed to instantiate
      all the copies, often when a connection is established, and the memory
      used by each copy).
    - Consider replacing the setNodeState admin packet with one packet per
      action, like dropNode, to reduce packet processing complexity and prevent
      bad actions like setting a node to TEMPORARILY_DOWN state.
    - Review node notifications. E.g. a storage doesn't have to be notified of
      new clients, but only when one is lost.
    - Review transactional isolation of various methods
      Some methods might not implement proper transaction isolation when they
      should. An example is object history (undoLog), which can see data
      committed by future transactions.

    Storage
    - Use HailDB instead of a stand-alone MySQL server.
    - Notify master when storage becomes available for clients (LATENCY)
      Currently, storage presence is broadcast to client nodes too early, as
      the storage node will refuse them until it has fully up-to-date data (not
      only up-to-date cells, but also a partition table and node states).
    - Create a specialized PartitionTable that knows the database and replicator,
      in order to remove duplicated code and move logic out of handlers (CODE)
    - Consider inserting multiple objects at a time in the database, taking care
      of the maximum allowed SQL request size. (SPEED)
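      A sketch with a DB-API cursor; the batch is bounded by row count here
      for simplicity, while a real implementation would track the encoded
      request size (e.g. against MySQL's max_allowed_packet). Table and
      column names are only illustrative:

        MAX_ROWS_PER_QUERY = 100  # arbitrary; should derive from size limit

        def store_object_list(cursor, object_list):
            # object_list: [(oid, serial, data), ...]
            query = "INSERT INTO obj (oid, serial, value) VALUES (%s, %s, %s)"
            for i in range(0, len(object_list), MAX_ROWS_PER_QUERY):
                cursor.executemany(query,
                                   object_list[i:i + MAX_ROWS_PER_QUERY])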
    - Prevent SQL injection: escape() from the MySQLdb API is not sufficient,
      consider using query(request, args) instead of query(request % args)
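      The difference, with a MySQLdb-style cursor (table and column names are
      only illustrative):

        def load_object(cursor, oid, serial):
            # Unsafe: Python interpolates values straight into the SQL text.
            #   cursor.execute("SELECT value FROM obj WHERE oid=%s AND serial=%s"
            #                  % (oid, serial))
            # Safe: the driver escapes the parameters itself.
            cursor.execute("SELECT value FROM obj WHERE oid=%s AND serial=%s",
                           (oid, serial))
            return cursor.fetchone()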
    - Make the listening address and port optional, and if they are not provided
      listen on all interfaces on any available port.
    - Replication throttling (HIGH AVAILABILITY)
      In its current implementation, replication runs at full speed, which
      degrades performance for client nodes. Replication should allow
      throttling, and that throttling should be configurable.
      See "Replication pipelining".
    - Pack segmentation & throttling (HIGH AVAILABILITY)
      In its current implementation, pack runs in one call on all storage nodes
      at the same time, which locks down the whole cluster. This task should
      be split in chunks and processed in "background" on storage nodes.
      Packing throttling should probably be at the lowest possible priority
      (below interactive use and below replication).
    - Replication pipelining (SPEED)
      Replication currently works with too many exchanges between the
      replicating storage and the reference storage, and network latency can
      become a significant limit.
      This should be changed to have just one initial request from
      replicating storage, and multiple packets from reference storage with
      database range checksums. When receiving these checksums, replicating
      storage must compare with what it has, and ask row lists (might not even
      be required) and data when there are differences. Quick fetching from
      network with asynchronous checking (=queueing) + congestion control
      (asking the reference storage to pause its packet flow) will probably be
      required.
      This should make it easier to throttle replication workload on reference
      storage node, as it can decide to postpone replication-related packets on
      its own.
    - Partial replication (SPEED)
      In its current implementation, replication always happens on a whole
      partition. In typical use, only the last few transactions will have been
      missed, so replicating only past a given TID would be much faster.
      To achieve this, storage nodes must store 2 values:
      - a pack identifier, which must be different each time a pack occurs
        (increasing number sequence, TID-ish, etc) to trigger a
        whole-partition replication when a pack happened (this could be
        improved too, later)
      - the latest (-ish) transaction committed locally, to use as a lower
        replication boundary
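      Roughly, each storage would persist something like this per partition
      (illustrative names):

        class PartitionSyncState(object):
            # Per-partition metadata allowing partial replication.
            def __init__(self, pack_id=0, last_tid=None):
                self.pack_id = pack_id    # changes on every pack
                self.last_tid = last_tid  # latest transaction committed here

            def replicationLowerBound(self, reference_pack_id):
                # If the reference node packed since our last sync, fall
                # back to whole-partition replication (None = from the
                # beginning); otherwise only transactions after last_tid
                # need to be fetched.
                if reference_pack_id != self.pack_id:
                    return None
                return self.last_tid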
    - tpc_finish failure propagation to master (FUNCTIONALITY)
      When asked to lock transaction data, if something goes wrong the master
      node must be informed.
    - Verify data checksum on reception (FUNCTIONALITY)
      In the current implementation, the client generates a checksum before
      storing, which is only checked upon load. This doesn't prevent storing
      altered data, which misses the point of having a checksum, and creates
      weird decisions (ex: if checksum verification fails on load, what should
      be done ? hope to find a storage with a valid checksum ? assume that data
      is correct in storage but was altered when it travelled through the
      network as we loaded it ?).
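      The receive-side check would be as simple as the following; zlib.crc32
      is used purely for illustration, whatever checksum the protocol actually
      defines applies:

        import zlib

        def check_before_store(data, checksum_from_client):
            # Refuse to store data whose checksum does not match the one
            # computed by the client, so that corruption is detected before
            # it is persisted.
            if zlib.crc32(data) & 0xffffffff != checksum_from_client:
                raise ValueError('checksum mismatch, data altered in transit')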

    Master
    - Master node data redundancy (HIGH AVAILABILITY)
      Secondary master nodes should replicate primary master data (ie, primary
      master should inform them of such changes).
      This data takes too long to extract from storage nodes, and losing it
      increases the risk of starting from underestimated values.
      This risk is (currently) unavoidable when all nodes stop running, but this
      case must be avoided.
    - Differential partition table updates (BANDWIDTH)
      When a storage asks for the current partition table (when it connects to a
      cluster in service state), it must update its knowledge of the partition
      table. Currently this is done by fetching the entire table. If the master
      kept a history of the last few changes to the partition table, it would be
      able to send only a differential update (via the incremental update
      mechanism).
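      A sketch of the history the master could keep (hypothetical structure,
      assuming PTIDs increase by 1 on every change):

        from collections import deque

        class PartitionTableHistory(object):
            # Remember the last few partition table changes so that a
            # reconnecting storage can get a differential update instead of
            # the whole table.
            def __init__(self, depth=100):
                self._changes = deque(maxlen=depth)  # (ptid, cell_changes)

            def log(self, ptid, cell_changes):
                self._changes.append((ptid, cell_changes))

            def getChangesSince(self, ptid):
                # Return the changes newer than `ptid`, or None if the
                # history was already truncated (the caller must then send
                # the full table).
                if not self._changes or self._changes[0][0] > ptid + 1:
                    return None
                return [change for p, change in self._changes if p > ptid]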
    - During recovery phase, store multiple partition tables (ADMINISTRATION)
      When storage nodes know different versions of the partition table, the
      master should be able to present them to the admin, to let him choose one
      when moving on to the next phase.
    - Optimize operational status check by recording which rows are ready
      instead of parsing the whole partition table. (SPEED)
    - Improve partition table tweaking algorithm to reduce differences between
      frequently and rarely used nodes (SCALABILITY)
    - tpc_finish failure propagation to client (FUNCTIONALITY)
      When a storage node notifies a problem during lock/unlock phase, an error
      must be propagated to client.

    Client
    - Implement C version of mq.py (LOAD LATENCY)
    - Use generic bootstrap module (CODE)
    - Find a way to make ask() work from the poll thread, to allow sending the
      initial packet (requestNodeIdentification) from the connectionCompleted()
      event instead of from app. This requires knowing which thread will wait
      for the answer.
    - Discuss dead storage notification. If a client fails to connect to a
      storage node supposed to be in running state, it should notify the master
      so that it can check whether this node is really up or not.
    - Implement restore() ZODB API method to bypass consistency checks during
      imports.
    - tpc_finish failures (FUNCTIONALITY)
      New failure cases during tpc_finish must be handled.
    - Fix and re-enable deadlock avoidance (SPEED). This is required for
      neo.threaded.test.Test.testDeadlockAvoidance

    Admin
    - Make admin node able to monitor multiple clusters simultaneously
    - Send notifications (ie: mail) when a storage node is lost

    Tests
    - Use another mock library that is eggified and maintained.
      See http://garybernhardt.github.com/python-mock-comparison/
      for a comparison of available mocking libraries/frameworks.
    - Fix epoll descriptor leak.

  Later
    - Consider auto-generating the cluster name upon initial startup (it might
      actually be a partition property).
    - Consider ways to centralise the configuration file, or make the
      configuration updatable automatically on all nodes.
    - Consider storing some metadata on master nodes (partition table [version],
      ...). This data should be treated non-authoritatively, as a way to lower
      the probability of using an outdated partition table.
    - Decentralize primary master tasks as much as possible (consider
      distributed lock mechanisms, ...)
    - Choose how to compute the storage size
    - Make storage check that the OID matches one of its partitions during a
      store
    - Investigate delta compression for stored data
      The idea would be to have the few most recent revisions stored fully, and
      older revisions delta-compressed, in order to save space.