In wait(), when there's no asyncore main loop, we called
asyncore.poll() with a timeout of 10 seconds. Change this to a
variable timeout starting at 1 msec and doubling until 1 second.

While debugging Win2k crashes in the check4ExtStorageThread test from
ZODB/tests/MTStorage.py, Tim noticed that there were frequent 10
second gaps in the log file where *nothing* happens. These were caused
by the following scenario.

Suppose a ZEO client process has two threads using the same connection
to the ZEO server, and there's no asyncore loop active. T1 makes a
synchronous call, and enters the wait() function. Then T2 makes
another synchronous call, and enters the wait() function. At this
point, both are blocked in the select() call in asyncore.poll(), with
a timeout of 10 seconds (in the old version). Now the replies for both
calls arrive. Say T1 wakes up. The handle_read() method in smac.py
calls self.recv(8096), so it gets both replies in its buffer, decodes
both, and calls self.message_input() for both, which sticks both
replies in the self.replies dict. Now T1 finds its response, its
wait() call returns with it. But T2 is still stuck in asyncore.poll():
its select() call never woke up, and has to "sit out" the whole
timeout of 10 seconds. (Good thing I added timeouts to everything! Or
perhaps not, since it masked the problem.)

One other condition must be satisfied before this becomes a disaster:
T2 must have started a transaction, and all other threads must be
waiting to start another transaction. This is what I saw in the log.
(Hmm, maybe a message should be logged when a thread is waiting to
start a transaction this way.)

In a real Zope application, this won't happen, because there's a
centralized asyncore loop in a separate thread (probably the client's
main thread) and the various threads would be waiting on the condition
variable; whenever a reply is inserted in the replies dict, all
threads are notified. But in the test suite there's no asyncore loop,
and I don't feel like adding one. So the exponential backoff seems the
easiest "solution".
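
For illustration only, here is a minimal sketch of the backoff schedule
described above (1 msec, doubling, capped at 1 second). It is not the
actual ZEO code: backoff_delays() and wait_for_reply() are hypothetical
names, and the poll argument stands in for asyncore.poll(timeout).

```python
def backoff_delays(start=0.001, limit=1.0):
    """Yield poll timeouts: start, 2*start, 4*start, ... capped at limit."""
    delay = start
    while True:
        yield delay
        delay = min(delay * 2, limit)


def wait_for_reply(replies, msgid, poll, start=0.001, limit=1.0):
    """Return the reply for msgid, polling with growing timeouts.

    replies is a dict that some other caller of poll() may fill in;
    poll(timeout) is an assumed stand-in for asyncore.poll(timeout).
    """
    for delay in backoff_delays(start, limit):
        # Check before every poll: another thread may already have
        # read and stored our reply while we were blocked.
        if msgid in replies:
            return replies.pop(msgid)
        poll(delay)
```

With this schedule, a thread whose reply was consumed by another
thread's poll re-checks the replies dict after at most the current
timeout, so a fresh wait notices within milliseconds and even a long
wait is never stuck for more than 1 second instead of the old 10.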