Marko Mäkelä authored
sync_array_print_long_waits(): Return the longest waiting thread ID and the longest waited-for lock. Only if those remain unchanged between calls in srv_error_monitor_thread(), increment fatal_cnt. Otherwise, reset fatal_cnt.

Background: There is a built-in watchdog in InnoDB whose purpose is to kill the server when some thread is stuck waiting for a mutex or rw-lock. Before this fix, the logic was flawed.

The function sync_array_print_long_waits() returns TRUE if it finds a lock wait that exceeds 10 minutes (srv_fatal_semaphore_wait_threshold). The function srv_error_monitor_thread() will kill the server if this happens 10 times in a row (fatal_cnt reaches 10), checked every 30 seconds. This is wrong, because this situation does not mean that the server is hung. If the server is very busy for a little over 15 minutes, it will be killed.

Consider this example. Thread T1 is waiting for mutex M. Some time later, threads T2..Tn start waiting for the same mutex M. If T1 keeps waiting for 600 seconds, fatal_cnt will be incremented to 1. So far, so good. Now, if M is granted to T1, the server was obviously not stuck. But T2..Tn keep waiting, and their wait times will in turn exceed 600 seconds. If, 5 minutes later, some Tn has still been waiting for more than 10 minutes for the mutex M, the server can be killed, even though it is not stuck.

rb:622 approved by Jimmy Yang
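
The corrected check described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: watchdog_t, watchdog_tick(), thread_id_t and FATAL_CNT_LIMIT are hypothetical names, and the real sync_array_print_long_waits() and srv_error_monitor_thread() are not reproduced here. The sketch assumes the scan result (whether a wait exceeded the threshold, plus the longest waiter and its lock) is passed in by the caller once per monitor interval.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real InnoDB types or signatures. */
typedef unsigned long thread_id_t;

#define FATAL_CNT_LIMIT 10 /* consecutive matching reports before giving up */

typedef struct {
    thread_id_t waiter;    /* longest-waiting thread seen on the previous tick */
    const void *sema;      /* lock that thread was waiting for */
    unsigned    fatal_cnt; /* consecutive ticks with the same waiter and lock */
} watchdog_t;

/*
 * One monitor tick. `fatal` says whether the scan found a wait longer than
 * the threshold; `waiter` and `sema` identify the longest wait it found.
 * Returns true when the same thread has been reported stuck on the same
 * lock often enough that the server should be considered hung.
 */
bool watchdog_tick(watchdog_t *w, bool fatal,
                   thread_id_t waiter, const void *sema)
{
    if (fatal && waiter == w->waiter && sema == w->sema) {
        /* Same thread, same lock as last time: still stuck. */
        w->fatal_cnt++;
    } else {
        /* No long wait, or a different waiter or lock: progress was made. */
        w->fatal_cnt = 0;
        w->waiter = waiter;
        w->sema = sema;
    }
    return w->fatal_cnt >= FATAL_CNT_LIMIT;
}
```

In the T1/T2..Tn example, the longest waiter changes from T1 to some Tn once M is granted to T1, so the counter in this sketch restarts from zero instead of continuing to climb.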
ddec6ecd