bigfile/py: Don't forget to clear exception state after retrieving pybuf referrers
A buffer object (pybuf) is passed by C-level loadblk to python loadblk implementation. Since pybuf points to memory that will go away after loadblk call returns to virtmem, PyBigFile tries hard to make sure nothing stays referencing pybuf so it can be released. It tries to: 1. automatically GC cycles referencing pybuf (9aa6a5d7 "bigfile/py: Teach loadblk() to automatically break reference cycles to pybuf") 2. replace pybuf with stub object if a calling frame referencing it still stays alive (61b18a40 "bigfile/py/loadblk: Replace pybuf with a stub object in calling frame in case it stays alive") 3. and as a last resort unpins pybuf from original buffer memeory to point it to NULL (024c246c "bigfile/py/loadblk: Resort to pybuf unpinning, if nothing helps") Step #1 invokes GC. Step #2 calls gc.get_referrers(pybuf) and looks for frames in there. The gc.get_referrers() call happens at python level with allocating some objects, e.g. tuple to pass arguments, resulting list etc. And we all know that any object allocation might cause automatic garbage collection, and GC'ing can in turn ran arbitrary code due to __del__ in release objects and weakrefs callbacks. At a first glance the scenario that GC will be triggered at step #2 looks unrealistic because the GC was just run at step #1 and it is only a few objects being allocated for the call at step #2. However if arbitrary code runs from under GC it can create new garbage and thus upon returning from gc.collect() the garbage list is not empty as the following program demonstrates: ---- 8< ---- import gc # just an object we can set attributes on class X: pass # call f on __del__ class DelCall: def __init__(self, f): self.f = f def __del__(self): self.f() # _mkgarbage creates n objects of garbage kept referenced from an object cycle # so that only cyclic GC can free them. def _mkgarbage(n): # cycle a, b = X(), X() a.b, b.a = b, a # cycle references [n] garbage a.objv = [X() for _ in range(n)] return a # mkgarbage creates cycled garbage and arranges for twice more garbage to be # created when original garbage is collected def mkgarbage(n): a = _mkgarbage(n) a.ondel = DelCall(lambda : _mkgarbage(2*n)) def main(): for i in xrange(10): mkgarbage(1000) print '> %s' % (gc.get_count(),) n = gc.collect() print '< %s' % (gc.get_count(),) main() ---- 8< ---- kirr@deco:~/tmp/trashme/t$ ./gcmoregarbage.py > (482, 11, 0) < (1581, 0, 0) > (531, 3, 0) < (2070, 0, 0) > (531, 3, 0) < (2070, 0, 0) > (531, 3, 0) < (2070, 0, 0) > (531, 3, 0) < (2070, 0, 0) > (531, 3, 0) < (2070, 0, 0) > (531, 3, 0) < (2070, 0, 0) > (531, 3, 0) < (2070, 0, 0) > (531, 3, 0) < (2070, 0, 0) > (531, 3, 0) < (2070, 0, 0) here lines starting with "<" show amount of live garbage objects after gc.collect() call has been finished. This way on a busy server there could be arrangements when GC is explicitly ran at step #1 and then automatically run at step #2 (because of gc.get_referrers() python-level call) and from under GC #2 arbitrary code runs thus potentially mutating exception state which shows in logs as bigfile/_bigfile.c:685: pybigfile_loadblk: Assertion `!(ts->exc_type || ts->exc_value || ts->exc_traceback)' failed. ---- So don't assume we end with clean exception state after collecting pybuf referrers and just clear exception state once again as we do after explicit GC. Don't make a similar assumption for buffer unpinning as an object is decrefed there and in theory this can run some code. A test is added to automatically exercise exception state clearing for get_referrers code path via approach similar to demonstrated in above - - we generate more garbage from under garbage and also arrange for finalizers, which mutate exceptions state, to be run at GC #2. The test without the fix applied fails like this: bigfile/_bigfile.c:710 pybigfile_loadblk WARN: python thread-state found with handled but not cleared exception state bigfile/_bigfile.c:711 pybigfile_loadblk WARN: I will dump it and then crash ts->exc_type: None ts->exc_value: <nil> ts->exc_traceback: <nil> Segmentation fault (core dumped) The None in ts->exc_type and nil value and traceback are probably coming from here in cpython runtime: https://github.com/python/cpython/blob/883520a8/Python/ceval.c#L3717 Since this took some time to find, more diagnostics is also added before BUG_ONs corresponding to finding unclean exception state.
1 parent 3804cc39