fsrefs: Optimize IO (take 2) (#340)
* fsrefs: Optimize IO (take 2) Access objects in the order of their position in file instead of in the order of their OID. This should give dramatical speedup when data are on HDD. For example @perrinjerome reports that on a 73Go database it takes almost 8h to run fsrefs (where on the same database, fstest takes 15 minutes) [1,2]. After the patch fsrefs took ~80 minutes to run on the same database. In other words this is ~ 6x improvement. Fsrefs has no tests. I tested it only lightly via generating a bit corrupt database with deleted referred object(*), and it gives the same output as unmodified fsrefs. oid 0x0 __main__.Object last updated: 1979-01-03 21:00:42.900001, tid=0x285cbacb70a3db3 refers to invalid objects: oid 0x07 missing: '<unknown>' oid 0x07 object creation was undone: '<unknown>' This "take 2" version is derived from https://github.com/zopefoundation/ZODB/pull/338 and only iterates objects in the order of their in-file position without building complete references graph in-RAM, because that in-RAM graph would consume ~12GB of memory. Added pos2oid in-RAM index also consumes memory: for the 73GB database in question fs._index takes ~700MB, while pos2oid takes ~2GB. In theory it could be less, because we need only array of oid sorted by key(oid)=fs._index[oid]. However array.array does not support sorting, and if we use plain list to keep just []oid, the memory consumption just for that list is ~5GB. Also because list.sort(key=...) internally allocates memory for key array (and list.sort(cmp=...) was removed from Python3), total memory consumption just to produce list of []oid ordered by pos is ~10GB. So without delving into C/Cython and/or manually sorting the array in Python (= slow), using QQBTree seems to be the best out-of-the-box option for oid-by-pos index. [1] nexedi/zodbtools!19 (comment 129480) [2] nexedi/zodbtools!19 (comment 129551) (*) test database generated via a bit modified gen_testdata.py from zodbtools: https://lab.nexedi.com/nexedi/zodbtools/blob/v0.0.0.dev8-28-g129afa6/zodbtools/test/gen_testdata.py + ```diff --- a/zodbtools/test/gen_testdata.py +++ b/zodbtools/test/gen_testdata.py @@ -229,7 +229,7 @@ def ext(subj): return {} # delete an object name = random.choice(list(root.keys())) obj = root[name] - root[name] = Object("%s%i*" % (name, i)) +# root[name] = Object("%s%i*" % (name, i)) # NOTE user/ext are kept empty on purpose - to also test this case commit(u"", u"predelete %s" % unpack64(obj._p_oid), {}) ``` /cc @tim-one, @jeremyhylton, @jamadden /reviewed-by @jamadden, @perrinjerome /reviewed-on https://github.com/zopefoundation/ZODB/pull/340
Showing
Please register or sign in to comment