- 18 Dec, 2019 1 commit
-
-
Kirill Smelkov authored
It had long been marked as "XXX move to common place".
-
- 12 Jul, 2019 2 commits
-
-
Kirill Smelkov authored
For tests this makes sure that if one test fails, it won't make the following tests fail just because the next test fails trying to lock the test database. For regular code (demo_zbigarray.py) this is also a good thing to do - to always close the database regardless of whether an exception was raised before the program reached the end of main. Pygolang becomes a regular - not test-only - dependency. Being a regular dependency is currently required only by demo_zbigarray.py, but it will also be used in the upcoming wcfs, so adding pygolang to wendelin.core dependencies aligns with the plan. dbclose now uses defer almost everywhere - there are still a few places in tests where one test function opens/closes the test database multiple times - those were not (yet ?) converted.
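A minimal sketch of the close-on-any-exit pattern described above. It uses the standard library's `contextlib.ExitStack` as a stand-in for pygolang's `defer`; `FakeDB` and `run_test` are hypothetical names for illustration, not wendelin.core API.

```python
from contextlib import ExitStack


class FakeDB:
    """Hypothetical stand-in for a database handle that holds a lock."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run_test(db, fail):
    # ExitStack plays the role of defer here: the callback runs when the
    # block exits, whether normally or via an exception.
    with ExitStack() as deferred:
        deferred.callback(db.close)
        if fail:
            raise RuntimeError("test failed")
        return "ok"


db_ok, db_bad = FakeDB(), FakeDB()
result = run_test(db_ok, fail=False)
try:
    run_test(db_bad, fail=True)
except RuntimeError:
    pass  # the database was still closed on the way out
```

Either way the handle ends up closed, so a following test would not block on the database lock.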
-
Kirill Smelkov authored
Instead of

    raises(Exception, 'code')

do

    with raises(Exception):
        code

This removes lots of warnings, similar to the example below:

    bigfile/tests/test_basic.py::test_basic
      /home/kirr/src/wendelin/wendelin.core/bigfile/tests/test_basic.py:79: PytestDeprecationWarning: raises(..., 'code(as_a_string)') is deprecated, use the context manager form or use `exec()` directly

      See https://docs.pytest.org/en/latest/deprecations.html#raises-warns-exec
        raises(ROAttributeError, "f.blksize = 1") # RO attribute
-
- 12 Oct, 2018 1 commit
-
-
Kirill Smelkov authored
RAMArray is compatible with ZBigArray in API and semantics, but stores its data in RAM only. It is useful in situations where a ZBigArray-compatible data type is needed, but the amount of data is small and the data itself is needed only temporarily - e.g. in a simulation. The implementation is based on mmapping temporary files from /dev/shm/... and passing them as file handles to BigArray, similarly to how ZBigArray works. We don't use just numpy.ndarray because of append - for ZBigArray append works in O(1), but more importantly it does not copy data. This way mmappings previously created for ZBigArray views continue to correctly alias array data. If we used ndarray directly, that property would not be preserved, since ndarray.resize copies data. Original patch by Klaus Wölfel <klaus@nexedi.com> (nexedi/wendelin.core!8)
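The aliasing property mentioned above can be illustrated with plain `mmap`. This is only a sketch under assumptions: it uses `tempfile.TemporaryFile` rather than files under /dev/shm, assumes a POSIX system, and does not use any wendelin.core code.

```python
import mmap
import os
import tempfile

PAGE = mmap.ALLOCATIONGRANULARITY

with tempfile.TemporaryFile() as f:
    fd = f.fileno()
    os.ftruncate(fd, PAGE)                 # file initially holds one page
    view_old = mmap.mmap(fd, PAGE)         # mapping created before the "append"
    view_old[0:5] = b"hello"

    os.ftruncate(fd, 2 * PAGE)             # grow the file in place: no data copy
    view_new = mmap.mmap(fd, 2 * PAGE)     # mapping over the grown file

    view_new[0:5] = b"HELLO"               # write through the new mapping
    aliased = bytes(view_old[0:5])         # the old mapping sees the same bytes
    tail = bytes(view_new[PAGE:PAGE + 4])  # newly added region of the file

    view_old.close()
    view_new.close()
```

Because both mappings are shared views of one file, growing the file does not invalidate or detach the earlier view.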
-
- 11 Oct, 2018 1 commit
-
-
Kirill Smelkov authored
bigarray/tests: Factor out a way to specify on which BigFile/BigFileH an array is tested into fixture parameter

Currently we have only one BigFile and its BigFileH handle. However in the next patch, for RAMArray, we'll be adding handles for opened RAM files, and it would be good to test the whole BigArray functionality on data served by those handles too. Prepare for this and first factor out into the testbig fixture the way to open such handles.
-
- 02 Apr, 2018 2 commits
-
-
Kirill Smelkov authored
Rationale
---------

Array reference could be useful in situations where one needs to pass arrays between processes and, instead of copying array data, leverage the fact that the top-level array, for example ZBigArray, is already persisted separately, and only send a small amount of information referencing the data in question.

Implementation
--------------

BigArray is not a regular NumPy array and so needs explicit support in ArrayRef code to find the root object and indices. This patch adds such support in the following way:

- when BigArray.__getitem__ creates a VMA, it remembers in the VMA the top-level BigArray object under which this VMA was created.
- when ArrayRef is finding the root, it can detect such VMAs, because the VMA will be pointed to by the topmost regular ndarray's .base, and in turn gets the top-level BigArray object from the VMA.
- further, all index computations are performed, similarly to the complete regular ndarrays case, on ndarrays root and a. But in the end .lo and .hi are adjusted for the corresponding offset of where root is inside the whole BigArray.
- there is no need to adjust .deref() at all.

For remembering information in a VMA, and also to be able to get (readonly) its mapping addresses, the _bigfile.c extension has to be extended a bit. Since we are now storing an arbitrary python object attached to PyVMA - it can create cycles - PyVMA is accordingly adjusted to support the cyclic garbage collector. Please see the patch itself for more details and comments.
-
Kirill Smelkov authored
ArrayRef is a tool to find, for a NumPy array, its top-level root parent and to remember instructions for how to recreate the original array from the root. For example, if

    root = arange(1E7)
    z = root[1000:2000]
    a = z[10:20]

`ArrayRef(a)` will find out that the root array for `a` is `root` and that `a` occupies elements 1010:1020 in it. The reverse operation is also possible: for example, given

    aref = ArrayRef(a)

it is possible to restore the original `a` from `aref`:

    a_ = aref.deref()
    assert array_equal(a_, a)

The restoration works without copying, by creating an appropriate view of root. ArrayRef should work reliably for arrays of arbitrary dimensions, strides etc - even fancy arrays created via stride tricks, such as arrays whose elements overlap each other, should be supported. This patch adds ArrayRef with support for regular ndarrays only. The next patch will add ArrayRef support for BigArray and a description of the ArrayRef rationale.
-
- 24 Oct, 2017 1 commit
-
-
Kirill Smelkov authored
Relicense to GPLv3+ with wide exception for all Free Software / Open Source projects + Business options. Nexedi stack is licensed under Free Software licenses with various exceptions that cover three business cases: - Free Software - Proprietary Software - Rebranding As long as one intends to develop Free Software based on Nexedi stack, no license cost is involved. Developing proprietary software based on Nexedi stack may require a proprietary exception license. Rebranding Nexedi stack is prohibited unless rebranding license is acquired. Through this licensing approach, Nexedi expects to encourage Free Software development without restrictions and at the same time create a framework for proprietary software to contribute to the long term sustainability of the Nexedi stack. Please see https://www.nexedi.com/licensing for details, rationale and options.
-
- 16 Mar, 2017 1 commit
-
-
Kirill Smelkov authored
Remove debug print leftover from added test in e44bd761.
-
- 10 Mar, 2017 1 commit
-
-
Kirill Smelkov authored
From time to time people keep trying to use wendelin.core with dtype=object arrays and get segfaults without anything in logs or anywhere else. Wendelin.core does not support it, because in the case of dtype=object the elements are really pointers, and the data for each object is stored in a separate place in RAM with a different per-object size. As we are memory-mapping arrays this won't work. It essentially does not work for numpy.memmap either, for the same reason:

    (z4+numpy) kirr@mini:~/src/wendelin$ dd if=/dev/zero of=zero.dat bs=128 count=1
    1+0 records in
    1+0 records out
    128 bytes copied, 0.000209873 s, 610 kB/s
    (z4+numpy) kirr@mini:~/src/wendelin$ dd if=/dev/urandom of=random.dat bs=128 count=1
    1+0 records in
    1+0 records out
    128 bytes copied, 0.000225726 s, 567 kB/s
    (z4+numpy) kirr@mini:~/src/wendelin$ ipython
    ...
    In [1]: import numpy as np
    In [2]: np.memmap('zero.dat', dtype=np.object)
    Out[2]:
    memmap([None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None], dtype=object)
    In [3]: np.memmap('random.dat', dtype=np.object)
    Out[3]: Segmentation fault

So let's clarify this to users by explicitly raising an exception when someone tries to create a BigArray with an inappropriate dtype, with a descriptive explanation also logged.

/reviewed-on nexedi/wendelin.core!4
-
- 18 Dec, 2015 1 commit
-
-
Kirill Smelkov authored
Commit ab9ca2df (bigarray: Add support for FORTRAN ordering) added the ability to define array order, but there I made the mistake of not caring about how arrays previously saved to DB would be read back. The thing is, BigArray gained a new data member ._order which is automatically saved to DB thanks to ZBigArray inheriting from Persistent; on the load-from-db path we just read the object state from DB, which for ZBigArray is a dict, and restore object attributes from it. But for previously-saved data there is, obviously, no 'order' entry, and thus objects restored this way do not fully match what the current code expects, and it can boom e.g. this way:

    zarray.resize((new_one,old_shape[1]))
    Module wendelin.bigarray, line 190, in resize
      self._init0(new_shape, self.dtype, order=self._order)
    AttributeError: 'ZBigArray' object has no attribute '_order'

The fix is: on the restore-from-DB path, see if a data member is not present on the restored object, and if it has a default value in BigArray, set it to that.

( code to get function defaults is from http://stackoverflow.com/questions/12627118/get-a-function-arguments-default-value )

/cc @Tyagov, @klaus
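The restore-time fix can be sketched as follows. This is an illustration only: the `Array` class is hypothetical, the mapping from parameter `order` to attribute `_order` is an assumption made for the sketch, and `inspect.signature` is used here instead of whatever the linked Stack Overflow answer suggests.

```python
import inspect


class Array:
    """Hypothetical stand-in for a persistent array class."""

    def __init__(self, shape, dtype, order="C"):
        self._init0(shape, dtype, order)

    def _init0(self, shape, dtype, order="C"):
        self._shape = shape
        self._dtype = dtype
        self._order = order

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Fill in members that older saved states lack, using the defaults
        # declared on _init0 (e.g. order="C").
        params = inspect.signature(self._init0).parameters
        for name, p in params.items():
            if p.default is not inspect.Parameter.empty:
                self.__dict__.setdefault("_" + name, p.default)


old_state = {"_shape": (10, 3), "_dtype": "uint8"}  # saved before _order existed
restored = Array.__new__(Array)
restored.__setstate__(old_state)

new_state = {"_shape": (2,), "_dtype": "uint8", "_order": "F"}
restored_new = Array.__new__(Array)
restored_new.__setstate__(new_state)  # an explicitly saved value is kept
```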
-
- 15 Dec, 2015 1 commit
-
-
Kirill Smelkov authored
NotifyChannel was introduced in c7c01ce4 (bigfile/zodb: ZODB.Connection can migrate between threads on close/open and we have to care) to test thread interaction specific to ZODB. We'll however need NotifyChannel to do more threading tests of the virtmem core, and so the proper place for NotifyChannel is test_thread.py itself. Move it.
-
- 02 Nov, 2015 1 commit
-
-
Kirill Smelkov authored
Until now BigArrays could use only C-style ordering - where major index is the first one. Fortran ordering is the opposite - where major index is the last one - and is used in Fortran world and sometimes by scientists in other areas. As people keep on asking for Fortran-ordered BigArrays, let's add support for it. The essential code change is to change 0'th index to major index in __getitem__ and rest of the code. For the reference: ndarray memory layout for different orders is described here: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray
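To make the difference between the two orders concrete, here is a small sketch that computes byte strides for a contiguous array in each order. `strides_for` and `major_axis` are hypothetical helper names, not part of BigArray.

```python
def strides_for(shape, itemsize, order="C"):
    """Return byte strides for a contiguous array of the given shape."""
    strides = [0] * len(shape)
    step = itemsize
    # C order: the last index varies fastest; Fortran order: the first does.
    if order == "C":
        dims = range(len(shape) - 1, -1, -1)
    else:
        dims = range(len(shape))
    for d in dims:
        strides[d] = step
        step *= shape[d]
    return tuple(strides)


def major_axis(shape, order="C"):
    """Index of the slowest-varying (major) dimension."""
    return 0 if order == "C" else len(shape) - 1


c_strides = strides_for((3, 4), 8, "C")
f_strides = strides_for((3, 4), 8, "F")
```

The major index is the one with the largest stride, which is why code that previously hard-coded index 0 has to pick the major index instead.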
-
- 21 Sep, 2015 2 commits
-
-
Kirill Smelkov authored
When we serve an indexing request, we first compute the page range in the backing file which contains the result, based on the major index range, then mmap that file range and pick up the result from there. The page range math was however not correct: e.g. for positive strides, the last element's last byte is (byte0_stop-1), NOT (byte0_stop - byte0_stride), which for cases where byte0_stop is just a bit after a page boundary can make a difference - page_max will be 1 page less than what it should be, and then the whole ndarray view creation breaks:

    ...
    Module wendelin.bigarray, line 381, in __getitem__
      view0 = ndarray(view0_shape, self._dtype, vma0, view0_offset, view0_stridev)
    ValueError: strides is incompatible with shape of requested array and size of buffer

( because vma0 was created smaller in size than what is needed to create a view0_shape shaped array starting from view0_offset in vma0. )

Similar story for the negative strides math - it was not correct either. Fix it.

/reported-by @Camata
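A worked example of the corrected arithmetic for the positive-stride case. `page_range` is a hypothetical helper written for this illustration; the page size, stride and offsets are made-up numbers.

```python
def page_range(byte_start, byte_stop, pagesize):
    """Pages covering the half-open byte range [byte_start, byte_stop)."""
    assert byte_start < byte_stop
    page_min = byte_start // pagesize
    page_max = (byte_stop - 1) // pagesize  # the last byte is byte_stop - 1
    return page_min, page_max


PS = 4096
stride = 8
byte_stop = PS + 4  # the last element straddles the first page boundary

correct = page_range(0, byte_stop, PS)
# The earlier formula looked at where the last element *starts*,
# and so missed the page holding its trailing bytes.
wrong_page_max = (byte_stop - stride) // PS
```

Here the last element begins on page 0 but ends on page 1, so a mapping sized from `wrong_page_max` is one page too short.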
-
Kirill Smelkov authored
We'll need this class in tests in the next patch.
-
- 02 Sep, 2015 1 commit
-
-
Kirill Smelkov authored
bigfile/zodb/tests: Make sure _p_invalidate() in Zblk.loadblk() does not lead to reloading data updated Thanks to ZODB being MVCC this does not happen, but we better test explicitly.
-
- 18 Aug, 2015 3 commits
-
-
Kirill Smelkov authored
e.g. on .shape
-
Kirill Smelkov authored
When there is a conflict (on any object, but on ZBlk in particular) ZODB machinery calls its ._p_invalidate() twice: File ".../wendelin.core/bigfile/tests/test_filezodb.py", line 661, in test_bigfile_filezodb_vs_conflicts tm2.commit() # this should raise ConflictError and stay at 11 state File ".../transaction/_manager.py", line 111, in commit return self.get().commit() File ".../transaction/_transaction.py", line 271, in commit self._commitResources() File ".../transaction/_transaction.py", line 414, in _commitResources self._cleanup(L) File ".../transaction/_transaction.py", line 426, in _cleanup rm.abort(self) File ".../ZODB/Connection.py", line 436, in abort self._abort() File ".../ZODB/Connection.py", line 479, in _abort self._cache.invalidate(oid) File ".../wendelin.core/bigfile/file_zodb.py", line 148, in _p_invalidate traceback.print_stack() and File ".../wendelin.core/bigfile/tests/test_filezodb.py", line 661, in test_bigfile_filezodb_vs_conflicts tm2.commit() # this should raise ConflictError and stay at 11 state File ".../transaction/_manager.py", line 111, in commit return self.get().commit() File ".../transaction/_transaction.py", line 271, in commit self._commitResources() File ".../transaction/_transaction.py", line 416, in _commitResources self._synchronizers.map(lambda s: s.afterCompletion(self)) File ".../transaction/weakset.py", line 59, in map f(elt) File ".../transaction/_transaction.py", line 416, in <lambda> self._synchronizers.map(lambda s: s.afterCompletion(self)) File ".../ZODB/Connection.py", line 831, in _storage_sync self._flush_invalidations() File ".../ZODB/Connection.py", line 539, in _flush_invalidations self._cache.invalidate(invalidated) File ".../wendelin.core/bigfile/file_zodb.py", line 148, in _p_invalidate traceback.print_stack() i.e. 
first invalidation is done by commit cleanup: https://github.com/zopefoundation/transaction/blob/1.4.4/transaction/_transaction.py#L414 https://github.com/zopefoundation/ZODB/blob/3.10/src/ZODB/Connection.py#L479 and then Connection.afterCompletion() flushes invalidation again: https://github.com/zopefoundation/transaction/blob/1.4.4/transaction/_transaction.py#L416 https://github.com/zopefoundation/ZODB/blob/3.10/src/ZODB/Connection.py#L833 https://github.com/zopefoundation/ZODB/blob/3.10/src/ZODB/Connection.py#L539 If there was no conflict - there will be no ConflictError raised and thus no Transaction._cleanup() done in its ._commitResources() -> invalidation called only once. But with ConflictError - it is twice. Adjust ZBlk._p_invalidate() not to delve into real invalidation more than once - else we will fail, as ZBlk._v_zfile becomes unbound after invalidation done the first time.
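The "invalidate at most once" guard can be sketched like this. `Blk` and `File` are hypothetical stand-ins, not the real ZBlk and ZBigFile; only the guard logic is shown.

```python
class File:
    """Hypothetical stand-in that records which blocks were invalidated."""

    def __init__(self):
        self.invalidated = []


class Blk:
    """Hypothetical stand-in for a block object bound to its file."""

    def __init__(self, zfile, blk):
        self._v_zfile = zfile  # volatile binding, dropped on invalidation
        self._v_blk = blk

    def _p_invalidate(self):
        # A second call (as happens on ConflictError) finds the binding
        # already gone and must not try to use it again.
        zfile = getattr(self, "_v_zfile", None)
        if zfile is None:
            return
        zfile.invalidated.append(self._v_blk)
        del self._v_zfile


f = File()
b = Blk(f, 0)
b._p_invalidate()
b._p_invalidate()  # no error, and no duplicate notification
```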
-
Kirill Smelkov authored
All is currently handled correctly, but an observation is made that upon such invalidation we throw away ._v_fileh, i.e. we throw away the whole data cache just because an array was resized.
-
- 17 Aug, 2015 1 commit
-
-
Kirill Smelkov authored
Continuing theme from the previous patch, here is propagation of invalidation messages from ZODB to BigFileH memory. The use-case here is that e.g. one fileh mapping was created in one connection, another in another, and after doing changes in second connection and committing there, the first fileh has to invalidate appropriate already-loaded pages, so its next transaction won't work with stale data. To do it, we hook into ZBlk._p_invalidate() and propagate the invalidation message to ZBigFile which then notifies all opened-through-it ZBigFileH to invalidate a page. ZBlk -> ZBigFile lookup is done without storing backpointer in ZODB - instead, every time ZBigFile touches ZBlk object (and thus potentially does GHOST -> Live transition to it), we (re-)bind it back to ZBigFile. Since ZBigFile is the only class that works with ZBlk objects it is safe to do so. For ZBigFile to notify "all-opened-through-it" ZBigFileH, a weakset is introduced to track them. Otherwise the real page invalidation work is done by virtmem (see previous patch).
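Tracking "all handles opened through a file" without keeping them alive can be sketched with the standard library's `weakref.WeakSet`. `File` and `Handle` below are hypothetical stand-ins, not ZBigFile and ZBigFileH.

```python
import gc
import weakref


class Handle:
    """Hypothetical file handle that records invalidated pages."""

    def __init__(self):
        self.invalidated_pages = []

    def invalidate_page(self, pgoffset):
        self.invalidated_pages.append(pgoffset)


class File:
    def __init__(self):
        # Weak references: tracking a handle must not keep it alive.
        self._handles = weakref.WeakSet()

    def open(self):
        h = Handle()
        self._handles.add(h)
        return h

    def invalidate_blk(self, blk):
        # Copy to a list so the set cannot change size while iterating.
        for h in list(self._handles):
            h.invalidate_page(blk)


f = File()
h1, h2 = f.open(), f.open()
f.invalidate_blk(7)
del h2
gc.collect()
live = len(f._handles)  # only h1 is still tracked
```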
-
- 12 Aug, 2015 2 commits
-
-
Kirill Smelkov authored
Intro ----- ZODB maintains a pool of opened-to-DB connections. For each request Zope opens 1 connection and, after request handling is done, returns the connection back to the ZODB pool (via Connection.close()). The same connection will be opened again for handling some future next request at some future time. This next open can happen in a different-from-first request worker thread. TransactionManager (as accessed by transaction.{get,commit,abort,...}) is thread-local, that is e.g. transaction.get() returns different transactions for threads T1 and T2. When _ZBigFileH hooks into txn_manager to get a chance to run its .beforeCompletion() when transaction.commit() is run, it hooks into the _current_ _thread_ transaction manager. Without unhooking on connection close, and in circumstances where the connection migrates to a different thread, this can lead to desynchronization between ZBigFileH managing fileh pages and Connection with ZODB objects. And even to data corruption, e.g. T1 T2 open zarray[0] = 11 commit close open # opens connection as closed in T1 open zarray[0] = 21 commit abort close close Here zarray[0]=21 _will_ be committed by T1 as part of T1 transaction - because when T1 does commit, .beforeCompletion() for zarray is invoked, sees there is dirty data and propagates changes to zodb objects in connection for T2, joins connection for T2 into txn for T1, and then txn for T1 when doing two-phase-commit stores modified objects to DB -> oops. ---------------------------------------- To prevent such desynchronization _ZBigFileH needs to be a DataManager which works in sync with the connection it was initially created under - on connection close, unregister from transaction_manager, and on connection open, register to the transaction manager in the current, possibly different, thread context. Then there won't be incorrect beforeCompletion() notification and corruption. 
This issue, besides possible data corruption, was probably also exposing itself via following ways we've seen in production (everywhere connection was migrated from T1 to T2): 1. Exception ZODB.POSException.ConnectionStateError: ConnectionStateError('Cannot close a connection joined to a transaction',) in <bound method Cleanup.__del__ of <App.ZApplication.Cleanup instance at 0x7f10f4bab050>> ignored T1 T2 modify zarray commit/abort # does not join zarray to T2.txn, # because .beforeCompletion() is # registered in T1.txn_manager commit # T1 invokes .beforeCompletion() ... # beforeCompletion() joins ZBigFileH and zarray._p_jar (=T2.conn) to T1.txn ... # commit is going on in progress ... ... close # T2 thinks request handling is done and ... # and closes connection. But T2.conn is ... # still joined to T1.txn 2. Traceback (most recent call last): File ".../wendelin/bigfile/file_zodb.py", line 121, in storeblk def storeblk(self, blk, buf): return self.zself.storeblk(blk, buf) File ".../wendelin/bigfile/file_zodb.py", line 220, in storeblk zblk._v_blkdata = bytes(buf) # FIXME does memcpy File ".../ZODB/Connection.py", line 857, in setstate raise ConnectionStateError(msg) ZODB.POSException.ConnectionStateError: Shouldn't load state for 0x1f23a5 when the connection is closed Similar to "1", but close in T2 happens sooner, so that when T1 does the commit and tries to store object to database, Connection refuses to do the store: T1 T2 modify zarray commit/abort commit ... close ... ... . obj.store() ... ... 3. 
Traceback (most recent call last): File ".../wendelin/bigfile/file_zodb.py", line 121, in storeblk def storeblk(self, blk, buf): return self.zself.storeblk(blk, buf) File ".../wendelin/bigfile/file_zodb.py", line 221, in storeblk zblk._p_changed = True # if zblk was already in DB: _p_state -> CHANGED File ".../ZODB/Connection.py", line 979, in register self._register(obj) File ".../ZODB/Connection.py", line 989, in _register self.transaction_manager.get().join(self) File ".../transaction/_transaction.py", line 220, in join Status.ACTIVE, Status.DOOMED, self.status)) ValueError: expected txn status 'Active' or 'Doomed', but it's 'Committing' ( storeblk() does zblk._p_changed -> Connection.register(zblk) -> txn.join() but txn is already committing IOW storeblk() was invoked with txn.state being already 'Committing' ) T1 T2 modify obj # this way T2.conn joins T2.txn modify zarray commit # T1 invokes .beforeCompletion() ... # beforeCompletion() joins only _ZBigFileH to T1.txn ... # (because T2.conn is already marked as joined) ... ... commit/abort # T2 does commit/abort - this touches only T2.conn, not ZBigFileH ... # in particular T2.conn is now reset to be not joined ... . tpc_begin # actual active commit phase of T1 was somehow delayed a bit . tpc_commit # when changes from RAM propagate to ZODB objects associated . storeblk # connection (= T2.conn !) is notified again, . zblk = ... # wants to join txn for it thinks its transaction_manager, # which when called from under T1 returns *T1* transaction manager for # which T1.txn is already in state='Committing' 4. 
Empty transaction committed to NEO ( different from doing just transaction.commit() without changing any data - a connection was joined to txn, but set of modified object turned out to be empty ) This is probably a race in Connection._register when both T1 and T2 go to it at the same time: https://github.com/zopefoundation/ZODB/blob/3.10/src/ZODB/Connection.py#L988 def _register(self, obj=None): if self._needs_to_join: self.transaction_manager.get().join(self) self._needs_to_join = False T1 T2 modify zarray commit ... .beforeCompletion modify obj . if T2.conn.needs_join if T2.conn.needs_join # race here . T2.conn.join(T1.txn) T2.conn.join(T2.txn) # as a result T2.conn joins both T1.txn and T2.txn . commit finishes # T2.conn registered-for-commit object list is now empty commit tpc_begin storage.tpc_begin tpc_commit # no object stored, because for-commit-list is empty /cc @jm, @klaus, @Tyagov, @vpelletier
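The register-on-open / unregister-on-close idea from this commit can be sketched with a thread-local registry. Everything here is a simplified assumption for illustration: `current_hooks` stands in for a thread-local transaction manager and `Handle` for _ZBigFileH; no ZODB or transaction API is used.

```python
import threading

_local = threading.local()


def current_hooks():
    """Per-thread list of commit hooks (stand-in for a thread-local manager)."""
    if not hasattr(_local, "hooks"):
        _local.hooks = []
    return _local.hooks


class Handle:
    """Registers with the manager of whichever thread opened the connection."""

    def __init__(self):
        self._registered_in = None

    def on_open(self):
        hooks = current_hooks()
        hooks.append(self)
        self._registered_in = hooks

    def on_close(self):
        # Unregister from the manager joined on open, so a later commit
        # in that thread no longer sees this handle.
        self._registered_in.remove(self)
        self._registered_in = None


h = Handle()
main_hooks = current_hooks()
h.on_open()                        # opened in the main thread
in_main_while_open = h in main_hooks
h.on_close()                       # connection goes back to the pool

seen = {}


def worker():
    h.on_open()                    # reopened in a different thread
    seen["worker"] = h in current_hooks()
    seen["main"] = h in main_hooks  # the first thread's manager is untouched
    h.on_close()


t = threading.Thread(target=worker)
t.start()
t.join()
```

With this pairing, a commit in the first thread cannot pick up dirty data that belongs to the thread currently using the connection.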
-
Kirill Smelkov authored
( without dbclose, next test will not be able to open database - will timeout on open on waiting for FileStorage lock )
-
- 27 Jul, 2015 1 commit
-
-
Kirill Smelkov authored
ca064f75 (bigarray: Support resizing in-place) added O(1) in-place BigArray.resize(), which makes it possible for users to append data to BigArray in O(δ) time. But it is easy for people to make off-by-one mistakes when calculating indices for append. So provide a convenient BigArray.append() which simplifies the following

    A                               # ZBigArray e.g. of shape (N, 3)
    values                          # ndarray to append of shape (δ, 3)
    n, δ = len(A), len(values)      # length of A's major index =N
    A.resize((n+δ,) + A.shape[1:])  # add δ new entries ; now len(A) =N+δ
    A[-δ:] = values                 # set data for last new δ entries

into

    A.append(values)

/cc @klaus
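A toy sketch of append built on top of an in-place resize. `Rows` is a hypothetical list-backed container used only to show the index arithmetic; it is not BigArray and does not share its O(1) resize.

```python
class Rows:
    """Toy row container with in-place resize; hypothetical, not BigArray."""

    def __init__(self, ncols):
        self.shape = (0, ncols)
        self._rows = []

    def __len__(self):
        return self.shape[0]

    def resize(self, new_shape):
        n = new_shape[0]
        # Keep existing rows, zero-fill any newly added ones.
        grow = [[0] * new_shape[1] for _ in range(n - len(self._rows))]
        self._rows = self._rows[:n] + grow
        self.shape = tuple(new_shape)

    def append(self, values):
        n, delta = len(self), len(values)
        # (n + delta,) + shape[1:] concatenates tuples, giving a flat shape.
        self.resize((n + delta,) + self.shape[1:])
        self._rows[n:] = [list(v) for v in values]


A = Rows(3)
A.append([[1, 2, 3], [4, 5, 6]])
A.append([[7, 8, 9]])
```

Callers never compute `n + δ` or the tail slice themselves, which is where off-by-one mistakes tend to creep in.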
-
- 24 Jul, 2015 1 commit
-
-
Kirill Smelkov authored
We stopped using numpy.multiply in 73926487 (*: It is not safe to use multiply.reduce() - it overflows).
-
- 26 Jun, 2015 2 commits
-
-
Kirill Smelkov authored
We compare A_[10*PS-1] (which is A_[1]) to 0, but A_= ndarray ((10*PS,), uint8) and that means the array memory is not initialized. So the comparison works sometimes and sometimes it does not. Initialize compared element explicitly. NOTE: A (without _) element does not need to be initialized - because not-initialized BigArray parts read as zeros.
-
Kirill Smelkov authored
Previously we were always testing with DBs backed by FileStorage. Now we provide a way to run the testsuite with a user-selected storage backend:

    $ WENDELIN_CORE_TEST_DB="<fs>" make test.py           # test with temporary db with FileStorage
    $ WENDELIN_CORE_TEST_DB="<zeo>" make test.py          # ----------//---------- with ZEO
    $ WENDELIN_CORE_TEST_DB="<neo>" make test.py          # ----------//---------- with NEO
    $ WENDELIN_CORE_TEST_DB=neo://db@master make test.py  # test with externally provided DB

Default is still to run tests with FileStorage.

/cc @jm
-
- 25 Jun, 2015 1 commit
-
-
Kirill Smelkov authored
Factor out those routines to open a ZODB database to common place. The reason for doing so is that we'll soon teach dbopen to automatically recognize several protocols, e.g. neo:// and zeo:// and this way, clients who use dbopen() could automatically access storages besides FileStorage.
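A rough sketch of what scheme-based dispatch could look like. `storage_kind` is a hypothetical helper, and the `zeo://localhost:8100` URI form is an assumption made for the example; the actual dbopen logic is not shown in this log.

```python
from urllib.parse import urlsplit


def storage_kind(uri):
    """Classify a database URI by scheme; anything else is treated as a file path."""
    scheme = urlsplit(uri).scheme
    if scheme in ("neo", "zeo"):
        return scheme
    return "file"


kinds = [
    storage_kind("neo://db@master"),
    storage_kind("zeo://localhost:8100"),  # assumed URI form
    storage_kind("/tmp/data.fs"),
]
```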
-
- 02 Jun, 2015 2 commits
-
-
Kirill Smelkov authored
BigArrays can be big - up to 2^64 bytes, and thus in general it is not possible to represent a whole BigArray as an ndarray view, because address space is usually smaller on 64bit architectures. However users often try to pass BigArrays to numpy functions as-is, and numpy finds a way to convert, or start converting, BigArray to ndarray - via detecting it as a sequence, and extracting elements one-by-one. Which is slooooow. Because of the above, we provide users a well-defined service:

- if virtual address space is available - we succeed at creating an ndarray view for the whole BigArray, without delay and copying.
- if not - we properly report the error and give a hint on how BigArrays have to be processed in chunks.

Verifying that big BigArrays cannot be converted to ndarray also tests for behaviour and issues fixed in the last 5 patches.

/cc @Tyagov /cc @klaus
-
Kirill Smelkov authored
e.g.

    In [1]: multiply.reduce((1<<30, 1<<30, 1<<30))
    Out[1]: 0

instead of

    In [2]: (1<<30) * (1<<30) * (1<<30)
    Out[2]: 1237940039285380274899124224

    In [3]: 1<<90
    Out[3]: 1237940039285380274899124224

also multiply.reduce returns int64, instead of python int:

    In [4]: type( multiply.reduce([1,2,3]) )
    Out[4]: numpy.int64

which also leads to overflow-related problems if we further compute with this value and other integers and the result exceeds int64 - it becomes float:

    In [5]: idx0_stop = 18446744073709551615
    In [6]: stride0 = numpy.int64(1)
    In [7]: byte0_stop = idx0_stop * stride0
    In [8]: byte0_stop
    Out[8]: 1.8446744073709552e+19

and then it becomes a real problem for BigArray.__getitem__()

    wendelin.core/bigarray/__init__.py:326: RuntimeWarning: overflow encountered in long_scalars
      page0_min = min(byte0_start, byte0_stop+byte0_stride) // pagesize # TODO -> fileh.pagesize

and then

    > vma0 = self._fileh.mmap(page0_min, page0_max-page0_min+1)
    E TypeError: integer argument expected, got float

~~~~

So just avoid multiply.reduce() and do our own mul() properly, the same way sum() is built into python, and we avoid overflow-related problems.
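A minimal sketch of such a `mul()` helper. The implementation via `functools.reduce` is an assumption for illustration; the commit itself may implement it differently.

```python
from functools import reduce
import operator


def mul(seq):
    """Product of a sequence using Python ints, which do not overflow."""
    return reduce(operator.mul, seq, 1)


big = mul((1 << 30, 1 << 30, 1 << 30))
small = mul([1, 2, 3])
empty = mul([])
```

Since the operands stay plain Python integers, the result keeps arbitrary precision and never silently turns into a float.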
-
- 28 May, 2015 3 commits
-
-
Kirill Smelkov authored
It was hanging with NumPy-1.9 before 425dc5d1 (bigarray: Raise IndexError for out-of-bound element access), because of the following correct NumPy commit: https://github.com/numpy/numpy/commit/d36f8227 and in particular https://github.com/numpy/numpy/commit/d36f8227#diff-6d326badc0872de91e025cbfb0be1aafR522 That PySequence_Fast(obj) (with obj being BigArray) creates iterator on top of obj and before our previous IndexError fix in 425dc5d1, this was looping forever. Test explicitly with both NumPy 1.8 and NumPy 1.9, that this construct does not hang. /cc @Tyagov
-
Kirill Smelkov authored
The way BigArray.__getitem__ works for element access is that for e.g. A[i] it translates the request to A[i:i+1] and remembers to lower the dimensionality at the scalar index

    dim_adjust = (0,)

so, in full, A[i] is computed this way:

    A[i] -> A[i:i+1](0,)

( it is done this way to unify code for scalar / slice access in __getitem__ - see 0c826d5c "BigArray: An ndarray-like on top of BigFile memory mappings" )

The code for slice access also has a shortcut - if it sees that the slice results in an empty array (e.g. for an out-of-bound slice), we can avoid spending time creating a file vma mapping only to create an empty view on top of it. In 0c826d5c that optimization, however, forgot to apply the "lower the dimensionality" step on top of the resulting empty view, and as a result IndexError was not raised for out-of-bounds scalar access:

    A = BigArray((10,), uint8)
    In [1]: A[0]
    Out[1]: 0
    In [2]: A[1]
    Out[2]: 0
    In [3]: A[2]
    Out[3]: 0
    In [4]: A[9]
    Out[4]: 0
    In [5]: A[10]
    Out[5]: array([], dtype=uint8)

NOTE that A[10] returns an empty array instead of raising IndexError. So do not forget to apply the "reduce dimensionality" step for empty views, and this way we get a proper IndexError (because for an empty view, scalar access results in IndexError).

NOTE: this bug was also preventing e.g. list(A) from working, because list(A) internally works this way:

    l = []
    i = iter(A)
    for _ in i:
        l.append(_)

but iterating would not stop after 10 elements - after the array end, _ would always be array([], dtype=uint8), and thus the loop never finished and memory usage grew without bound.

/cc @Tyagov
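The "scalar access as a one-element slice" rule can be sketched on a toy list-backed sequence. `Seq` is a hypothetical class for illustration, not BigArray.

```python
class Seq:
    """Toy 1-D sequence used only to show the indexing rule."""

    def __init__(self, data):
        self._data = list(data)

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            return self._data[idx]
        if idx < 0:
            idx += len(self._data)
            if idx < 0:
                raise IndexError("index out of range")
        view = self._data[idx:idx + 1]  # scalar access expressed as a slice
        # Lowering the dimensionality must also happen for an empty view:
        # indexing it raises IndexError for an out-of-bound idx.
        return view[0]


s = Seq(range(10))
last = s[9]
try:
    s[10]
    raised = False
except IndexError:
    raised = True

# Iteration via __getitem__ stops at the first IndexError, so list() terminates.
as_list = list(s)
```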
-
Kirill Smelkov authored
In NumPy speak, advanced indexing is picking up arbitrarily requested elements, e.g.

    a = arange(10)
    a[[0,3,2]] -> array([0, 3, 2])

The way this indexing scheme works is - it creates a new array with len = len(key), and picks up the requested elements sequentially into the new area. So it is very much not the same as creating a _view_ of the original array data by using basic indexing [1].

BigArray does not support advanced indexing, because its main job is to organize an ndarray _view_ backed by BigFile data and give that view to clients, and then it is up to clients how to use that view with the full numpy api available on it. So be explicit, and reject advanced indexing in __getitem__ right at the beginning.

[1] http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
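An up-front check of the index key might look like the sketch below. `check_basic_index` is a hypothetical helper; the accepted forms follow NumPy's notion of basic indexing, and both the exception type and whether BigArray accepts all of these forms are assumptions.

```python
def check_basic_index(key):
    """Raise TypeError unless key uses only ints, slices, Ellipsis or None."""
    items = key if isinstance(key, tuple) else (key,)
    basic = (int, slice, type(Ellipsis), type(None))
    for k in items:
        # bool is a subclass of int, but boolean keys mean mask indexing.
        if isinstance(k, bool) or not isinstance(k, basic):
            raise TypeError("advanced indexing is not supported: %r" % (k,))
    return items


ok = check_basic_index((slice(0, 3), 2))
try:
    check_basic_index([0, 3, 2])  # a list key requests advanced indexing
    rejected = False
except TypeError:
    rejected = True
```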
-
- 20 May, 2015 2 commits
-
-
Kirill Smelkov authored
In NumPy, ndarray has .resize() but actually it does a whole array copy into a newly allocated larger segment, which makes e.g. appending O(n). For BigArray, we don't have the internal constraint NumPy has - to keep the array itself contiguously _stored_ (compare to contiguously _presented_ in memory). So we can have O(1) resize for big arrays.

NOTE having O(1) resize, here is how O(δ) append can be done:

    A                               # ZBigArray e.g. of shape (N, 3)
    n = len(A)                      # length of A's major index =N
    A.resize((n+δ,) + A.shape[1:])  # add δ new entries ; now len(A) =N+δ
    A[-δ:] = <new-data>             # set data for last new δ entries

/cc @klaus
-
Kirill Smelkov authored
test_bigarray_indexing_Nd() contains useful class to have a BigFile connected to ndarray storage. Factor it out so that all tests could use it. BigFile_Data.storeblk() is newly introduced and is currently unused, but will be convenient to have later.
-
- 03 Apr, 2015 2 commits
-
-
Kirill Smelkov authored
This is to BigArray what ZBigFile is to BigFile (4174b84a "bigfile: BigFile backend to store data in ZODB")
-
Kirill Smelkov authored
I.e. something like numpy.memmap for numpy.ndarray and OS files. The whole bigarray cannot be used as a drop-in replacement for numpy arrays, but BigArray _slices_ are real ndarrays and can be used everywhere ndarray can be used, including in C/Fortran code. Slice size is limited by mapping-size (= address-space size) limit, i.e. to ~ max 127TB on Linux/amd64. Changes to bigarray memory are changes to bigfile memory mapping and as such can be discarded or saved back to bigfile using mapping (= BigFileH) dirty discard/writeout interface. For the same reason the whole amount of changes to memory is limited by amount of physical RAM.
-