1. 18 Dec, 2019 2 commits
  2. 12 Jul, 2019 2 commits
      *: Use defer for dbclose & friends · 5c8340d2
      Kirill Smelkov authored
      For tests this makes sure that if one test fails, the following tests do
      not also fail merely because they cannot lock the test database.
      
      For regular code (demo_zbigarray.py) this is also a good thing to do -
      to always close the database regardless of whether an exception was
      raised before the program reached the end of main.
      
      Pygolang becomes a regular - not test-only - dependency. Being a regular
      dependency is currently required only by demo_zbigarray.py, but it will
      also be used in the upcoming wcfs, so adding pygolang to wendelin.core
      dependencies aligns with the plan.
      
      dbclose now uses defer almost everywhere - there are still a few places in
      tests where one test function opens/closes the test database multiple
      times - those were not (yet ?) converted.
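The effect of deferring the close can be sketched with the standard library alone. This is not pygolang's `defer`; `contextlib.ExitStack` is used as a stand-in, and `FakeDB`, `dbopen`, `dbclose` and `run` below are hypothetical placeholders rather than the real helpers.

```python
import contextlib

class FakeDB:
    """Hypothetical stand-in for a database that holds an exclusive lock."""
    def __init__(self):
        self.closed = False

def dbopen():
    return FakeDB()

def dbclose(db):
    db.closed = True

def run(body):
    """Open a DB, run body(db), and close the DB however body exits."""
    db = dbopen()
    with contextlib.ExitStack() as deferred:
        deferred.callback(dbclose, db)   # runs on normal exit and on exception
        body(db)

db_seen = []

def failing_test(db):
    db_seen.append(db)
    raise RuntimeError("test failed")

try:
    run(failing_test)
except RuntimeError:
    pass                                 # the failure itself still propagates
```

Even though `failing_test` raises, the database it received ends up closed, so a following test would be able to take the lock.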
      5c8340d2
      */tests: Use pytest.raises in modern way · b12e319e
      Kirill Smelkov authored
      Instead of
      
      	raises(Exception, 'code')
      
      do
      
      	with raises(Exception):
      		code
      
      This removes lots of warnings similar to the example below:
      
      	bigfile/tests/test_basic.py::test_basic
      	  /home/kirr/src/wendelin/wendelin.core/bigfile/tests/test_basic.py:79: PytestDeprecationWarning: raises(..., 'code(as_a_string)') is deprecated, use the context manager form or use `exec()` directly
      
      	  See https://docs.pytest.org/en/latest/deprecations.html#raises-warns-exec
      	    raises(ROAttributeError, "f.blksize = 1") # RO attribute
      b12e319e
  3. 29 Oct, 2018 1 commit
  4. 12 Oct, 2018 1 commit
      bigarray: RAMArray · fc9b69d8
      Kirill Smelkov authored
      RAMArray is compatible with ZBigArray in API and semantics, but stores its
      data in RAM only. It is useful in situations where a ZBigArray-compatible
      data type is needed, but the amount of data is small and the data itself
      is needed only temporarily - e.g. in a simulation.
      
      Implementation is based on mmapping temporary files from /dev/shm/... and
      passing them as file handles, similarly to how ZBigArray works, to BigArray.
      We don't use just numpy.ndarray because of append - for ZBigArray append
      works in O(1), but more importantly it does not copy data. This way
      mmappings previously created for ZBigArray views continue to correctly
      alias array data. If we used ndarray directly, that property would not be
      preserved, since ndarray.resize copies data.
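The aliasing property can be illustrated with plain `mmap` on a temporary file. This is only a sketch under assumptions: it uses an ordinary temporary file instead of /dev/shm, relies on POSIX shared-mapping behaviour, and does not use any wendelin.core API.

```python
import mmap
import tempfile

PAGE = mmap.ALLOCATIONGRANULARITY

with tempfile.TemporaryFile() as f:
    f.truncate(PAGE)                        # file holds one page of zeros
    old_view = mmap.mmap(f.fileno(), PAGE)  # mapping created before the "append"

    f.truncate(2 * PAGE)                    # grow the file in place, nothing is copied
    new_view = mmap.mmap(f.fileno(), 2 * PAGE)

    new_view[0:5] = b"hello"                # write through the larger mapping
    aliased = bytes(old_view[0:5])          # the older mapping sees the same bytes

    old_view.close()
    new_view.close()
```

Because both mappings are backed by the same file pages, growing the file does not invalidate the earlier mapping, which is the behaviour a copying resize cannot provide.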
      
      Original patch by Klaus Wölfel <klaus@nexedi.com>
      (nexedi/wendelin.core!8)
      fc9b69d8
  5. 11 Oct, 2018 1 commit
      bigarray/tests: Factor out a way to specify on which BigFile/BigFileH an array... · 7365979b
      Kirill Smelkov authored
      bigarray/tests: Factor out a way to specify on which BigFile/BigFileH an array is tested into fixture parameter
      
      Currently we have only one BigFile and its BigFileH handle. However in
      the next patch, for RAMArray, we'll be adding handles for opened RAM
      files, and it would be good to test whole BigArray functionality on
      data served by those handles too.
      
      Prepare for this and first factor out into testbig fixture the way to
      open such handles.
      7365979b
  6. 02 Apr, 2018 2 commits
      bigarray: ArrayRef support for BigArray · 450ad804
      Kirill Smelkov authored
      Rationale
      ---------
      
      Array reference could be useful in situations where one needs to pass arrays
      between processes and, instead of copying array data, leverage the fact that
      the top-level array, for example ZBigArray, is already persisted separately,
      and send only a small amount of information referencing the data in question.
      
      Implementation
      --------------
      
      BigArray is not a regular NumPy array and so needs explicit support in
      ArrayRef code to find the root object and indices. This patch adds such
      support in the following way:
      
      - when BigArray.__getitem__ creates VMA, it remembers in the VMA
        the top-level BigArray object under which this VMA was created.
      
      - when ArrayRef is looking for the root, it can detect such a VMA, because
        the topmost regular ndarray's .base points to it, and in turn get the
        top-level BigArray object from the VMA.
      
      - all further index computations are performed, as in the purely regular
        ndarray case, on ndarrays root and a. But in the end .lo and .hi are
        adjusted by the offset of where root is inside the whole
        BigArray.
      
      - there is no need to adjust .deref() at all.
      
      To remember information in a VMA, and also to be able to get its mapping
      addresses (read-only), the _bigfile.c extension has to be extended
      a bit. Since we are now storing an arbitrary Python object attached to
      PyVMA, it can create cycles, and so PyVMA is adjusted accordingly to
      support the cyclic garbage collector.
      
      Please see the patch itself for more details and comments.
      450ad804
      bigarray: Add ArrayRef utility · d53371b6
      Kirill Smelkov authored
      ArrayRef is a tool to find, for a NumPy array, its top-level root
      parent and to remember instructions on how to recreate the original array
      from the root. For example if
      
      	root = arange(1E7)
      	z = root[1000:2000]
      	a = z[10:20]
      
      `ArrayRef(a)` will find out that the root array for `a` is `root` and
      that `a` occupies elements 1010:1020 of it. The reverse operation is
      also possible, for example given
      
      	aref = ArrayRef(a)
      
      it is possible to restore original `a` from `aref`:
      
      	a_ = aref.deref()
      	assert array_equal(a_, a)
      
      The restoration works without copying, by creating an appropriate view of
      root.
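The root-plus-range idea can be sketched with the standard library's `memoryview` in place of NumPy. This is an analogue only, not the ArrayRef implementation: `BytesRef` and `_addr` are hypothetical names, and the sketch handles just contiguous one-dimensional byte views over a writable buffer.

```python
import ctypes

def _addr(buf):
    """Address of the first byte of a writable, contiguous buffer."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf))

class BytesRef:
    """Toy reference: remembers the root buffer and the byte range of a view."""
    def __init__(self, view):
        self.root = view.obj                      # object that owns the memory
        self.lo = _addr(view) - _addr(self.root)  # offset of the view inside root
        self.hi = self.lo + view.nbytes

    def deref(self):
        # Recreate the view from root without copying any data.
        return memoryview(self.root)[self.lo:self.hi]

root = bytearray(4000)
z = memoryview(root)[1000:2000]
a = z[10:20]

aref = BytesRef(a)
a_ = aref.deref()
a_[0] = 7          # writes through to root, and therefore to `a` as well
```

Only `root`, `lo` and `hi` need to be transferred to recreate the view, which is what makes a reference cheaper to pass around than the data itself.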
      
      ArrayRef should work reliably for arrays of arbitrary dimensions,
      strides etc - even fancy arrays created via stride tricks such as arrays
      whose elements overlap each other should be supported.
      
      This patch adds ArrayRef with support for regular ndarrays only.
      
      The next patch will add ArrayRef support for BigArray and description
      for ArrayRef rationale.
      d53371b6
  7. 24 Oct, 2017 1 commit
      Relicense to GPLv3+ with wide exception for all Free Software / Open Source... · f11386a4
      Kirill Smelkov authored
      Relicense to GPLv3+ with wide exception for all Free Software / Open Source projects + Business options.
      
      Nexedi stack is licensed under Free Software licenses with various exceptions
      that cover three business cases:
      
      - Free Software
      - Proprietary Software
      - Rebranding
      
      As long as one intends to develop Free Software based on Nexedi stack, no
      license cost is involved. Developing proprietary software based on Nexedi stack
      may require a proprietary exception license. Rebranding Nexedi stack is
      prohibited unless rebranding license is acquired.
      
      Through this licensing approach, Nexedi expects to encourage Free Software
      development without restrictions and at the same time create a framework for
      proprietary software to contribute to the long term sustainability of the
      Nexedi stack.
      
      Please see https://www.nexedi.com/licensing for details, rationale and options.
      f11386a4
  8. 16 Mar, 2017 1 commit
  9. 10 Mar, 2017 1 commit
      bigarray: Explicitly reject dtypes with object inside · e44bd761
      Kirill Smelkov authored
      From time to time people keep trying to use wendelin.core with
      dtype=object arrays and get segfaults with nothing in the logs or
      anywhere else.
      
      Wendelin.core does not support it, because with dtype=object the elements are
      really pointers, and the data for each object is stored in a separate place
      in RAM with a different per-object size.
      
      Since we memory-map arrays, this cannot work. For the same reason it
      essentially does not work for numpy.memmap either:
      
          (z4+numpy) kirr@mini:~/src/wendelin$ dd if=/dev/zero of=zero.dat bs=128 count=1
          1+0 records in
          1+0 records out
          128 bytes copied, 0.000209873 s, 610 kB/s
          (z4+numpy) kirr@mini:~/src/wendelin$ dd if=/dev/urandom of=random.dat bs=128 count=1
          1+0 records in
          1+0 records out
          128 bytes copied, 0.000225726 s, 567 kB/s
          (z4+numpy) kirr@mini:~/src/wendelin$ ipython
          ...
      
          In [1]: import numpy as np
      
          In [2]: np.memmap('zero.dat', dtype=np.object)
          Out[2]:
          memmap([None, None, None, None, None, None, None, None, None, None, None,
                 None, None, None, None, None], dtype=object)
      
          In [3]: np.memmap('random.dat', dtype=np.object)
          Out[3]: Segmentation fault
      
      So let's make this clear to users by explicitly raising an exception when
      someone tries to create a BigArray with an unsupported dtype, and by also
      logging a descriptive explanation.
      
      /reviewed-on nexedi/wendelin.core!4
      e44bd761
  10. 18 Dec, 2015 1 commit
      ZBigArray: Compatibility fix to read arrays from DB that were previously saved without order info · 2ca0f076
      Kirill Smelkov authored
      Commit ab9ca2df (bigarray: Add support for FORTRAN ordering) added
      the ability to define array order, but there I made the mistake of not
      caring about how arrays previously saved to DB would be read back.
      
      The thing is that BigArray gained a new data member ._order which is
      automatically saved to DB thanks to ZBigArray inheriting from
      Persistent; on the load-from-db path we just read object state from DB,
      which for ZBigArray is a dict, and restore object attributes from it.
      
      But for previously saved data there is, obviously, no 'order' entry, so
      objects restored this way do not fully meet current code expectations,
      and it can blow up e.g. this way:
      
          zarray.resize((new_one,old_shape[1]))
        Module wendelin.bigarray, line 190, in resize
          self._init0(new_shape, self.dtype, order=self._order)
        AttributeError: 'ZBigArray' object has no attribute '_order'
      
      The fix is: on the restore-from-DB path, check whether a data member is
      missing on the restored object, and if it has a default value in BigArray,
      set it to that.
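A minimal sketch of that idea follows. `Array` is a hypothetical stand-in rather than the real BigArray/ZBigArray, and `inspect.signature` is used here as one way to read keyword defaults; the actual patch may obtain them differently.

```python
import inspect

class Array:
    """Hypothetical stand-in for a persistent array class."""
    def __init__(self, shape, dtype, order='C'):
        self._shape = shape
        self._dtype = dtype
        self._order = order

    def __setstate__(self, state):
        # Restore whatever was saved...
        self.__dict__.update(state)
        # ...then fill in attributes that older saved states lack, taking the
        # value from the matching __init__ keyword default when there is one.
        params = inspect.signature(Array.__init__).parameters
        for name, param in params.items():
            attr = '_' + name
            has_default = param.default is not inspect.Parameter.empty
            if name != 'self' and has_default and attr not in self.__dict__:
                setattr(self, attr, param.default)

# State as it would have been saved before `order` existed.
old_state = {'_shape': (10, 3), '_dtype': 'uint8'}
restored = Array.__new__(Array)
restored.__setstate__(old_state)
```

Attributes without a default, such as the shape, are deliberately left alone: there is no safe value to invent for them.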
      
      ( code to get function defaults is from
        http://stackoverflow.com/questions/12627118/get-a-function-arguments-default-value )
      
      /cc @Tyagov, @klaus
      2ca0f076
  11. 15 Dec, 2015 1 commit
      bigfile/tests: move NotifyChannel to test_thread.py · 99cd1f03
      Kirill Smelkov authored
      NotifyChannel was introduced in c7c01ce4 (bigfile/zodb: ZODB.Connection
      can migrate between threads on close/open and we have to care) to test
      thread interaction specific to ZODB.
      
      We'll however need NotifyChannel to do more threading tests of the virtmem
      core, so the proper place for NotifyChannel is test_thread.py
      itself.
      
      Move it.
      99cd1f03
  12. 02 Nov, 2015 1 commit
  13. 21 Sep, 2015 2 commits
      bigarray: Fix __getitem__ for cases where element overlaps with edge between pages · e5b7c31b
      Kirill Smelkov authored
      When we serve an indexing request, we first compute, based on the major
      index range, the page range in the backing file that contains the result,
      then mmap that file range and pick up the result from there.
      
      Page range math was however not correct: e.g. for positive strides, the last
      element's byte is (byte0_stop-1), NOT (byte0_stop - byte0_stride), which
      for cases where byte0_stop is just a bit after a page boundary can make a
      difference - page_max will be 1 page less than it should be and then
      the whole ndarray view creation breaks:
      
          ...
          Module wendelin.bigarray, line 381, in __getitem__
            view0 = ndarray(view0_shape, self._dtype, vma0, view0_offset, view0_stridev)
        ValueError: strides is incompatible with shape of requested array and size of buffer
      
      ( because vma0 was created smaller than what is needed to create a view0_shape
        shaped array starting from view0_offset in vma0. )
      
      Similar story for negative strides math - it was not correct either.
      
      Fix it.
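The off-by-a-page effect for positive strides can be reproduced with plain integer arithmetic. The function names and the simplified model (every row occupies exactly `stride` bytes, step 1, non-empty range) are assumptions made for illustration, not the code from `__getitem__`.

```python
def page_range(idx_start, idx_stop, stride, pagesize):
    """Pages [page_min, page_max] covering rows idx_start..idx_stop-1.

    Assumes idx_start < idx_stop and stride > 0.
    """
    byte_start = idx_start * stride
    byte_stop = idx_stop * stride            # one past the last byte used
    page_min = byte_start // pagesize
    page_max = (byte_stop - 1) // pagesize   # page holding the very last byte
    return page_min, page_max

def page_range_buggy(idx_start, idx_stop, stride, pagesize):
    byte_start = idx_start * stride
    byte_stop = idx_stop * stride
    # Uses the first byte of the last row, so a row straddling a page
    # boundary loses its tail page.
    return byte_start // pagesize, (byte_stop - stride) // pagesize

# 342 rows of 12 bytes: the last row occupies bytes 4092..4103 and
# therefore crosses the 4096-byte page boundary.
good = page_range(0, 342, 12, 4096)
bad = page_range_buggy(0, 342, 12, 4096)
```

With the buggy formula only page 0 would be mapped, leaving the buffer too small for the requested view.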
      
      /reported-by @Camata
      e5b7c31b
      bigarray/tests: Factor-out generic read-only BigFile_Data · 386ae339
      Kirill Smelkov authored
      We'll need this class in tests in the next patch.
      386ae339
  14. 02 Sep, 2015 1 commit
  15. 18 Aug, 2015 3 commits
  16. 17 Aug, 2015 1 commit
      bigfile: ZODB -> BigFileH invalidate propagation · 92bfd03e
      Kirill Smelkov authored
      Continuing theme from the previous patch, here is propagation of
      invalidation messages from ZODB to BigFileH memory.
      
      The use case here is that e.g. one fileh mapping was created in one
      connection and another in a second connection; after changes are made and
      committed in the second connection, the first fileh has to invalidate
      the affected already-loaded pages, so that its next transaction does not
      work with stale data.
      
      To do it, we hook into ZBlk._p_invalidate() and propagate the
      invalidation message to ZBigFile which then notifies all
      opened-through-it ZBigFileH to invalidate a page.
      
      ZBlk -> ZBigFile lookup is done without storing backpointer in ZODB -
      instead, every time ZBigFile touches ZBlk object (and thus potentially
      does GHOST -> Live transition to it), we (re-)bind it back to ZBigFile.
      Since ZBigFile is the only class that works with ZBlk objects it is safe
      to do so.
      
      For ZBigFile to notify "all-opened-through-it" ZBigFileH, a weakset is
      introduced to track them.
      
      Otherwise the real page invalidation work is done by virtmem (see
      previous patch).
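The "notify all handles opened through a file" part can be sketched with `weakref.WeakSet`. The `File` and `FileH` classes and their methods are hypothetical toys; they only model the bookkeeping, not ZODB, ZBlk or virtmem.

```python
import weakref

class FileH:
    """Hypothetical handle with a cache of loaded pages."""
    def __init__(self, zfile):
        self.loaded = {}                 # blk number -> cached data
        zfile._handles.add(self)         # tracked weakly by the file

    def invalidate_page(self, blk):
        self.loaded.pop(blk, None)       # next access must reload from DB

class File:
    """Hypothetical file that fans invalidations out to its open handles."""
    def __init__(self):
        self._handles = weakref.WeakSet()

    def fileh_open(self):
        return FileH(self)

    def invalidate_blk(self, blk):
        # Copy to a list so the set may change while we iterate.
        for fileh in list(self._handles):
            fileh.invalidate_page(blk)

f = File()
h1, h2 = f.fileh_open(), f.fileh_open()
h1.loaded[5] = b"stale"
h2.loaded[5] = b"stale"
h1.loaded[6] = b"fresh"
f.invalidate_blk(5)
```

Using a weak set means the file never keeps a handle alive on its own: once the last strong reference to a handle is gone, it simply stops being notified.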
      92bfd03e
  17. 12 Aug, 2015 2 commits
      bigfile/zodb: ZODB.Connection can migrate between threads on close/open and we have to care · c7c01ce4
      Kirill Smelkov authored
      Intro
      -----
      
      ZODB maintains a pool of opened-to-DB connections. For each request Zope
      opens 1 connection and, after request handling is done, returns the
      connection back to the ZODB pool (via Connection.close()). The same
      connection will be opened again for handling some future request at
      some future time. This next open can happen in a worker thread different
      from the one that served the first request.
      
      TransactionManager  (as accessed by transaction.{get,commit,abort,...})
      is thread-local, that is e.g. transaction.get() returns different
      transaction for threads T1 and T2.
      
      When _ZBigFileH hooks into txn_manager to get a chance to run its
      .beforeCompletion() when transaction.commit() is run, it hooks into
      _current_ _thread_ transaction manager.
      
      Without unhooking on connection close, and in circumstances where the
      connection migrates to a different thread, this can lead to
      desynchronization between ZBigFileH managing fileh pages and Connection
      with ZODB objects. And even to data corruption, e.g.
      
          T1              T2
      
          open
          zarray[0] = 11
          commit
          close
      
                          open                # opens connection as closed in T1
          open
                          zarray[0] = 21
          commit
                          abort
      
          close           close
      
      Here zarray[0]=21 _will_ be committed by T1 as part of T1 transaction -
      because when T1 does commit, .beforeCompletion() for zarray is invoked,
      sees there is dirty data and propagates changes to zodb objects in
      connection for T2, joins connection for T2 into txn for T1, and then txn
      for T1 when doing two-phase-commit stores modified objects to DB ->
      oops.
      
      ----------------------------------------
      
      To prevent such desynchronization _ZBigFileH needs to be a DataManager
      which works in sync with the connection it was initially created under -
      on connection close, unregister from transaction_manager, and on
      connection open, register to transaction manager in current, possibly
      different, thread context. Then there won't be incorrect
      beforeCompletion() notification and corruption.
      
      This issue, besides possible data corruption, was probably also showing
      itself in the following ways we have seen in production (everywhere the
      connection was migrated from T1 to T2):
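The register-on-open / unregister-on-close discipline can be modelled with `threading.local`. Everything here is a hypothetical toy (`current_manager`, `Handle`, `on_open`, `on_close`); it does not use the `transaction` package or ZODB and only shows which thread's registry ends up holding the hook.

```python
import threading

_local = threading.local()

def current_manager():
    """Per-thread registry, loosely modelling a thread-local transaction manager."""
    if not hasattr(_local, "hooks"):
        _local.hooks = []
    return _local.hooks

class Handle:
    """Hypothetical handle that must follow its connection across threads."""
    def __init__(self):
        self._registered_in = None

    def on_open(self):
        # Register with the manager of the thread that opens the connection.
        self._registered_in = current_manager()
        self._registered_in.append(self)

    def on_close(self):
        # Unregister from the manager we were registered with.
        self._registered_in.remove(self)
        self._registered_in = None

h = Handle()
h.on_open()                      # "T1" opens the connection
t1_hooks = current_manager()
h.on_close()                     # "T1" returns it to the pool

seen = {}

def t2():
    h.on_open()                  # "T2" reopens the same connection
    seen["t2_has_h"] = h in current_manager()
    seen["t1_has_h"] = h in t1_hooks
    h.on_close()

t = threading.Thread(target=t2)
t.start()
t.join()
```

While the second thread uses the connection, the hook lives only in that thread's registry, so a commit in the first thread has nothing of this handle to call.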
      
      1. Exception ZODB.POSException.ConnectionStateError:
              ConnectionStateError('Cannot close a connection joined to a transaction',)
              in <bound method Cleanup.__del__ of <App.ZApplication.Cleanup instance at 0x7f10f4bab050>> ignored
      
           T1          T2
      
                       modify zarray
                       commit/abort    # does not join zarray to T2.txn,
                                       # because .beforeCompletion() is
                                       # registered in T1.txn_manager
      
           commit                      # T1 invokes .beforeCompletion()
           ...                         # beforeCompletion() joins ZBigFileH and zarray._p_jar (=T2.conn) to T1.txn
           ...                         # commit is going on in progress
           ...
           ...         close           # T2 thinks request handling is done and
           ...                         # closes connection. But T2.conn is
           ...                         # still joined to T1.txn
      
      2. Traceback (most recent call last):
           File ".../wendelin/bigfile/file_zodb.py", line 121, in storeblk
             def storeblk(self, blk, buf):   return self.zself.storeblk(blk, buf)
           File ".../wendelin/bigfile/file_zodb.py", line 220, in storeblk
             zblk._v_blkdata = bytes(buf)    # FIXME does memcpy
           File ".../ZODB/Connection.py", line 857, in setstate
             raise ConnectionStateError(msg)
         ZODB.POSException.ConnectionStateError: Shouldn't load state for 0x1f23a5 when the connection is closed
      
         Similar to "1", but close in T2 happens sooner, so that when T1 does
         the commit and tries to store object to database, Connection refuses to
         do the store:
      
           T1          T2
      
                       modify zarray
                       commit/abort
      
           commit
           ...         close
           ...
           ...
           . obj.store()
           ...
           ...
      
      3. Traceback (most recent call last):
           File ".../wendelin/bigfile/file_zodb.py", line 121, in storeblk
             def storeblk(self, blk, buf):   return self.zself.storeblk(blk, buf)
           File ".../wendelin/bigfile/file_zodb.py", line 221, in storeblk
             zblk._p_changed = True          # if zblk was already in DB: _p_state -> CHANGED
           File ".../ZODB/Connection.py", line 979, in register
             self._register(obj)
           File ".../ZODB/Connection.py", line 989, in _register
             self.transaction_manager.get().join(self)
           File ".../transaction/_transaction.py", line 220, in join
             Status.ACTIVE, Status.DOOMED, self.status))
         ValueError: expected txn status 'Active' or 'Doomed', but it's 'Committing'
      
        ( storeblk() does zblk._p_changed -> Connection.register(zblk) ->
          txn.join() but txn is already committing
      
          IOW storeblk() was invoked with txn.state being already 'Committing' )
      
          T1          T2
      
                      modify obj      # this way T2.conn joins T2.txn
                      modify zarray
      
          commit                      # T1 invokes .beforeCompletion()
          ...                         # beforeCompletion() joins only _ZBigFileH to T1.txn
          ...                         # (because T2.conn is already marked as joined)
          ...
          ...         commit/abort    # T2 does commit/abort - this touches only T2.conn, not ZBigFileH
          ...                         # in particular T2.conn is now reset to be not joined
          ...
          . tpc_begin                 # actual active commit phase of T1 was somehow delayed a bit
          . tpc_commit                # when changes from RAM propagate to ZODB objects associated
          .  storeblk                 # connection (= T2.conn !) is notified again,
          .   zblk = ...              # wants to join txn for it thinks its transaction_manager,
                                      # which when called from under T1 returns *T1* transaction manager for
                                      # which T1.txn is already in state='Committing'
      
      4. Empty transaction committed to NEO
      
         ( different from doing just transaction.commit() without changing
           any data - a connection was joined to txn, but the set of modified
           objects turned out to be empty )
      
         This is probably a race in Connection._register when both T1 and T2
         go to it at the same time:
      
         https://github.com/zopefoundation/ZODB/blob/3.10/src/ZODB/Connection.py#L988
      
         def _register(self, obj=None):
              if self._needs_to_join:
                  self.transaction_manager.get().join(self)
                  self._needs_to_join = False
      
          T1                          T2
      
                                      modify zarray
          commit
          ...
          .beforeCompletion           modify obj
          . if T2.conn.needs_join      if T2.conn.needs_join      # race here
          .   T2.conn.join(T1.txn)       T2.conn.join(T2.txn)     # as a result T2.conn joins both T1.txn and T2.txn
          .
          commit finishes             # T2.conn registered-for-commit object list is now empty
      
                                      commit
                                       tpc_begin
                                        storage.tpc_begin
                                       tpc_commit
                                        # no object stored, because for-commit-list is empty
      
      /cc @jm, @klaus, @Tyagov, @vpelletier
      c7c01ce4
      bigarray/zodb: Forgot to close DB in tests · 070aeaa9
      Kirill Smelkov authored
      ( without dbclose, next test will not be able to open database - will
        timeout on open on waiting for FileStorage lock )
      070aeaa9
  18. 27 Jul, 2015 1 commit
      bigarray: In-place .append() · 1245acc9
      Kirill Smelkov authored
      ca064f75 (bigarray: Support resizing in-place) added O(1) in-place
      BigArray.resize() which makes it possible for users to append data to a
      BigArray in O(δ) time.
      
      But it is easy for people to make off-by-one mistakes when calculating
      indices for append.
      
      So provide a convenient BigArray.append() which simplifies the following
      
          A                               # ZBigArray e.g. of shape       (N, 3)
          values                          # ndarray to append of shape    (δ, 3)
          n, δ = len(A), len(values)      # length of A's major index  =N
          A.resize((n+δ,) + A.shape[1:])  # add δ new entries ; now len(A) =N+δ
          A[-δ:] = values                 # set data for last new δ entries
      
      into
      
          A.append(values)
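How such an append can be layered on top of an in-place resize is sketched below with a toy container. `Rows` is hypothetical and list-backed; it is not BigArray and makes no claim about BigArray's internals.

```python
class Rows:
    """Hypothetical 2-D container with an in-place resize, for illustration."""
    def __init__(self, ncols):
        self.shape = (0, ncols)
        self._data = []

    def __len__(self):
        return self.shape[0]

    def resize(self, new_shape):
        n = new_shape[0]
        # Grow with zero rows or shrink, keeping existing rows in place.
        self._data.extend([0] * new_shape[1] for _ in range(n - len(self._data)))
        del self._data[n:]
        self.shape = tuple(new_shape)

    def append(self, values):
        n, delta = len(self), len(values)
        self.resize((n + delta,) + self.shape[1:])   # note the tuple concatenation
        self._data[n:] = [list(row) for row in values]

A = Rows(3)
A.append([(1, 2, 3), (4, 5, 6)])
A.append([(7, 8, 9)])
```

Keeping the index arithmetic inside `append` is the point: callers no longer compute `n + δ` or the tail slice themselves.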
      
      /cc @klaus
      1245acc9
  19. 24 Jul, 2015 1 commit
  20. 26 Jun, 2015 2 commits
      bigarray: Fix flaky test in test_bigarray_indexing_1d · 9357bac8
      Kirill Smelkov authored
      We compare A_[10*PS-1] (which is A_[-1]) to 0, but
      
          A_= ndarray ((10*PS,), uint8)
      
      and that means the array memory is not initialized. So the comparison
      works sometimes and sometimes it does not.
      
      Initialize compared element explicitly.
      
      NOTE: A (without _) element does not need to be initialized -
      because not-initialized BigArray parts read as zeros.
      9357bac8
      tests: Allow to test with ZEO & NEO ZODB storages · 7fc4ec66
      Kirill Smelkov authored
      Previously we were always testing with DBs backed by FileStorage. Now
      we provide a way to run the testsuite with a user-selected storage
      backend:
      
          $ WENDELIN_CORE_TEST_DB="<fs>"   make test.py     # test with temporary db with FileStorage
          $ WENDELIN_CORE_TEST_DB="<zeo>"  make test.py     # ----------//---------- with ZEO
          $ WENDELIN_CORE_TEST_DB="<neo>"  make test.py     # ----------//---------- with NEO
      
          $ WENDELIN_CORE_TEST_DB=neo://db@master  make test.py     # test with externally provided DB
      
      Default is still to run tests with FileStorage.
      
      /cc @jm
      7fc4ec66
  21. 25 Jun, 2015 1 commit
      Move dbopen(), dbclose() to wendelin.lib.zodb · 72685306
      Kirill Smelkov authored
      Factor out those routines to open a ZODB database into a common place.
      
      The reason for doing so is that we'll soon teach dbopen to automatically
      recognize several protocols, e.g. neo:// and zeo://, and this way
      clients who use dbopen() could automatically access storages besides
      FileStorage.
      72685306
  22. 02 Jun, 2015 3 commits
      bigarray: Teach it how to automatically convert to ndarray (if enough address space is available) · 00db08d6
      Kirill Smelkov authored
      BigArrays can be big - up to 2^64 bytes, and thus in general it is not
      possible to represent a whole BigArray as an ndarray view, because the
      address space is usually smaller on 64bit architectures.
      
      However users often try to pass BigArrays to numpy functions as-is, and
      numpy finds a way to convert, or start converting, BigArray to ndarray -
      via detecting it as a sequence, and extracting elements one-by-one.
      Which is slooooow.
      
      Because of the above, we provide users a well-defined service:
      - if virtual address space is available - we succeed at creating ndarray
        view for whole BigArray, without delay and copying.
      - if not - we report properly the error and give hint how BigArrays have
        to be processed in chunks.
      
      Verifying that big BigArrays cannot be converted to ndarray also tests
      for behaviour and issues fixed in the last 5 patches.
      
      /cc @Tyagov
      /cc @klaus
      00db08d6
      *: It is not safe to use multiply.reduce() - it overflows · 73926487
      Kirill Smelkov authored
      e.g.
      
          In [1]: multiply.reduce((1<<30, 1<<30, 1<<30))
          Out[1]: 0
      
      instead of
      
          In [2]: (1<<30) * (1<<30) * (1<<30)
          Out[2]: 1237940039285380274899124224
      
          In [3]: 1<<90
          Out[3]: 1237940039285380274899124224
      
      also multiply.reduce returns int64, instead of python int:
      
          In [4]: type( multiply.reduce([1,2,3]) )
          Out[4]: numpy.int64
      
      which also leads to overflow-related problems if we further compute with
      this value and other integers and the result exceeds int64 - it becomes
      float:
      
          In [5]: idx0_stop = 18446744073709551615
      
          In [6]: stride0   = numpy.int64(1)
      
          In [7]: byte0_stop = idx0_stop * stride0
      
          In [8]: byte0_stop
          Out[8]: 1.8446744073709552e+19
      
      and then it becomes a real problem for BigArray.__getitem__()
      
          wendelin.core/bigarray/__init__.py:326: RuntimeWarning: overflow encountered in long_scalars
            page0_min  = min(byte0_start, byte0_stop+byte0_stride) // pagesize # TODO -> fileh.pagesize
      
      and then
      
          >           vma0 = self._fileh.mmap(page0_min, page0_max-page0_min+1)
          E           TypeError: integer argument expected, got float
      
      ~~~~
      
      So just avoid multiply.reduce() and do our own mul() properly, the same
      way sum() is built into Python, and we avoid overflow-related
      problems.
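One way to write such a helper is sketched here with `functools.reduce` and `operator.mul`. The name `mul` comes from the message above, but its signature and body are an assumption, not necessarily what the patch contains.

```python
from functools import reduce
import operator

def mul(iterable, start=1):
    """Product of the items using Python ints, which never overflow."""
    return reduce(operator.mul, iterable, start)

product = mul((1 << 30, 1 << 30, 1 << 30))
small = mul([1, 2, 3])
```

Because every intermediate value is a plain Python `int`, the result stays exact and stays an `int`, so later arithmetic does not silently turn into floating point.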
      73926487
      bigarray: Translate OverflowError when computing slice indices to MemoryError · fcbb26e6
      Kirill Smelkov authored
      OverflowError when computing slice indices practically means we
      cannot allocate that much address space at the next step:
      
          In [1]: s = slice(None)
      
          In [2]: s.indices(1<<62)
          Out[2]: (0, 4611686018427387904, 1)
      
          In [3]: s.indices(1<<63)
          ---------------------------------------------------------------------------
          OverflowError                             Traceback (most recent call last)
          <ipython-input-4-5aa549641bc6> in <module>()
          ----> 1 s.indices(1<<63)
      
          OverflowError: cannot fit 'long' into an index-sized integer
      
      So translate this OverflowError into MemoryError (preserving message
      details), because we'll need such "not so much address space" cases to
      show up as MemoryError in an upcoming patch.
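The translation itself is a small wrapper. In this sketch `as_memory_error` and `span_length` are hypothetical names, and `len(range(n))` is used to trigger the overflow because `slice.indices()` on current Python 3 accepts arbitrarily large integers and no longer raises here.

```python
def as_memory_error(compute, *args):
    """Call compute(*args), reporting OverflowError as MemoryError."""
    try:
        return compute(*args)
    except OverflowError as e:
        # Keep the original message so the cause stays visible.
        raise MemoryError(str(e)) from e

def span_length(n):
    return len(range(n))       # overflows once n exceeds the index-sized range

ok = as_memory_error(span_length, 1 << 20)

message = None
try:
    as_memory_error(span_length, 1 << 64)
except MemoryError as e:
    message = str(e)
```

Callers then need to handle only one exception type for "the request is larger than what can be addressed".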
      fcbb26e6
  23. 28 May, 2015 3 commits
      bigarray: Test that asarray(BigArray(...)) does not hang · 0e25b01c
      Kirill Smelkov authored
      It was hanging with NumPy-1.9 before 425dc5d1 (bigarray: Raise
      IndexError for out-of-bound element access), because of the following
      correct NumPy commit:
      
          https://github.com/numpy/numpy/commit/d36f8227
      
      and in particular
      
          https://github.com/numpy/numpy/commit/d36f8227#diff-6d326badc0872de91e025cbfb0be1aafR522
      
      That PySequence_Fast(obj)    (with obj being BigArray)
      
      creates iterator on top of obj and before our previous IndexError fix in
      425dc5d1, this was looping forever.
      
      Test explicitly with both NumPy 1.8 and NumPy 1.9, that this construct
      does not hang.
      
      /cc @Tyagov
      0e25b01c
      bigarray: Raise IndexError for out-of-bound element access · 425dc5d1
      Kirill Smelkov authored
      The way BigArray.__getitem__ works for element access is that for e.g.
      
          A[i]
      
      it translates the request to
      
          A[i:i+1]
      
      and remembers to lower the dimensionality at scalar index
      
          dim_adjust = (0,)
      
      so, in full, A[i] is computed this way:
      
          A[i] -> A[i:i+1](0,)
      
      ( it is done this way to unify code for scalar / slice access in
        __getitem__ - see 0c826d5c "BigArray: An ndarray-like on top of
        BigFile memory mappings" )
      
      The code for slice access also has a shortcut - if it sees that the slice
      results in an empty array (e.g. for an out-of-bound slice), it avoids
      spending time on creating a file vma mapping only to create an empty view
      on top of it.
      
      In 0c826d5c that optimization, however, forgot to apply the "lower the
      dimensionality" step on top of the resulting empty view, and as a result
      IndexError was not raised for out-of-bounds scalar access:
      
          A = BigArray((10,), uint8)
          In [1]: A[0]
          Out[1]: 0
      
          In [2]: A[1]
          Out[2]: 0
      
          In [3]: A[2]
          Out[3]: 0
      
          In [4]: A[9]
          Out[4]: 0
      
          In [5]: A[10]
          Out[5]: array([], dtype=uint8)
      
      NOTE that A[10] returns empty array instead of raising IndexError.
      
      So do not forget to apply the "reduce dimensionality" step for empty
      views, and this way we get proper IndexError (because for empty view,
      scalar access results in IndexError).
      
      NOTE:
      
      this bug was also preventing e.g.
      
          list(A)
      
      from working, because list(A) internally works this way:
      
          l = []
          i = iter(A)
          for _ in i:
              l.append(_)
      
      but iterating would not stop after 10 elements - after the array end, _
      would always be array([], dtype=uint8), and thus the loop never finished
      and memory usage grew without bound.
      
      /cc @Tyagov
      425dc5d1
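Note on 425dc5d1: the "slice, then lower the dimensionality" scheme can be tried on a plain NumPy array. This is only a sketch of the idea for non-negative `i`; the function name `getitem_scalar` is hypothetical and the real `BigArray.__getitem__` handles more cases.

```python
import numpy as np

def getitem_scalar(a, i):
    """Scalar access for non-negative i, expressed through slice access."""
    view = a[i:i + 1]        # slice access: never raises, may be empty
    dim_adjust = (0,)        # lower the dimensionality at the scalar index
    return view[dim_adjust]  # on an empty view this raises IndexError

a = np.arange(10, dtype=np.uint8)

in_bounds = getitem_scalar(a, 9)

try:
    getitem_scalar(a, 10)
    out_of_bounds = None
except IndexError as e:
    out_of_bounds = e
```

Skipping the `dim_adjust` step for the empty view is exactly what makes `A[10]` return `array([], dtype=uint8)` instead of raising.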
    • Kirill Smelkov's avatar
      bigarray: Be explicit about not-supporting advanced indexing · 4680c0cd
      Kirill Smelkov authored
      In NumPy speak advanced indexing is picking up arbitrarily requested
      elements, e.g.
      
          a = arange(10)
          a[[0,3,2]]  -> array([0, 3, 2])
      
      The way this indexing scheme works is - it creates a new array with
      len = len(key), and picks up requested elements sequentially into the
      new area.
      
      So it is not at all the same as creating a _view_ to the original array
      data by using basic indexing [1].
      
      BigArray does not support advanced indexing, because its main job is to
      organize an ndarray _view_ backed up by BigFile data and give that view
      to clients, and then it is up to clients how to use that view with full
      numpy api available with it.
      
      So be explicit, and reject advanced indexing in __getitem__ right at the
      beginning.
      
      [1] http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
      4680c0cd
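Note on 4680c0cd: the view/copy difference can be checked directly in NumPy with `np.shares_memory`. The `is_basic_index` helper is a hypothetical illustration of an up-front key check, assuming a key is "basic" when it is built only from integers, slices, Ellipsis and None; it is not the check BigArray actually uses.

```python
import numpy as np

a = np.arange(10)

basic = a[2:5]          # basic indexing: a view onto a's data
fancy = a[[0, 3, 2]]    # advanced indexing: a new array with copied elements

basic_is_view = np.shares_memory(a, basic)
fancy_is_view = np.shares_memory(a, fancy)

def is_basic_index(key):
    """Return True if key uses only integers, slices, Ellipsis and None."""
    parts = key if isinstance(key, tuple) else (key,)
    for p in parts:
        if p is None or p is Ellipsis or isinstance(p, slice):
            continue
        # bool is a subclass of int, but boolean keys trigger advanced indexing.
        if isinstance(p, (int, np.integer)) and not isinstance(p, bool):
            continue
        return False
    return True
```

Rejecting non-basic keys at the start of `__getitem__` keeps the rest of the code free to assume that the result is always a view.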
  24. 20 May, 2015 3 commits
    • Kirill Smelkov's avatar
      bigarray: Support resizing in-place · ca064f75
      Kirill Smelkov authored
      In NumPy, ndarray has .resize() but actually it does a whole array
      copy into newly allocated larger segment which makes e.g. appending O(n).
      
      For BigArray, we don't have that internal constraint NumPy has - to
      keep the array itself contiguously _stored_ (compare to contiguously
      _presented_ in memory). So we can have O(1) resize for big arrays.
      
      NOTE having O(1) resize, here is how O(δ) append can be done:
      
          A                               # ZBigArray e.g. of shape   (N, 3)
          n = len(A)                      # length of A's major index =N
          A.resize((n+δ,) + A.shape[1:])  # add δ new entries ; now len(A) =N+δ
          A[-δ:] = <new-data>             # set data for last new δ entries
      
      /cc @klaus
      ca064f75
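Note on ca064f75: the append pattern above can be exercised on a plain ndarray, with `ndarray.resize` standing in for the BigArray resize. The helper name `append_rows` is hypothetical, and with NumPy the resize may reallocate and copy, so the sketch shows the calling pattern only, not the O(1) cost.

```python
import numpy as np

def append_rows(a, rows):
    """Grow `a` in place along its major index and store `rows` at the end."""
    rows = np.asarray(rows, dtype=a.dtype)
    n, d = len(a), len(rows)
    if d == 0:
        return a                      # a[-0:] would address the whole array
    # New shape: major index grows by d, the remaining dimensions stay as-is.
    a.resize((n + d,) + a.shape[1:], refcheck=False)
    a[-d:] = rows                     # set data for the d new entries
    return a

A = np.array([[1, 2, 3], [4, 5, 6]])  # owns its data, so it can be resized
append_rows(A, [[7, 8, 9], [10, 11, 12]])
```

Note that the new shape is built by tuple concatenation; nesting `A.shape[1:]` inside the tuple would produce an invalid shape.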
    • Kirill Smelkov's avatar
      bigarray/tests: Factor-out generic BigFile connected to numpy array · 929922fa
      Kirill Smelkov authored
      test_bigarray_indexing_Nd() contains a useful class to have a BigFile
      connected to ndarray storage. Factor it out so that all tests can use
      it.
      
      BigFile_Data.storeblk() is newly introduced and is currently unused, but
      will be convenient to have later.
      929922fa
    • Kirill Smelkov's avatar
      bigarray: Fix typos · 3c7abddb
      Kirill Smelkov authored
      3c7abddb
  25. 03 Apr, 2015 2 commits
    • Kirill Smelkov's avatar
      ZBigArray: in-ZODB stored BigArray · 90d32e51
      Kirill Smelkov authored
      This is to BigArray what ZBigFile is to BigFile (4174b84a
      "bigfile: BigFile backend to store data in ZODB")
      90d32e51
    • Kirill Smelkov's avatar
      BigArray: An ndarray-like on top of BigFile memory mappings · 0c826d5c
      Kirill Smelkov authored
      I.e. something like numpy.memmap for numpy.ndarray and OS files. The whole
      bigarray cannot be used as a drop-in replacement for numpy arrays, but BigArray
      _slices_ are real ndarrays and can be used everywhere ndarray can be used,
      including in C/Fortran code. Slice size is limited by mapping-size (=
      address-space size) limit, i.e. to ~ max 127TB on Linux/amd64.
      
      Changes to bigarray memory are changes to bigfile memory mapping and as such
      can be discarded or saved back to bigfile using mapping (= BigFileH) dirty
      discard/writeout interface.
      
      For the same reason the whole amount of changes to memory is limited by amount
      of physical RAM.
      0c826d5c
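Note on 0c826d5c: for comparison, this is how the analogous `numpy.memmap` workflow looks for an OS file. It is a sketch of the analogy only and does not use the BigArray or BigFileH API; the file name `data.bin` is arbitrary.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.bin")

    # Map a new 1024-byte file into memory.
    m = np.memmap(path, dtype=np.uint8, mode="w+", shape=(1024,))

    # A slice of the mapping is usable wherever an ndarray is expected.
    s = m[10:20]
    slice_is_ndarray = isinstance(s, np.ndarray)

    # Changes go to the mapped memory first, then are written out on flush.
    s[:] = 7
    m.flush()

    with open(path, "rb") as f:
        raw = f.read()

    del s, m  # drop the mapping before the directory is removed
```

In the BigArray case the role of `flush()` is played by the dirty discard/writeout interface of the file handle mentioned in the commit message.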