1. 28 Oct, 2021 2 commits
    • wcfs: client: Provide virtmem integration · 986cf86e
      Kirill Smelkov authored
      Provide integration with virtmem, so that WCFS Mapping can be associated
      and managed under virtmem VMA. In other words provide support so that WCFS can
      be used as ZBigFile backend in "mmap overlay" mode (see fae045cc "bigfile/virtmem:
      Introduce "mmap overlay" mode" for description of mmap-overlay mode).
      
      We'll need this functionality for ZBigFile + WCFS client integration.
      
      Virtmem integration will be tested via running whole wendelin.core functional
      testsuite in wcfs-mode after the next patch.
      
      Quoting added description:
      
      ---- 8< ----
      
      Integration with wendelin.core virtmem layer
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      This client package can be used standalone, but additionally provides
      integration with wendelin.core userspace virtual memory manager: when a
      Mapping is created, it can be associated as serving base layer for a
      particular virtmem VMA via FileH.mmap(vma=...). In that case, since virtmem
      itself adds another layer of dirty pages over read-only base provided by
      Mapping(+)
      
                       ┌──┐                      ┌──┐
                       │RW│                      │RW│    ← virtmem VMA dirty pages
                       └──┘                      └──┘
                                 +
                                                         VMA base = X@at view provided by Mapping:
      
                                                ___        /@revA/bigfile/X
              __                                           /@revB/bigfile/X
                     _                                     /@revC/bigfile/X
                                 +                         ...
           ───  ───── ──────────────────────────   ─────   /head/bigfile/X
      
      the Mapping will interact with virtmem layer to coordinate
      updates to mapping virtual memory.
      
      How it works
      ~~~~~~~~~~~~
      
      The wcfs client integrates with the virtmem layer so that virtmem can handle
      dirtying pages of the read-only base layer that the wcfs client provides via
      an isolated Mapping. For wcfs-backed bigfiles every virtmem VMA is interlinked
      with a Mapping:
      
            VMA     -> BigFileH -> ZBigFile -----> Z
             ↑↓                                    O
           Mapping  -> FileH    -> wcfs server --> DB
      
      When a page is write-accessed, virtmem mmaps in a page of RAM in place of
      accessed virtual memory, copies base-layer content provided by Mapping into
      there, and marks that page as read-write.
      
      Upon receiving a pin message, the pinner asks virtmem whether the
      corresponding page was already dirtied in virtmem's BigFileH (call to
      __fileh_page_isdirty). If it was, the pinner does not remmap the Mapping
      part to wcfs/@revX/f and just leaves the dirty page in place, remembering
      the pin information in fileh._pinned.
      
      Once dirty pages are no longer needed (either after discard/abort or
      writeout/commit), virtmem asks wcfs client to remmap corresponding regions
      of Mapping in its place again via calls to Mapping.remmap_blk for previously
      dirtied blocks.
      
      The scheme outlined above does not need to split Mapping upon dirtying an
      inner page.
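
The pin handling described above can be sketched in C as follows. This is a simplified model, not the wcfs client code: `demo_fileh`, `demo_on_pin` and `demo_remmap_blk` are hypothetical names, revisions are plain integers, and "remapping" is modelled by updating a per-block revision number instead of calling mmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NBLK 8
#define DEMO_HEAD 0     /* block shows head/bigfile/X */

/* Hypothetical, simplified stand-in for the FileH + Mapping state. */
typedef struct {
    bool dirty[DEMO_NBLK];      /* virtmem overlay holds a RW page for blk */
    int  pinned[DEMO_NBLK];     /* revision requested by the last pin      */
    int  mapped_rev[DEMO_NBLK]; /* revision currently shown by base layer  */
} demo_fileh;

/* Pin message: the base layer of blk must show data as of rev. */
static void demo_on_pin(demo_fileh *f, size_t blk, int rev) {
    assert(blk < DEMO_NBLK);
    f->pinned[blk] = rev;           /* always remember the pin */
    if (!f->dirty[blk])
        f->mapped_rev[blk] = rev;   /* clean block: remap the base right away */
    /* dirty block: leave the RW page in place, the base is restored later */
}

/* virtmem no longer needs the dirty page (discard/abort or writeout/commit):
 * put the base layer back, honouring the remembered pin. */
static void demo_remmap_blk(demo_fileh *f, size_t blk) {
    assert(blk < DEMO_NBLK);
    f->dirty[blk] = false;
    f->mapped_rev[blk] = f->pinned[blk];
}
```

Because the dirty page simply shadows the base layer for one block, the model never needs to split the mapping into several pieces, which is the property the scheme relies on.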
      
      See bigfile_ops interface (wendelin/bigfile/file.h) that explains base-layer
      and overlaying from virtmem point of view. For wcfs this interface is
      provided by small wcfs client wrapper in bigfile/file_zodb.cpp.
      
      (+) see bigfile_ops interface (wendelin/bigfile/file.h) that gives virtmem
          point of view on layering.
      
      ----------------------------------------
      
      Some preliminary history:
      
      kirr/wendelin.core@f330bd2f    X wcfs/client: Overview += interaction with virtmem layer
    • bigfile/virtmem: Introduce "mmap overlay" mode · fae045cc
      Kirill Smelkov authored
      with the intention to later use WCFS through it.
      
      Before this patch virtmem had only one mode: a BigFile backend was
      providing loadblk and storeblk methods, and on every block access
      loadblk was called to load block data into allocated RAM page.
      
      However, with WCFS virtmem won't need to do anything to load data,
      because loading from head/bigfile/f mmapped through the OS will be handled
      by the OS directly. Thus for wcfs, that leaves virtmem only to handle
      dirtying and writeout.
      
      -> Introduce "mmap overlay" mode into virtmem to handle WCFS-like
      BigFile backends - that can provide read-only base layer suitable for
      mmapping.
      
      This patch is organized as follows:
      
      - fileh_open gains a flags argument to indicate which mode to use for
        the opened fileh. BigFileH gains a .mmap_overlay bitfield correspondingly.
        (virtmem.h)
      
      - struct bigfile_ops is extended with 3 optional methods that a BigFile
        backend might provide to support mmap-overlay mode:
      
        * mmap_setup_read,
        * remmap_blk_read, and
        * munmap
      
        (see file.h changes for documentation of this new interface)
      
      - if opened with the MMAP_OVERLAY flag, virtmem uses those methods to
        organize VMA views backed by the read-only base mmap layer and writeout
        for such VMAs (virtmem.c)
      
      - a test is added to exercise MMAP_OVERLAY virtmem mode (test_virtmem.c)
      
      - everything else, including bigfile.py, is switched to use
        DONT_MMAP_OVERLAY unconditionally for now.
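
As a rough illustration of the list above, the sketch below models a backend vtable with three optional mmap-overlay methods and a check performed at open time. The struct layout, the `demo_` names, the simplified signatures and the flag values are assumptions made for this example; the real interface is documented in file.h and virtmem.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct demo_file demo_file;

/* Hypothetical, simplified backend vtable (not the real bigfile_ops). */
typedef struct {
    int (*loadblk) (demo_file *file, size_t blk, void *buf);
    int (*storeblk)(demo_file *file, size_t blk, const void *buf);
    /* optional, needed only for mmap-overlay mode */
    int (*mmap_setup_read)(demo_file *file, size_t blk, size_t blklen);
    int (*remmap_blk_read)(demo_file *file, size_t blk);
    int (*munmap)         (demo_file *file);
} demo_ops;

enum { DEMO_DONT_MMAP_OVERLAY = 0, DEMO_MMAP_OVERLAY = 1 };

static bool demo_supports_overlay(const demo_ops *ops) {
    return ops->mmap_setup_read && ops->remmap_blk_read && ops->munmap;
}

/* Returns 0 if the backend can serve the requested mode, -1 otherwise. */
static int demo_fileh_open_check(const demo_ops *ops, int flags) {
    assert(ops != NULL);
    if (flags == DEMO_MMAP_OVERLAY)
        return demo_supports_overlay(ops) ? 0 : -1;
    return (ops->loadblk && ops->storeblk) ? 0 : -1;
}

/* Two toy backends used to exercise the check. */
static int demo_load (demo_file *f, size_t b, void *buf)       { (void)f; (void)b; (void)buf; return 0; }
static int demo_store(demo_file *f, size_t b, const void *buf) { (void)f; (void)b; (void)buf; return 0; }
static int demo_setup(demo_file *f, size_t b, size_t n)        { (void)f; (void)b; (void)n;   return 0; }
static int demo_remap(demo_file *f, size_t b)                  { (void)f; (void)b;            return 0; }
static int demo_unmap(demo_file *f)                            { (void)f;                     return 0; }

static const demo_ops demo_classic_ops = { demo_load, demo_store, NULL, NULL, NULL };
static const demo_ops demo_overlay_ops = { demo_load, demo_store,
                                           demo_setup, demo_remap, demo_unmap };
```

Keeping the overlay methods optional lets a classic loadblk/storeblk backend stay unchanged while a wcfs-like backend opts in.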
      
      In internal comments inside virtmem the new mode is interchangeably called
      "mmap overlay" and "wcfs", even though wcfs is not yet hooked up to be used
      via mmap-overlaying.
      
      Some preliminary history:
      
      kirr/wendelin.core@fb6932a2    X Split PAGE_LOADED -> PAGE_LOADED, PAGE_LOADED_FOR_WRITE
      kirr/wendelin.core@4a20a573    X Settled on what should happen after writeout for wcfs case
      kirr/wendelin.core@f084ff9b    X Transition to all VMA under 1 fileh to be either all based on wcfs or all based on !wcfs
  2. 15 Apr, 2020 1 commit
    • bigfile/file.h: Cosmetics · 927458f6
      Kirill Smelkov authored
      - Provide brief top-level overview + refresh loadblk/storeblk/release comments.
      - Add `typedef struct bigfile_ops bigfile_ops` that we usually add for all structs.
  3. 04 Dec, 2019 2 commits
  4. 11 Jul, 2019 1 commit
  5. 09 Jul, 2019 2 commits
  6. 08 Jul, 2019 1 commit
    • bigfile: RAM must be explicitly free'ed after close · f688a31d
      Kirill Smelkov authored
      Else on-heap allocated RAM object is leaked. Fixes e.g. the following
      error on ASAN:
      
      	Direct leak of 56 byte(s) in 1 object(s) allocated from:
      	    #0 0x7fc9ef390518 in calloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xe9518)
      	    #1 0x555ca792f309 in zalloc include/wendelin/utils.h:67
      	    #2 0x555ca7935f9a in ram_limited_new bigfile/tests/../../t/t_utils.c:35
      	    #3 0x555ca793a0ba in test_file_access_synthetic bigfile/tests/test_virtmem.c:292
      	    #4 0x555ca7967bc4 in main bigfile/tests/test_virtmem.c:1121
      	    #5 0x7fc9ef0e909a in __libc_start_main ../csu/libc-start.c:308
  7. 24 Oct, 2017 1 commit
    • Relicense to GPLv3+ with wide exception for all Free Software / Open Source projects + Business options. · f11386a4
      Kirill Smelkov authored
      
      Nexedi stack is licensed under Free Software licenses with various exceptions
      that cover three business cases:
      
      - Free Software
      - Proprietary Software
      - Rebranding
      
      As long as one intends to develop Free Software based on Nexedi stack, no
      license cost is involved. Developing proprietary software based on Nexedi stack
      may require a proprietary exception license. Rebranding Nexedi stack is
      prohibited unless rebranding license is acquired.
      
      Through this licensing approach, Nexedi expects to encourage Free Software
      development without restrictions and at the same time create a framework for
      proprietary software to contribute to the long term sustainability of the
      Nexedi stack.
      
      Please see https://www.nexedi.com/licensing for details, rationale and options.
  8. 10 Jan, 2017 2 commits
    • bigfile/virtmem: Do storeblk() with virtmem lock released · fb4bfb32
      Kirill Smelkov authored
      Like with loadblk (see f49c11a3 "bigfile/virtmem: Do loadblk() with
      virtmem lock released" for the reference) storeblk() calls are
      potentially slow, and external code that serves the call can take other
      locks in addition to the virtmem lock taken by the virtmem subsystem.
      If those "other locks" are also taken before external code calls e.g.
      fileh_invalidate_page() in a different codepath, a deadlock can happen:
      
            T1                  T2
      
            commit              invalidation-from-server received
            V -> storeblk
                                Z   <- ClientStorage.invalidateTransaction()
            Z -> zeo.store
                                V   <- fileh_invalidate_page (of unrelated page)
      
      The solution to avoid deadlock, like for loadblk case, is to call storeblk()
      with virtmem lock released.
      
      However, unlike loadblk which can be invoked at any time, storeblk is
      invoked at commit time only, so for the storeblk case we use different rules
      to make sure virtmem stays consistent after the virtmem lock is retaken:
      
      1. We disallow several parallel writeouts for one fileh. This way dirty
         pages handling logic can not mess up. This restriction is also
         consistent with ZODB 2 phase commit protocol where for a transaction
         commit logic is invoked/handled from only 1 thread.
      
      2. For the same reason we disallow discard while writeout is in
         progress. This is also consistent with ZODB 2 phase commit protocol
         where txn.tpc_abort() is not expected to be called at the same time
         with txn.commit().
      
      3. While writeout is in progress, for that fileh we disallow pages
         modifications and pages invalidations - because both operations would
         change at least fileh dirty pages list which is iterated over by
         writeout code with releasing/retaking the virtmem lock. By
         disallowing them we make sure fileh dirty pages list stays constant
         during whole fileh writeout.
      
         These restrictions are also consistent with ZODB commit semantics:
      
         - while an object is being stored into ZODB it is not expected it
           will be further modified or explicitly invalidated by client via
           ._p_invalidate()
      
         - server initiated invalidations come into effect only at transaction
           boundaries - when new transaction is started, not during commit time.
      
      Also, since storeblk is now called with the virtmem lock released, we can no
      longer use a present page mapping in some vma directly as the buffer to
      store, because while the virtmem lock is released those mappings can go away.
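
A minimal C sketch of these rules, under stated assumptions: a pthread mutex stands in for the virtmem lock, page modifications during writeout are refused with -1 (the commit only says they are disallowed, not how), and all `demo_` names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_PAGESIZE 16
#define DEMO_NPAGE    4

typedef struct {
    pthread_mutex_t lock;                 /* stands in for the virtmem lock */
    bool writeout_inprogress;
    bool dirty[DEMO_NPAGE];
    char data[DEMO_NPAGE][DEMO_PAGESIZE];
} demo_fileh;

typedef int (*demo_storeblk_fn)(int blk, const char *buf);

static void demo_fileh_init(demo_fileh *f) {
    memset(f, 0, sizeof *f);
    pthread_mutex_init(&f->lock, NULL);
}

/* Page modification: refused while a writeout is running (rule 3). */
static int demo_mark_dirty(demo_fileh *f, int blk) {
    assert(0 <= blk && blk < DEMO_NPAGE);
    int err = 0;
    pthread_mutex_lock(&f->lock);
    if (f->writeout_inprogress)
        err = -1;
    else
        f->dirty[blk] = true;
    pthread_mutex_unlock(&f->lock);
    return err;
}

static int demo_writeout(demo_fileh *f, demo_storeblk_fn storeblk) {
    int err = 0;
    pthread_mutex_lock(&f->lock);
    if (f->writeout_inprogress) {         /* rule 1: no parallel writeouts */
        pthread_mutex_unlock(&f->lock);
        return -1;
    }
    f->writeout_inprogress = true;
    for (int blk = 0; blk < DEMO_NPAGE && err == 0; blk++) {
        if (!f->dirty[blk])
            continue;
        char buf[DEMO_PAGESIZE];
        memcpy(buf, f->data[blk], sizeof buf);  /* private copy of the page */
        pthread_mutex_unlock(&f->lock);         /* storeblk runs unlocked   */
        err = storeblk(blk, buf);
        pthread_mutex_lock(&f->lock);
        if (err == 0)
            f->dirty[blk] = false;
    }
    f->writeout_inprogress = false;
    pthread_mutex_unlock(&f->lock);
    return err;
}

/* Toy storeblk: counts calls and tries to dirty page 0 while writeout runs. */
static demo_fileh *demo_current;
static int demo_nstored, demo_nrefused;
static int demo_count_storeblk(int blk, const char *buf) {
    (void)blk; (void)buf;
    demo_nstored++;
    if (demo_current && demo_mark_dirty(demo_current, 0) != 0)
        demo_nrefused++;
    return 0;
}
```

The toy callback takes the lock again from inside storeblk without deadlocking, which only works because the lock is released around the call.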
      
      Fixes: #6
    • bigfile/virtmem: Maintain dirty pages list for a fileh · 8bb7f2f2
      Kirill Smelkov authored
      This allows writeout code not to scan whole pagemap to find dirty pages
      to write out, which should be faster.
      
      But more importantly, iterating the whole pagemap on writeout would become
      unsafe once, in an upcoming patch, storeblk() is called with virt_lock
      released: the pagemap could then be modified, e.g. due to processing
      other read accesses.
      
      So maintain fileh->dirty_pages list and use it when we need to go
      through dirtied pages.
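
One way to picture such a list is an intrusive doubly-linked list threaded through the page structures, as in the sketch below. The names (`demo_list`, `in_dirty`, `demo_writeout`) are hypothetical and only illustrate the idea; the real structures are in virtmem.h.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list; an empty head points to itself. */
typedef struct demo_list { struct demo_list *next, *prev; } demo_list;

typedef struct {
    int       pgoffset;
    demo_list in_dirty;     /* linkage into the fileh's dirty list */
} demo_page;

typedef struct {
    demo_list dirty_pages;  /* head of the list of dirtied pages */
} demo_fileh;

static void demo_list_init(demo_list *h) { h->next = h->prev = h; }

static void demo_list_add_tail(demo_list *n, demo_list *h) {
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

static void demo_list_del(demo_list *n) {
    n->prev->next = n->next;
    n->next->prev = n->prev;
    demo_list_init(n);
}

/* Recover the page from its embedded list node. */
#define demo_page_of(node) \
    ((demo_page *)((char *)(node) - offsetof(demo_page, in_dirty)))

/* Writeout visits only dirtied pages; returns how many it processed. */
static int demo_writeout(demo_fileh *f) {
    int n = 0;
    while (f->dirty_pages.next != &f->dirty_pages) {
        demo_page *p = demo_page_of(f->dirty_pages.next);
        (void)p->pgoffset;              /* storeblk for p would go here */
        demo_list_del(&p->in_dirty);
        n++;
    }
    return n;
}
```

The cost of writeout is then proportional to the number of dirty pages rather than to the size of the whole pagemap.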
      
      Updates: #6
  9. 24 Jun, 2016 1 commit
    • bigfile/pagemap: Fix non-leaf page iteration · ee9bcd00
      Kirill Smelkov authored
      Since the beginning of pagemap (45af76e6 "bigfile/pagemap: specialized
      {} uint64 -> void * mapping") we had a bug sitting in
      __pagemap_for_each_leaftab() (non-leaf iterating logic behind
      pagemap_for_each):
      
      After an entry to stack-down was found, we did not update tailv[l]
      accordingly. Thus, if there are non-adjacent entries, an entry could e.g.
      be emitted many times:
      
           l 3  __down 0x7f79da1ee000
           tailv[4]: 0x7f79da1ee000
            -> tailv[4] 0x7f79da1ee000  __down 0x7f79da1ed000
      
           l 4  __down 0x7f79da1ed000
           tailv[5]: 0x7f79da1ed000
           h 5  l 5  leaftab: 0x7f79da1ed000      <--
            lvl 5  idx 169  page 0x55aa
          ok 9 - pagemap_for_each(0) == 21930
      
           l 5  __down (nil)
           tailv[4]: 0x7f79da1ee008
            -> tailv[4] 0x7f79da1ee008  __down 0x7f79da1ed000
      
           l 4  __down 0x7f79da1ed000
           tailv[5]: 0x7f79da1ed000
           h 5  l 5  leaftab: 0x7f79da1ed000      <--
            lvl 5  idx 169  page 0x55aa
          not ok 10 - pagemap_for_each(1) == 140724106500272
      
      And entries emitted many times are not only incorrect, but can also lead
      to unhandled segmentation faults in e.g. fileh_close():
      
          https://lab.nexedi.com/nexedi/wendelin.core/blob/v0.6-1-gb0b2c52/bigfile/virtmem.c#L179
      
          /* drop all pages (dirty or not) associated with this fileh */
          pagemap_for_each(page, &fileh->pagemap) {
              /* it's an error to close fileh to mapping of which an access is
               * currently being done in another thread */
              BUG_ON(page->state == PAGE_LOADING);
              page_drop_memory(page);
              list_del(&page->lru);                           <-- HERE
              bzero(page, sizeof(*page)); /* just in case */
              free(page);
          }
      
      ( because after first bzero of a page, the page is all 0 bytes including
        page->lru{.next,.prev} so on the second time when the same page is
        emitted by pagemap_for_each, list_del(&page->lru) will try to set
        page->lru.next = ... which will segfault. )
      
      So fix it by properly updating tailv[l] while we scan/iterate current level.
      
      NOTE
      
      This applies only to non-leaf pagemap levels, as the leaf level is scanned
      with a separate loop in pagemap_for_each. That's probably why we did not
      notice this earlier: up until now our usual workloads were to change
      data in adjacent batches, and that means adjacent pages.
      
      Though today @Tyagov was playing with wendelin.core in some other way and
      it uncovered the bug.
  10. 15 Dec, 2015 1 commit
    • bigfile/virtmem: Do loadblk() with virtmem lock released · f49c11a3
      Kirill Smelkov authored
      loadblk() calls are potentially slow, and external code that serves the call
      can take other locks in addition to the virtmem lock taken by the virtmem
      subsystem. If those "other locks" are also taken before external code calls
      e.g. fileh_invalidate_page() in a different codepath, a deadlock can happen, e.g.
      
            T1                  T2
      
            page-access         invalidation-from-server received
            V -> loadblk
                                Z   <- ClientStorage.invalidateTransaction()
            Z -> zeo.load
                                V   <- fileh_invalidate_page
      
      The solution to avoid deadlock is to call loadblk() with virtmem lock released
      and upon loadblk() completion recheck virtmem data structures carefully.
      
      To make that happen:
      
      - a new page state is introduced:
      
          PAGE_LOADING                (file content loading is  in progress)
      
      - virtmem releases virt_lock before calling loadblk() when serving pagefault
      
      - because loading is now done with virtmem lock released, now:
      
      1. After loading completes we need to recheck fileh/vma data structures
      
         The recheck is done in full - vma_on_pagefault() just asks its driver (see
         VM_RETRY and VM_HANDLED codes) to retry handling the fault completely. This
         should work as the freshly loaded page was just inserted into fileh->pagemap
         and should be found there in the cache on next lookup.
      
         On the other hand this also works correctly, if there was concurrent change
         - e.g. vma was unmapped while we were loading the data - in that case the
         fault will be also processed correctly - but loaded data will stay in
         fileh->pagemap (and if not used will be evicted as not-needed
         eventually by RAM reclaim).
      
      2. A similar retrying mechanism is used for cases when two threads
         concurrently access the same page and would both try to load the
         corresponding block: only one thread issues the actual loadblk() and the
         other waits for the load to complete with polling and VM_RETRY.
      
      3. To correctly invalidate loading-in-progress pages another new page state
         is introduced:
      
          PAGE_LOADING_INVALIDATED    (file content loading was in progress
                                       while request to invalidate the page came in)
      
         which fileh_invalidate_page() uses to propagate invalidation message to
         loadblk() caller.
      
      4. Blocks loading can now happen in parallel with other block loading and
         other virtmem operations - e.g. invalidation. For such cases tests are added
         to test_thread.py
      
      5. The virtmem lock now becomes just a regular lock, instead of being
         recursive as before.
      
         A recursive virtmem lock was needed for cases when code under
         loadblk() could trigger other virtmem calls, e.g. due to GC and calling
         another VMA dtor that would want to lock virtmem while the virtmem lock
         was already held.
      
         This is no longer needed.
      
      6. To catch double faults we can no longer use just one static variable,
         in_on_pagefault. That variable thus becomes thread-local.
      
      7. Old test in test_thread to "test that access vs access don't overlap" no
         longer holds true - and is thus removed.
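
The load-with-lock-released protocol from points 1-3 can be sketched as a small state machine. This is a single-page model with a pthread mutex standing in for the virtmem lock; the `DEMO_` states mirror the names used above, but the functions and their signatures are assumptions made for this example.

```c
#include <assert.h>
#include <pthread.h>

typedef enum {
    DEMO_PAGE_EMPTY,
    DEMO_PAGE_LOADING,              /* loading is in progress            */
    DEMO_PAGE_LOADING_INVALIDATED,  /* invalidated while loading was on  */
    DEMO_PAGE_LOADED
} demo_state;

typedef enum { DEMO_VM_HANDLED, DEMO_VM_RETRY } demo_result;

typedef struct demo_page {
    pthread_mutex_t lock;           /* stands in for the virtmem lock */
    demo_state state;
    int data;
} demo_page;

typedef int (*demo_loadblk_fn)(demo_page *page);

static void demo_page_init(demo_page *p) {
    pthread_mutex_init(&p->lock, NULL);
    p->state = DEMO_PAGE_EMPTY;
    p->data = 0;
}

static void demo_invalidate(demo_page *p) {
    pthread_mutex_lock(&p->lock);
    if (p->state == DEMO_PAGE_LOADING)
        p->state = DEMO_PAGE_LOADING_INVALIDATED;   /* tell the loader */
    else if (p->state == DEMO_PAGE_LOADED)
        p->state = DEMO_PAGE_EMPTY;
    pthread_mutex_unlock(&p->lock);
}

static demo_result demo_on_pagefault(demo_page *p, demo_loadblk_fn loadblk) {
    demo_result res = DEMO_VM_RETRY;
    pthread_mutex_lock(&p->lock);
    if (p->state == DEMO_PAGE_LOADED) {
        res = DEMO_VM_HANDLED;                  /* found in cache */
    } else if (p->state == DEMO_PAGE_EMPTY) {
        p->state = DEMO_PAGE_LOADING;
        pthread_mutex_unlock(&p->lock);         /* loadblk runs unlocked */
        int value = loadblk(p);
        pthread_mutex_lock(&p->lock);
        if (p->state == DEMO_PAGE_LOADING_INVALIDATED) {
            p->state = DEMO_PAGE_EMPTY;         /* loaded data is stale */
        } else {
            assert(p->state == DEMO_PAGE_LOADING);
            p->data = value;
            p->state = DEMO_PAGE_LOADED;
        }
    }
    /* LOADING states: another thread is loading, the caller polls via retry */
    pthread_mutex_unlock(&p->lock);
    return res;
}

/* Toy loaders: a plain one, and one that is invalidated mid-load. */
static int demo_load_plain(demo_page *p)  { (void)p; return 42; }
static int demo_load_racing(demo_page *p) { demo_invalidate(p); return 42; }
```

A fault that triggers a load always reports a retry; the retried fault then finds the page loaded, or starts over if the page was invalidated in the meantime.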
      
      /cc @Tyagov, @klaus
  11. 17 Aug, 2015 1 commit
    • bigfile/virtmem: Client API to invalidate a fileh page · cb779c7b
      Kirill Smelkov authored
      FileH is a handle representing snapshot of a file. If, for a pgoffset,
      fileh already has loaded page, but we know the content of the file has
      changed externally after loading has been done, we need to propagate to
      fileh that such-and-such page should be invalidated (and reloaded on
      next access).
      
      This patch introduces
      
          fileh_invalidate_page(fileh, pgoffset)
      
      to do just that.
      
      In the next patch we'll use this facility to propagate invalidations of
      ZBlk ZODB objects to virtmem subsystem.
      
      NOTE
      
      Since invalidation removes "dirtiness" from a page state, several
      subsequent invalidations can make a fileh completely non-dirty
      (invalidating all dirty pages). Previously fileh->dirty was just one
      bit, so we needed to improve how we track dirtiness.
      
      One way would be to have a dirty list for fileh pages and operate on
      that. This has the advantage of also optimizing dirty-page processing like
      fileh_dirty_writeout(), where we currently scan through all fileh pages
      just to write only the PAGE_DIRTY ones.
      
      Another simpler way is to make fileh->dirty a counter and maintain that.
      
      Since we are going to move virtmem subsystem back into the kernel, here,
      a simpler less-intrusive approach is used.
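
The counter approach can be sketched as follows. This is a toy model with a fixed number of pages; `demo_fileh` and the function names are hypothetical and do not reproduce the actual virtmem structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NPAGE 8

typedef enum { DEMO_PAGE_EMPTY, DEMO_PAGE_LOADED, DEMO_PAGE_DIRTY } demo_state;

typedef struct {
    demo_state page[DEMO_NPAGE];
    unsigned   dirty;               /* number of pages in DEMO_PAGE_DIRTY */
} demo_fileh;

static void demo_write_access(demo_fileh *f, size_t pgoffset) {
    assert(pgoffset < DEMO_NPAGE);
    if (f->page[pgoffset] != DEMO_PAGE_DIRTY) {
        f->page[pgoffset] = DEMO_PAGE_DIRTY;
        f->dirty++;
    }
}

/* Invalidation drops the page and, if it was dirty, its dirtiness too. */
static void demo_invalidate_page(demo_fileh *f, size_t pgoffset) {
    assert(pgoffset < DEMO_NPAGE);
    if (f->page[pgoffset] == DEMO_PAGE_DIRTY)
        f->dirty--;
    f->page[pgoffset] = DEMO_PAGE_EMPTY;
}

static bool demo_isdirty(const demo_fileh *f) { return f->dirty != 0; }
```

With a single bit there would be no way to tell when the last dirty page has been invalidated; the counter reaches zero exactly at that point.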
  12. 06 Aug, 2015 1 commit
    • bigfile/virtmem: Big Virtmem lock · d53271b9
      Kirill Smelkov authored
      At present, several threads running concurrently can corrupt internal virtmem
      data structures (e.g. ram->lru_list, fileh->pagemap, etc).
      
      This can happen even if we have zope instances with only 1 worker thread,
      because there are other "system" threads, and Python garbage collection
      can trigger in any thread. So if a virtmem object, e.g. a VMA or FileH, was
      sitting in the GC queue to be collected, its collection, and thus
      e.g. vma_unmap() and fileh_close(), will be called from a
      different-from-worker thread.
      
      Because of that virtmem just has to be aware of threads not to allow
      internal datastructure corruption.
      
      On the other hand, the idea of introducing userspace virtual memory
      manager turned out to be not so good from performance and complexity
      point of view, and thus the plan is to try to move it back into the
      kernel. This way it does not make sense to do a well-optimised locking
      implementation for userspace version.
      
      So we do just a simple single "protect-all" big lock for virtmem.
      
      Of a particular note is interaction with Python's GIL - any long-lived
      lock has to be taken with GIL released, because else it can deadlock:
      
          t1  t2
      
          G
          V   G
         !G   V
          G
      
      so we introduce helpers to make sure the GIL is not taken, and to retake
      it back if we were holding it initially.
      
      Those helpers (py_gil_ensure_unlocked / py_gil_retake_if_waslocked) are
      symmetrical opposites to what Python provides to make sure the GIL is
      locked (via PyGILState_Ensure / PyGILState_Release).
      
      Otherwise, the patch is a more-or-less straightforward application of the
      one-big-lock-to-protect-everything idea.
  13. 03 Apr, 2015 6 commits
    • bigfile/virtmem: Userspace Virtual Memory Manager · 9a293c2d
      Kirill Smelkov authored
      Does similar things to what the kernel does: users can mmap file parts into
      address space and access them read/write. The manager will be
      invoked by hardware/OS kernel for cases when there is no page loaded for
      read, or when a previously read-only page is being written to.
      
      In addition to the features provided by the kernel, it supports
      storing back changes in a transactional way (see fileh_dirty_writeout()) and
      potentially using huge pages for mappings (though this is currently TODO)
    • bigfile/file.h: C interface for defining BigFile backends · 9065e2b9
      Kirill Smelkov authored
      Users can inherit from BigFile and provide custom ->loadblk() and
      ->storeblk() to load/store file blocks from a database or some other
      storage. The system then could use such files to memory map them into
      user address space (see next patch).
    • bigfile: RAM subsystem · 8c935a5f
      Kirill Smelkov authored
      This thing allows getting aliasable RAM from the OS kernel and managing it.
      Currently we get memory from a tmpfs mount, and hugetlbfs should also
      work, but is TODO because hugetlbfs in the kernel needs to be improved.
      
      We need aliasing because we'll need to be able to memory map the same
      page into several places in address space, e.g. for taking two
      overlapping slices of the same array at different times.
      
      Comes with test programs that show that aliasing does not work for
      anonymous memory.
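
To illustrate what "aliasable" means here, the sketch below maps the same file page at two addresses with MAP_SHARED and checks that a write through one mapping is visible through the other. It is a standalone POSIX example, not the RAM subsystem code: it uses `tmpfile()` for convenience, which may or may not live on tmpfs, and `demo_file_pages_alias` is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Maps the first page of a temporary file twice and checks that a write
 * through one mapping is visible through the other.
 * Returns 1 if the mappings alias, 0 if not, -1 on setup error. */
static int demo_file_pages_alias(void) {
    long pagesize = sysconf(_SC_PAGESIZE);
    FILE *fp = tmpfile();
    if (fp == NULL)
        return -1;
    int fd = fileno(fp);
    if (ftruncate(fd, pagesize) != 0) {
        fclose(fp);
        return -1;
    }

    char *a = mmap(NULL, (size_t)pagesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, (size_t)pagesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    int result = -1;
    if (a != MAP_FAILED && b != MAP_FAILED) {
        strcpy(a, "hello");                     /* write via first mapping */
        result = (a != b && strcmp(b, "hello") == 0);
    }

    if (a != MAP_FAILED) munmap(a, (size_t)pagesize);
    if (b != MAP_FAILED) munmap(b, (size_t)pagesize);
    fclose(fp);
    return result;
}
```

Both mappings are backed by the same file page, so they share content while living at different virtual addresses; a private anonymous mapping has no file descriptor that a second mmap call could refer to.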
    • bigfile: Stub for virtmem · 77d61533
      Kirill Smelkov authored
      This will be the core of virtual memory subsystem. For now we just
      define a structure to describe pages of memory and add utility to
      allocate address space from OS.
    • bigfile/pagemap: specialized {} uint64 -> void * mapping · 45af76e6
      Kirill Smelkov authored
      For BigFiles we'll need to maintain a `{} offset-in-file -> void *` mapping. A
      hash or a binary tree could be used there, but since we know files are
      most of the time accessed sequentially and locally in page batches, we
      can also organize the mapping in batches of keys.
      
      Specifically, offset bits are divided into parts such that every part
      addresses 1 entry in a table that is one hardware page in size. To get to
      the actual value, the system looks up the first table by the first part of
      the offset, then from the first table and the next part of the offset the
      second table, etc.
      
      To clients this looks like a dictionary with get/set/del & clear methods,
      but lookups are O(1) time always, and in contrast to hashes values are
      stored with locality (= adjacent lookups almost always access the same tables).
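
The lookup scheme can be sketched as a small multi-level table. The sketch assumes 4096-byte pages and 8-byte pointers, so each table has 512 slots indexed by 9 bits of the key, and it uses only 3 levels (27-bit keys) to stay short. All `demo_` names are hypothetical, and the real pagemap API and level count may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_BITS   9                   /* 4096 / 8 = 512 = 2^9 slots per table */
#define DEMO_SLOTS  (1u << DEMO_BITS)
#define DEMO_LEVELS 3                   /* 27-bit keys in this sketch */
#define DEMO_KEYMAX ((uint64_t)1 << (DEMO_BITS * DEMO_LEVELS))

typedef struct { void *slot[DEMO_SLOTS]; } demo_table;
typedef struct { demo_table *root; } demo_pagemap;

/* Part of the key that indexes the table at the given level (0 = root). */
static unsigned demo_index(uint64_t key, int level) {
    int shift = DEMO_BITS * (DEMO_LEVELS - 1 - level);
    return (unsigned)((key >> shift) & (DEMO_SLOTS - 1));
}

/* Stores value under key, allocating intermediate tables on demand.
 * Returns 0 on success, -1 if memory could not be allocated. */
static int demo_set(demo_pagemap *m, uint64_t key, void *value) {
    assert(key < DEMO_KEYMAX);
    if (m->root == NULL && (m->root = calloc(1, sizeof *m->root)) == NULL)
        return -1;
    demo_table *t = m->root;
    for (int level = 0; level < DEMO_LEVELS - 1; level++) {
        void **slot = &t->slot[demo_index(key, level)];
        if (*slot == NULL && (*slot = calloc(1, sizeof(demo_table))) == NULL)
            return -1;
        t = *slot;
    }
    t->slot[demo_index(key, DEMO_LEVELS - 1)] = value;
    return 0;
}

/* Fixed number of table hops, independent of how many keys are stored. */
static void *demo_get(const demo_pagemap *m, uint64_t key) {
    assert(key < DEMO_KEYMAX);
    const demo_table *t = m->root;
    for (int level = 0; t != NULL && level < DEMO_LEVELS - 1; level++)
        t = t->slot[demo_index(key, level)];
    return t ? t->slot[demo_index(key, DEMO_LEVELS - 1)] : NULL;
}
```

Adjacent keys differ only in their lowest part, so they land in the same leaf table, which is where the locality comes from. The sketch omits freeing tables.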
    • bigfile/types.h: Type's we'll use · 8114ad6c
      Kirill Smelkov authored