    Port/move channels to C/C++/Pyx · 3b241983
    Kirill Smelkov authored
    - Move the channels implementation into C++ inside libgolang. The
      code and logic are based on the previous Python-level channels
      implementation, but the new code is pure C++ and depends on neither
      Python nor the GIL, so it works without the GIL whenever the
      libgolang runtime works without the GIL(*).
    
      (*) for example "thread" runtime works without GIL, while "gevent" runtime
          acquires GIL on every semaphore acquire.
    
      The new channels implementation is located in δ(libgolang.cpp).
    
    - Provide a low-level C channels API to the implementation. The
      low-level C API is inspired by Libtask[1] and Plan9/Libthread[2].
    
      [1] Libtask: a Coroutine Library for C and Unix. https://swtch.com/libtask.
      [2] http://9p.io/magic/man2html/2/thread.
    
    - Provide a high-level C++ channels API that offers type safety and
      automatic channel lifetime management.
    
      An overview of the C and C++ APIs is in δ(libgolang.h).
    
    - Expose the C++ channels API at Pyx level as a Cython/nogil API so that
      Cython programs can use channels with ease, without having to care
      about lifetime management and low-level details.
    
      An overview of the Cython/nogil channels API is in δ(README.rst) and
      δ(_golang.pxd).
    
    - Turn Python channels into a tiny wrapper around chan<PyObject>.
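
    The layering described above (an untyped C-style channel underneath, a
    typed C++ handle with automatic lifetime management on top) can be
    sketched roughly as follows. This is only an illustration: raw_chan,
    raw_makechan, raw_send, raw_recv, raw_incref, raw_decref and Chan<T> are
    hypothetical names, not the API declared in libgolang.h, and the sketch
    assumes a buffered channel with trivially copyable elements and uses
    std::mutex instead of the libgolang runtime.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstring>
#include <deque>
#include <mutex>
#include <type_traits>
#include <vector>

// ---- hypothetical low-level, untyped layer (names are NOT libgolang's) ----
struct raw_chan {
    unsigned elemsize = 0;                   // size of one element in bytes
    unsigned cap = 0;                        // buffer capacity
    std::deque<std::vector<char>> q;         // queued elements, kept on the heap
    std::mutex mu;
    std::condition_variable not_full, not_empty;
    std::atomic<int> refcnt{1};              // the creator holds the first reference
};

raw_chan *raw_makechan(unsigned elemsize, unsigned cap) {
    raw_chan *ch = new raw_chan;
    ch->elemsize = elemsize;
    ch->cap = cap ? cap : 1;   // assumption: unbuffered rendezvous is out of scope
    return ch;
}
void raw_incref(raw_chan *ch) { ch->refcnt.fetch_add(1); }
void raw_decref(raw_chan *ch) { if (ch->refcnt.fetch_sub(1) == 1) delete ch; }

void raw_send(raw_chan *ch, const void *ptx) {
    std::unique_lock<std::mutex> l(ch->mu);
    ch->not_full.wait(l, [&] { return ch->q.size() < ch->cap; });
    const char *p = static_cast<const char *>(ptx);
    ch->q.emplace_back(p, p + ch->elemsize);   // copy the element bytes into the queue
    ch->not_empty.notify_one();
}
void raw_recv(raw_chan *ch, void *prx) {
    std::unique_lock<std::mutex> l(ch->mu);
    ch->not_empty.wait(l, [&] { return !ch->q.empty(); });
    std::memcpy(prx, ch->q.front().data(), ch->elemsize);
    ch->q.pop_front();
    ch->not_full.notify_one();
}

// ---- hypothetical high-level typed handle ----
template <typename T>
class Chan {
    static_assert(std::is_trivially_copyable<T>::value,
                  "elements are moved through the channel as raw bytes");
    raw_chan *ch_;
public:
    explicit Chan(unsigned cap) : ch_(raw_makechan(sizeof(T), cap)) {}
    Chan(const Chan &o) : ch_(o.ch_) { raw_incref(ch_); }
    Chan &operator=(const Chan &o) {
        raw_incref(o.ch_);     // incref first so self-assignment stays safe
        raw_decref(ch_);
        ch_ = o.ch_;
        return *this;
    }
    ~Chan() { raw_decref(ch_); }               // last handle frees the channel

    void send(const T &v) { raw_send(ch_, &v); }   // only T is accepted
    T recv() { T v; raw_recv(ch_, &v); return v; }
};

// The channel outlives the handle that created it as long as a copy exists.
int lifetime_demo() {
    Chan<int> *a = new Chan<int>(4);
    Chan<int> b = *a;          // second handle, refcount becomes 2
    a->send(40);
    a->send(2);
    delete a;                  // refcount drops to 1, channel stays alive
    return b.recv() + b.recv();
}
```

    With this shape the caller never frees a channel by hand, and sending a
    value of the wrong type is rejected at compile time instead of being
    copied as mismatched bytes.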
    
    Implementation note:
    
    - The gevent case needs special care because greenlet, which gevent
      uses, swaps a coroutine's stack from the C stack to the heap when the
      coroutine parks, and replaces that space on the C stack with the stack
      of the activated coroutine copied back from the heap. This way, if an
      object on g's stack is accessed while g is parked, the access would
      hit memory of another g's stack.
    
      The channels implementation explicitly handles this issue so that a
      stack -> * channel send and a * -> stack channel receive work
      correctly.
    
      It should be noted that the greenlet approach, which it inherits from
      Stackless, is not only a bit tricky, but also comes with overhead
      (stack <-> heap copy), and prevents a coroutine from migrating from
      one OS thread to another, as that would change the addresses of
      on-stack things for that coroutine.
    
      As the latter property prevents using multiple CPUs even if the
      program and runtime are prepared to work without the GIL, it would be
      more logical to change gevent/greenlet to use a separate stack for
      each coroutine. That would remove the stack <-> heap copy and the need
      for special care in the channels implementation for stack -> stack
      sends. Such an approach should be possible to implement with e.g.
      swapcontext or a similar mechanism, and a proof of concept of such
      work wrapped into a greenlet-compatible API exists[3]. It would be
      good to explore such an approach in the Pygolang context at some
      point.
    
      [3] https://github.com/python-greenlet/greenlet/issues/113#issuecomment-264529838 and below
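
    To illustrate the separate-stack idea, here is a minimal sketch using
    POSIX ucontext (getcontext/makecontext/swapcontext). It is not the
    proof of concept from [3] and not Pygolang code: coroutine_body,
    run_demo and the globals are hypothetical names, and a Linux/glibc-like
    platform is assumed. The point it shows is that a pointer to an object
    on a parked coroutine's stack stays valid, because nothing is copied
    over that memory while the coroutine is parked.

```cpp
#include <cassert>
#include <cstdint>
#include <ucontext.h>
#include <vector>

// Hypothetical illustration: one coroutine running on its own private stack.
static ucontext_t main_ctx, co_ctx;
static int *published = nullptr;   // address of a variable on the coroutine's stack
static int result = 0;

static void coroutine_body() {
    int on_stack = 1;                  // lives on the coroutine's private stack
    published = &on_stack;             // hand the address out, like a receive into stack
    swapcontext(&co_ctx, &main_ctx);   // park: switch back to the main context
    result = on_stack + 100;           // sees the value written while parked
    published = nullptr;
    // returning from here continues at uc_link, i.e. main_ctx
}

int run_demo() {
    std::vector<char> stack(256 * 1024);       // heap memory used as the private stack

    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = stack.data();
    co_ctx.uc_stack.ss_size = stack.size();
    co_ctx.uc_link = &main_ctx;                // where to go when the body returns
    makecontext(&co_ctx, coroutine_body, 0);

    swapcontext(&main_ctx, &co_ctx);           // run the coroutine until it parks

    // The coroutine is parked, yet its stack was not moved or overwritten,
    // so the published address still points inside its private stack.
    std::uintptr_t p  = reinterpret_cast<std::uintptr_t>(published);
    std::uintptr_t lo = reinterpret_cast<std::uintptr_t>(stack.data());
    bool inside = p >= lo && p < lo + stack.size();

    *published = 41;                           // "send" into the parked coroutine's stack
    swapcontext(&main_ctx, &co_ctx);           // resume; the coroutine finishes

    return inside ? result : -1;
}
```

    Since every coroutine keeps its stack at a fixed address, no
    stack <-> heap copy is needed on park, and a write through the published
    pointer lands in the intended coroutine's variable.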
    
    This patch alone brings the following speedup at Python level:
    
     (on i7@2.6GHz)
    
    thread runtime:
    
        name             old time/op  new time/op  delta
        go               20.0µs ± 1%  15.6µs ± 1%  -21.84%  (p=0.000 n=10+10)
        chan             9.37µs ± 4%  2.89µs ± 6%  -69.12%  (p=0.000 n=10+10)
        select           20.2µs ± 4%   3.4µs ± 5%  -83.20%  (p=0.000 n=8+10)
        def              58.0ns ± 0%  60.0ns ± 0%   +3.45%  (p=0.000 n=8+10)
        func_def         43.8µs ± 1%  43.9µs ± 1%     ~     (p=0.796 n=10+10)
        call             62.4ns ± 1%  63.5ns ± 1%   +1.76%  (p=0.001 n=10+10)
        func_call        1.06µs ± 1%  1.05µs ± 1%   -0.63%  (p=0.002 n=10+10)
        try_finally       136ns ± 0%   137ns ± 0%   +0.74%  (p=0.000 n=9+10)
        defer            2.28µs ± 1%  2.33µs ± 1%   +2.34%  (p=0.000 n=10+10)
        workgroup_empty  48.2µs ± 1%  34.1µs ± 2%  -29.18%  (p=0.000 n=9+10)
        workgroup_raise  58.9µs ± 1%  45.5µs ± 1%  -22.74%  (p=0.000 n=10+10)
    
    gevent runtime:
    
        name             old time/op  new time/op  delta
        go               24.7µs ± 1%  15.9µs ± 1%  -35.72%  (p=0.000 n=9+9)
        chan             11.6µs ± 1%   7.3µs ± 1%  -36.74%  (p=0.000 n=10+10)
        select           22.5µs ± 1%  10.4µs ± 1%  -53.73%  (p=0.000 n=10+10)
        def              55.0ns ± 0%  55.0ns ± 0%     ~     (all equal)
        func_def         43.6µs ± 1%  43.6µs ± 1%     ~     (p=0.684 n=10+10)
        call             63.0ns ± 0%  64.0ns ± 0%   +1.59%  (p=0.000 n=10+10)
        func_call        1.06µs ± 1%  1.07µs ± 1%   +0.45%  (p=0.045 n=10+9)
        try_finally       135ns ± 0%   137ns ± 0%   +1.48%  (p=0.000 n=10+10)
        defer            2.31µs ± 1%  2.33µs ± 1%   +0.89%  (p=0.000 n=10+10)
        workgroup_empty  70.2µs ± 0%  55.8µs ± 0%  -20.63%  (p=0.000 n=10+10)
        workgroup_raise  90.3µs ± 0%  70.9µs ± 1%  -21.51%  (p=0.000 n=9+10)
    
    The whole Cython/nogil work, starting from 8fa3c15b (Start using Cython
    and providing Cython/nogil API) up to this patch, brings the following
    speedup at Python level:
    
     (on i7@2.6GHz)
    
    thread runtime:
    
        name             old time/op  new time/op  delta
        go               92.9µs ± 1%  15.6µs ± 1%  -83.16%  (p=0.000 n=10+10)
        chan             13.9µs ± 1%   2.9µs ± 6%  -79.14%  (p=0.000 n=10+10)
        select           29.7µs ± 6%   3.4µs ± 5%  -88.55%  (p=0.000 n=10+10)
        def              57.0ns ± 0%  60.0ns ± 0%   +5.26%  (p=0.000 n=10+10)
        func_def         44.0µs ± 1%  43.9µs ± 1%     ~     (p=0.055 n=10+10)
        call             63.5ns ± 1%  63.5ns ± 1%     ~     (p=1.000 n=10+10)
        func_call        1.06µs ± 0%  1.05µs ± 1%   -1.31%  (p=0.000 n=10+10)
        try_finally       139ns ± 0%   137ns ± 0%   -1.44%  (p=0.000 n=10+10)
        defer            2.36µs ± 1%  2.33µs ± 1%   -1.26%  (p=0.000 n=10+10)
        workgroup_empty  98.4µs ± 1%  34.1µs ± 2%  -65.32%  (p=0.000 n=10+10)
        workgroup_raise   135µs ± 1%    46µs ± 1%  -66.35%  (p=0.000 n=10+10)
    
    gevent runtime:
    
        name             old time/op  new time/op  delta
        go               68.8µs ± 1%  15.9µs ± 1%  -76.91%  (p=0.000 n=10+9)
        chan             14.8µs ± 1%   7.3µs ± 1%  -50.67%  (p=0.000 n=10+10)
        select           32.0µs ± 0%  10.4µs ± 1%  -67.57%  (p=0.000 n=10+10)
        def              58.0ns ± 0%  55.0ns ± 0%   -5.17%  (p=0.000 n=10+10)
        func_def         43.9µs ± 1%  43.6µs ± 1%   -0.53%  (p=0.035 n=10+10)
        call             63.5ns ± 1%  64.0ns ± 0%   +0.79%  (p=0.033 n=10+10)
        func_call        1.08µs ± 1%  1.07µs ± 1%   -1.74%  (p=0.000 n=10+9)
        try_finally       142ns ± 0%   137ns ± 0%   -3.52%  (p=0.000 n=10+10)
        defer            2.32µs ± 1%  2.33µs ± 1%   +0.71%  (p=0.005 n=10+10)
        workgroup_empty  90.3µs ± 0%  55.8µs ± 0%  -38.26%  (p=0.000 n=10+10)
        workgroup_raise   108µs ± 1%    71µs ± 1%  -34.64%  (p=0.000 n=10+10)
    
    This patch is the final patch in the series to reach the goal of
    providing channels that can be used in Cython/nogil code.
    
    Cython/nogil channels work is dedicated to the memory of Вера Павловна Супрун[4].
    
    [4] https://navytux.spb.ru/memory/%D0%A2%D1%91%D1%82%D1%8F%20%D0%92%D0%B5%D1%80%D0%B0.pdf#page=3