1. 12 Mar, 2025 1 commit
  2. 24 Feb, 2025 1 commit
    • gpython: Fix how --long-option=value=something is processed · 6aec4784
      Kirill Smelkov authored
      Carlos reports that when pymain is invoked as
          > python --login='uid=username,ou=people,dc=something,dc=de' my_script.py
      it crashes as
          Traceback (most recent call last):
            File "/srv/slapgrid/slappart5/software_release/bin/python", line 312, in <module>
            File "/opt/slapgrid/133edb8b6bfc135bce30900e2b50555e/parts/pygolang/gpython/__init__.py", line 113, in pymain
              for (opt, arg) in igetopt:
            File "/opt/slapgrid/133edb8b6bfc135bce30900e2b50555e/parts/pygolang/gpython/__init__.py", line 552, in __next__
              opt, arg = opt.split('=')
      While @jerome correctly notes that the problem here is that --login
      is passed to python instead of to my_script.py, it still highlights a
      problem on the gpython side, in its _IGetOpt parser: I made a thinko
      in 26058b5b (gpython: Factor-out options parsing into getopt-style
      _IGetOpt helper) by not considering that a value for
      --long-option=value could itself contain further '=' symbols.
      -> Fix this thinko.
      Without the fix gpython --unknown=x=y crashes as
          Traceback (most recent call last):
            File "/home/kirr/src/wendelin/venv/py39.venv/bin/gpython", line 8, in <module>
            File "/home/kirr/src/tools/go/pygolang-master/gpython/__init__.py", line 402, in main
              for (opt, arg) in igetopt:
            File "/home/kirr/src/tools/go/pygolang-master/gpython/__init__.py", line 562, in __next__
              opt, arg = opt.split('=')
          ValueError: too many values to unpack (expected 2)
      but after the fix it reports a more user-friendly
          RuntimeError: unexpected option --unknown
      /reported-and-reviewed-by @vnmabus
      /reported-on https://lab.nexedi.com/nexedi/pygolang/-/issues/1
      /reviewed-on nexedi/pygolang!32
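
The failure mode described above, and one way to avoid it, can be sketched in plain Python. This is only an illustration and not gpython's actual `_IGetOpt` code; the helper name `split_long_option` is hypothetical.

```python
def split_long_option(opt):
    """Split '--name=value' into (name, value); hypothetical helper.

    Only the first '=' separates the name from the value, so the value
    itself may contain further '=' characters.
    """
    name, sep, value = opt.partition('=')
    return name, (value if sep else None)


opt = "--login=uid=username,ou=people,dc=something,dc=de"

# An unbounded split yields more than two parts, so unpacking into two names fails.
try:
    name, value = opt.split('=')
except ValueError as e:
    print("split('=') failed:", e)

# Splitting on the first '=' only keeps the whole value intact.
print(split_long_option(opt))
print(split_long_option("--verbose"))
```

`opt.split('=', 1)` would work equally well for options that contain '='; `partition` is used here because it also handles options without any value.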
  3. 20 Feb, 2025 2 commits
    • Uniform UTF8-based approach to strings · 50b3808c
      Kirill Smelkov authored
      Context: together with Jérome we have been struggling with porting Zodbtools to
      Python3 for several years. Despite several incremental attempts[1,2,3]
      we are not there yet, the main difficulty being the backward compatibility breakage
      that Python3 introduced for bytes and unicode. During my last trial this spring, after
      I tried once again to finish this port and could not reach a satisfactory
      result, I finally decided to do something about this at the root of the
      cause: at the level of strings - where backward compatibility was broken - with
      the idea to fix everything once and for all.
      In 2018 in "Python 3 Losses: Nexedi Perspective"[4] and associated "cost
      overview"[5] Jean-Paul highlighted the problem of strings backward
      compatibility breakage, that Python 3 did, as the major one.
      In 2019 we had some conversations with Jérome about this topic as well[6,7].
      In 2020 I've started to approach it with `b` and `u` that provide
      always-working conversion in between bytes and unicode[8], and via limited
      usage of custom bytes- and unicode- like types that are interoperable with both
      bytes and unicode simultaneously[9].
      Today, with this work, I'm finally exposing those types for general usage, so
      that bytes/unicode problem could be handled automatically. The overview of the
      functionality is provided below:
      ---- 8< ----
      Pygolang, similarly to Go, provides a uniform UTF8-based approach to strings with
      the idea to make working with byte- and unicode- strings easy and transparently interoperable:
      - `bstr` is byte-string: it is based on `bytes` and can automatically convert to/from `unicode` (*).
      - `ustr` is unicode-string: it is based on `unicode` and can automatically convert to/from `bytes`.
      The conversion, in both encoding and decoding, never fails and never loses
      information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity
      even if bytes data is not valid UTF-8.
      Both `bstr` and `ustr` represent strings. They are two different *representations* of the same entity.
      Semantically `bstr` is an array of bytes, while `ustr` is an array of
      unicode-characters. Accessing their elements by `[index]` and iterating them yields a byte and
      a unicode character, respectively (+). However it is possible to yield unicode
      character when iterating `bstr` via `uiter`, and to yield byte character when
      iterating `ustr` via `biter`. In practice `bstr` + `uiter` is enough 99% of the
      time, and `ustr` only needs to be used for random access to string characters.
      See [Strings, bytes, runes and characters in Go](https://blog.golang.org/strings) for overview of this approach.
      Operations in between `bstr` and `ustr`/`unicode` / `bytes`/`bytearray` coerce to `bstr`, while
      operations in between `ustr` and `bstr`/`bytes`/`bytearray` / `unicode` coerce
      to `ustr`.  When the coercion happens, `bytes` and `bytearray`, similarly to
      `bstr`, are also treated as UTF8-encoded strings.
      `bstr` and `ustr` are meant to be drop-in replacements for standard
      `str`/`unicode` classes. They support all methods of `str`/`unicode` and in
      particular their constructors accept arbitrary objects and either convert or stringify them. For
      cases when no stringification is desired, and one only wants to convert
      `bstr`/`ustr` / `unicode`/`bytes`/`bytearray`, or an object with `buffer`
      interface (%), to Pygolang string, `b` and `u` provide a way to make sure an
      object is either `bstr` or `ustr`, respectively.
      Usage example:
         s  = b('привет')     # s is bstr corresponding to UTF-8 encoding of 'привет'.
         s += ' мир'          # s is b('привет мир')
         for c in uiter(s):   # c will iterate through
              ...             #     [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]
         # the following gives b('привет мир труд май')
         b('привет %s %s %s') % (u'мир',                  # raw unicode
                                 u'труд'.encode('utf-8'), # raw bytes
                                 u('май'))                # ustr
         def f(s):
            s = u(s)          # make sure s is ustr, decoding as UTF-8(^) if it was bstr, bytes, bytearray or buffer.
            ...               # (^) the decoding never fails nor loses information.
      (*) `unicode` on Python2, `str` on Python3.
      (+) ordinal of such byte and unicode character can be obtained via regular `ord`.
          For completeness `bbyte` and `uchr` are also provided for constructing 1-byte `bstr` and 1-character `ustr` from ordinal.
      (%) data in buffer, similarly to `bytes` and `bytearray`, is treated as UTF8-encoded string.
          Notice that only explicit conversion through `b` and `u` accept objects with buffer interface. Automatic coercion does not.
      ---- 8< ----
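
The "never fails, never loses information" property for bytes can be illustrated with the Python 3 standard library alone. The sketch below uses the stdlib `surrogateescape` error handler as an analogue; it is an assumption that this is close in spirit to what the log calls UTF-8b, and Pygolang's own scheme may differ in details. The helper names `decode_lossless` and `encode_lossless` are hypothetical.

```python
def decode_lossless(data):
    """bytes -> str; invalid UTF-8 bytes become lone surrogates U+DC80..U+DCFF."""
    return data.decode('utf-8', 'surrogateescape')


def encode_lossless(text):
    """str -> bytes; escaped surrogates are turned back into the original bytes."""
    return text.encode('utf-8', 'surrogateescape')


raw = 'мир'.encode('utf-8') + b'\xff'   # valid UTF-8 followed by an invalid byte
text = decode_lossless(raw)              # does not raise

# bytes -> text -> bytes gives the original data back.
assert encode_lossless(text) == raw
print(ascii(text))
```

Only the bytes→text→bytes direction is shown here; the stdlib handler alone does not give the full two-way guarantee that the overview above describes for `bstr` and `ustr`.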
      With this e.g. zodbtools is finally ported to Python3 easily[10].
      One note is that we change `b` and `u` to return `bstr`/`ustr` instead of
      `bytes`/`unicode`. This is a change in behaviour, but I hope it won't break
      anything. The reason for this is that the now-returned `bstr` and `ustr` are meant
      to be drop-in replacements for standard string types, and that there are not
      many existing `b` and `u` users. We just need to make sure that the places
      that already use `b` and `u` continue to work. Those include Zodbtools,
      Nxdtest[11], and lonet[12], which should continue to work ok.
      @klaus, you once said that you use `b` and `u` somewhere as well. Please do not
      hesitate to let me know if this change causes any issues for you, and we will,
      hopefully, try to find a solution.
      /cc @jerome, @klaus, @kazuhiko, @vpelletier, @yusei, @tatuya
      /reviewed-and-discussed-on !21
      [1] zodbtools!12
      [2] zodbtools!13
      [3] zodbtools!16
      [4] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
      [5] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20
      [6] zodbtools!8 (comment 73726)
      [7] zodbtools!13 (comment 81646)
      [8] bcb95cd5
      [9] edc7aaab
      [10] zodbtools@9861c136
      [11] https://lab.nexedi.com/nexedi/nxdtest
      [12] https://lab.nexedi.com/kirr/go123/blob/master/xnet/lonet/__init__.py
    • golang: Fix import on Windows · f59a785d
      Kirill Smelkov authored
      We already fixed DSO path setup for Windows in a5ce8175 (golang: Prepare
      path for libgolang.dll before importing _golang), but after 68f384a9 (*:
      Replace imp with importlib on py3) it got broken again because
      golang._gopath started to import golang.sync and import of
      golang._gopath was before dylink_prepare_dso call:
          (1.wenv-386) Z:\home\kirr\src\tools\go\pygo-win\pygolang>python
          Python 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:21:31) [MSC v.1936 32 bit (Intel)] on win32
          Type "help", "copyright", "credits" or "license" for more information.
          >>> import golang
          00f8:err:module:import_dll Library libgolang.dll (which is needed by L"Z:\\home\\kirr\\src\\tools\\go\\pygo-win\\pygolang\\golang\\_sync.cp311-win32.pyd") not found
          00f8:err:module:import_dll Library libpyxruntime.dll (which is needed by L"Z:\\home\\kirr\\src\\tools\\go\\pygo-win\\pygolang\\golang\\_sync.cp311-win32.pyd") not found
          Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
            File "Z:\home\kirr\src\tools\go\pygo-win\pygolang\golang\__init__.py", line 41, in <module>
              from golang._gopath import gimport  # make gimport available from golang
            File "Z:\home\kirr\src\tools\go\pygo-win\pygolang\golang\_gopath.py", line 65, in <module>
              from golang import sync
            File "Z:\home\kirr\src\tools\go\pygo-win\pygolang\golang\sync.py", line 36, in <module>
              from golang._sync import \
          ImportError: DLL load failed while importing _sync: The specified module could not be found.
      -> Fix that by doing dylink_prepare_dso('golang.runtime.libgolang')
      before importing anything else.
  4. 19 Feb, 2025 4 commits
    • golang_str: def -> cdef for UTF-8b decode/encode routines · 5bf08f8b
      Kirill Smelkov authored
      We use them only privately and only from pyx files.
    • fixup! golang_str_pickle: Fix it so that py3 can load what py2 saved and back · 0cc53fdd
      Kirill Smelkov authored
      Fix typo made in 1ec5ed82: in one place the comment was saying we are
      doing bstr(BINUNICODE) while in the pickle stream it was ustr() call.
    • golang_str_pickle: Fix bstr to pickle/unpickle in forward-compatible way wrt upcoming UTF-8bk · 5aa1de72
      Kirill Smelkov authored
      In 1ec5ed82 (golang_str_pickle: Fix it so that py3 can load what py2
      saved and back) we changed how bstr and ustr are pickled so that the
      pickling process is explicit and that both py2/py3 can load what any of
      py2/py3 saved. It all works ok for that.
      However for protocol < 3 bstr is pickled via unicode data, with
      instructions to unpickle it as bstr(unicode-data). The idea is generally ok,
      but taking into account planned introduction of UTF-8bk (see c0a53847
      "golang_str: TODO UTF-8bk" for details), it might result in bstr data
      saved before UTF-8b -> UTF-8bk switch, to become loaded in corrupt form
      after the switch.
      -> Care to avoid that by explicitly instructing pickle stream to always
      load data saved before the switch to UTF-8bk, as UTF-8b.
    • golang_str: Revert adding buffer interface to ustr · 9ef32517
      Kirill Smelkov authored
      Testing this change on upcoming gpython/py3 with str patched to be ustr
      revealed compatibility breakage in several places in the standard
      library. One example of such breakage is os.listdir, which, after
      doing PyObject_CheckBuffer, decides to return bytes instead of unicode in
      the result, which makes e.g. pytest fail with
          $ gpython -m pytest -vsx
            File ".../lib/python3.11/pathlib.py", line 370, in _select_from
              if self.match(name):
          TypeError: cannot use a string pattern on a bytes-like object
      This breakage was seen immediately, even without trying to run ERP5 on
      top of gpy3. So in general adding buffer interface to ustr is believed to
      break so much compatibility with standard unicode on py3 that we
      decided against it.
      -> Revert adding buffer interface to ustr.
      This effectively reverts 8a240b5b (golang_str: Fix ustr to provide
      buffer interface, like bstr already does), but leaves added/updated
      tests and comments there about why making memoryview(ustr) turned out to
      be not a good idea.
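
The standard-library behaviour that this breakage builds on can be seen without Pygolang: `os.listdir` returns names of the same kind as the path it was given. The sketch below only shows this documented str/bytes split with builtin types; it does not reproduce the ustr case itself.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Create one file so that the directory listing is not empty.
    open(os.path.join(d, 'a.txt'), 'w').close()

    as_text = os.listdir(d)                  # str path   -> str names
    as_bytes = os.listdir(os.fsencode(d))    # bytes path -> bytes names

print(as_text, as_bytes)
```

Code such as pathlib, which expects str names, then fails once it receives bytes names, as in the pytest traceback above.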
  5. 22 Dec, 2024 2 commits
    • golang_str: Fix ustr to provide buffer interface, like bstr already does · 8a240b5b
      Kirill Smelkov authored
      Kazuhiko reports that using base64.b64encode with ustr fails on py3:
          >>> base64.b64encode(b('a'))
          >>> base64.b64encode(u('a'))
          Traceback (most recent call last):
            File "<console>", line 1, in <module>
            File "/*/lib/python3.8/base64.py", line 58, in b64encode
              encoded = binascii.b2a_base64(s, newline=False)
          TypeError: a bytes-like object is required, not 'pyustr'
      which uncovers a thinko of mine: initially in 105d03d4 (golang_str: Add
      test for memoryview(bstr)) I made only bstr provide the buffer interface, while
      ustr did not provide it, on the wrong assumption that it contains unicode
      characters, not binary data. But to fully respect the promise that ustr can be
      automatically converted to bytes, ustr should also provide the buffer
      interface so that things like PyArg_Parse("s#") or PyArg_Parse("y") could
      accept it.
      While PyArg_Parse("s#") is not yet completely fixed by this patch, as
      it still reports UnicodeEncodeError for ustr corresponding to non-UTF8 data,
      adding buffer interface to ustr is still a step in the right direction
      because of the way e.g. base64.b64encode(u) is implemented:
          base64.b64encode(x)     ->  binascii.b2a_base64(x)
          binascii.b2a_base64(u)  ->  py2: PyArg_ParseTuple('s*', u)  ->  _PyUnicode_AsDefaultEncodedString(u)
                                      py3: PyObject_GetBuffer(u)      ->  u.tp_as_buffer.bf_getbuffer
      Here we see that on py3 it ends up retrieving the object's data via
      .tp_as_buffer.bf_getbuffer, and if there is no buffer interface provided, that
      will fail. But we can't let base64.b64encode(ustr) fail if
      base64.b64encode(bstr) works ok, because both bstr and ustr represent the
      same string entity, just in two different forms.
      -> So teach ustr to provide buffer interface so that e.g. memoryview starts to
         work on it and observe corresponding bytes data. This fixes
         base64.b64encode(ustr) on py3 and also fixes t_hash/py2, and y, y_star and
         y_hash test_strings_capi_getargs_to_cstr cases on py3.
      Note: the original unicode on py2 has:
          .bf_getreadbuf      -> []wchar  for     []UCS                                   ; used by buffer(u)
          .bf_getcharbuffer   -> []byte   for     encode([]UCS, sys.defaultencoding)      ; used by t#  and PyObject_AsCharBuffer
          .bf_getbuffer = 0                                                               ; used by memoryview(u)
      and on py3:
          .tp_as_buffer = 0
      /reported-by @kazuhiko
      /reported-at nexedi/pygolang!21 (comment 172595)
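
For reference, the py3 behaviour that the report runs into can be reproduced with builtin types only: `base64.b64encode` accepts any object that provides the buffer interface and rejects plain `str`. This sketch uses only the standard library and does not involve bstr or ustr.

```python
import base64

# Objects that provide the buffer interface are accepted.
from_bytes = base64.b64encode(b'a')
from_view = base64.b64encode(memoryview(b'a'))

# Builtin str provides no buffer interface on py3 and is rejected.
try:
    base64.b64encode('a')
    rejected = False
except TypeError as e:
    rejected = True
    print("str rejected:", e)

print(from_bytes, from_view)
```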
    • golang_str: Demonstrate that PyArg_Parse C-API rejects bstr and ustr in many cases · 99b9c59b
      Kirill Smelkov authored
      It was discovered that even though bstr and ustr implement __bytes__ and
      __unicode__ to coerce automatically to/from bytes, at C-API level many
      functions reject them. For example on py3 PyArg_ParseTuple("s") rejects
      bstr with
          TypeError: argument 1 must be str, not golang.bstr
      -> Add a test to demonstrate such rejections. We will be lifting the
      bstr/ustr acceptance level incrementally, in small steps.
      /reported-by @kazuhiko
      /reported-at nexedi/pygolang!21 (comment 172595)
  6. 20 Dec, 2024 24 commits
    • runtime: New package that mirrors Go's runtime (stub) · 902a93e9
      Kirill Smelkov authored
      Only runtime.OS and runtime.CC for now - exposed as strings from
      runtime/platform.h ifdefs.
    • *: Centralize detection of OS and compiler in golang/runtime/platform.h · 704cbb60
      Kirill Smelkov authored
      Up until now we had scattered ifdef __linux__, ifdef __APPLE__, ifdef
      _MSC_VER etc. And in the future we will need to detect more things with
      more involved conditions.
      -> Factor the code, that detects OS and compiler, into one place as a
      preparatory step for that.
    • golang_str_pickle: Also verify loads/save of standard *UNICODE opcodes · d2c36212
      Kirill Smelkov authored
      Because we rely on that for bstr/ustr load/save, and quote corresponding
      pickle strings in the test for bstr/ustr, and also because we will patch
      pickle's string processing later and it will be good to keep making sure we
      don't break standard builtin behaviour.
    • golang_str_pickle: Fix it so that py3 can load what py2 saved and back · 1ec5ed82
      Kirill Smelkov authored
      Since ebd18f3f (golang_str: bstr/ustr pickle support) bstr and ustr have
      support for pickling. However in that patch I verified only that it is
      possible to dump and load back an object on the same python
      version, which missed the fact that a bstr pickled on py2 cannot be
      loaded on py3:
      on py2:
          (z-dev) kirr@deca:~/src/tools/go/pygolang$ ipython
          Python 2.7.18 (default, Apr 28 2021, 17:39:59)
          In [1]: from golang import *
          In [2]: s = bstr('мир') + b'\xff'
          In [3]: s
          Out[3]: b(b'мир\xff')
          In [5]: import pickle
          In [6]: p = pickle.dumps(1)
          In [7]: p
          Out[7]: 'I1\n.'
          In [8]: import pickletools
          In [9]: p = pickle.dumps(s, 1)
          In [10]: p
          Out[10]: 'ccopy_reg\n_reconstructor\nq\x00(cgolang._golang\n_pybstr\nq\x01h\x01U\x07\xd0\xbc\xd0\xb8\xd1\x80\xffq\x02tq\x03Rq\x04.'
          In [11]: pickletools.dis(p)
              0: c    GLOBAL     'copy_reg _reconstructor'
             25: q    BINPUT     0
             27: (    MARK
             28: c        GLOBAL     'golang._golang _pybstr'
             52: q        BINPUT     1
             54: h        BINGET     1
             56: U        SHORT_BINSTRING '\xd0\xbc\xd0\xb8\xd1\x80\xff'
             65: q        BINPUT     2
             67: t        TUPLE      (MARK at 27)
             68: q    BINPUT     3
             70: R    REDUCE
             71: q    BINPUT     4
             73: .    STOP
          highest protocol among opcodes = 1
      on py3:
          (py39.venv) kirr@deca:~/src/tools/go/pygolang-master$ ipython
          Python 3.9.19+ (heads/3.9:40d77b93672, Apr 12 2024, 06:40:05)
          In [1]: from golang import *
          In [2]: import pickle
          In [3]: p = b'ccopy_reg\n_reconstructor\nq\x00(cgolang._golang\n_pybstr\nq\x01h\x01U\x07\xd0\xbc\xd0\xb8\xd1\x80\xffq\x02tq\x03Rq\x04.'
          In [4]: s = pickle.loads(p)
          UnicodeDecodeError                        Traceback (most recent call last)
          Cell In[4], line 1
          ----> 1 s = pickle.loads(p)
          UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
      which happens in the above example because pickling bstr relies on
      SHORT_BINSTRING opcode which is not really handled well on py3.
      -> Rework how bstr and ustr are pickled by taking full control of what
      we emit at which protocol level and how, and by asserting in tests that
      pickling produces exactly the expected data.
      This way we know that pickling bstr/ustr works the same way on both py2
      and py3 and, by also asserting that that data can be unpickled into
      the same string object, that both py2 and py3 can load what any of py2
      or py3 saved.
      For the reference the dump for above b(b'мир\xff') now becomes
          In [5]: p
          Out[5]: 'cgolang\nbstr\nq\x00(X\t\x00\x00\x00\xd0\xbc\xd0\xb8\xd1\x80\xed\xb3\xbfq\x01tq\x02Rq\x03.'
          In [7]: pickletools.dis(p)
              0: c    GLOBAL     'golang bstr'
             13: q    BINPUT     0
             15: (    MARK
             16: X        BINUNICODE u'\u043c\u0438\u0440\udcff'
             30: q        BINPUT     1
             32: t        TUPLE      (MARK at 15)
             33: q    BINPUT     2
             35: R    REDUCE
             36: q    BINPUT     3
             38: .    STOP
          highest protocol among opcodes = 1
      See comments in the code, and added golden vectors in the test for details.
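
The difference between the two opcodes can be reproduced on py3 with the standard `pickle` module alone. In this sketch the pickles are reduced to the string payload followed by STOP, with the payload bytes taken from the dumps above. Converting the loaded text back to bytes via `surrogateescape` is a stdlib analogue chosen for illustration, not necessarily how `golang.bstr` reconstructs its data.

```python
import pickle

raw = b'\xd0\xbc\xd0\xb8\xd1\x80\xff'       # UTF-8 for 'мир' plus an invalid byte

# Old form: SHORT_BINSTRING ('U' + 1-byte length + data) followed by STOP.
old = b'U\x07' + raw + b'.'
try:
    pickle.loads(old)                         # py3 decodes it as ASCII by default
    old_failed = False
except UnicodeDecodeError as e:
    old_failed = True
    print("SHORT_BINSTRING failed to load:", e)

# It loads only when the caller explicitly asks for bytes.
old_as_bytes = pickle.loads(old, encoding='bytes')

# New form: BINUNICODE ('X' + 4-byte little-endian length + data) followed by STOP.
new = b'X\t\x00\x00\x00\xd0\xbc\xd0\xb8\xd1\x80\xed\xb3\xbf.'
text = pickle.loads(new)                      # loads without any extra arguments
restored = text.encode('utf-8', 'surrogateescape')
print(ascii(text), restored)
```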
    • golang_str_pickle: tests: Add pickle normalization utility · 33576c9e
      Kirill Smelkov authored
      In the next patch we will start comparing dumped pickles for bstr/ustr
      to expected golden data. However there are many irrelevant differences
      in how different pickle modules, and protocols produce them:
      - cPickle and pickle.py differ from where they start to count *PUT
        indices. cPickle start to count them from 1, while pickle.py from 0:
        In [1]: s = '123'
        In [2]: cPickle.dumps(s)
        Out[2]: "S'123'\np1\n."
        In [3]: pickle.dumps(s)
        Out[3]: "S'123'\np0\n."
      - sometimes unused *PUT and MEMOIZE opcodes are emitted.
        see previous item for an example.
      We will filter out those details and compare the pickles to golden data
      after normalization. Normalization brings pickle data into a
      normalized form which, when loaded, still yields the same
      object as the original pickle, but where there are no unused *PUT/MEMOIZE
      opcodes and where *PUT indices always start from 0.
      -> Add pickle_normalize utility as a preparatory step for that.
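
Part of this idea, dropping *PUT opcodes that nothing refers to, is available in the standard library as `pickletools.optimize`. The sketch below uses it as an analogue only; the `pickle_normalize` utility added by this commit is the project's own code and is not shown here. The helper `opcode_names` is hypothetical.

```python
import pickle
import pickletools


def opcode_names(p):
    """Return the list of opcode names that make up pickle p."""
    return [op.name for op, arg, pos in pickletools.genops(p)]


p = pickle.dumps('123', 0)           # protocol 0 memoizes the string via PUT
opt = pickletools.optimize(p)        # drops PUT opcodes that have no matching GET

print(opcode_names(p))
print(opcode_names(opt))

# Both pickles load to the same object.
assert pickle.loads(opt) == pickle.loads(p)
```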
    • golang_str_pickle: tests: Verify it on data that contains invalid UTF-8 · ace402fb
      Kirill Smelkov authored
      Everything should work even if bytes data inside bstr/ustr is not valid UTF-8.
    • golang_str_pickle: tests: Verify all loads, load and Unpickler.load; same for dump* · 1139379f
      Kirill Smelkov authored
      Until now we were verifying only pickle.loads, but pickle.load and
      pickle.Unpickler.load were not covered by tests. Same for dumps vs dump
      and Pickler.dump.
      -> Add corresponding test coverage to make sure all those codepaths are
         working ok with bstr/ustr.
    • golang_str_pickle: Test it wrt all pickle modules we care about · 95fd2889
      Kirill Smelkov authored
      We want bstr/ustr pickling support to be robust. So we need to test it
      against all pickle modules that are in use. This includes python pickle
      version from stdlib (pickle.py), C pickle version from stdlib (cPickle
      on py2 and _pickle on py3) and, correspondingly, py and C versions from
      -> Adjust pickling tests to cover all those variants.
    • golang_str: Move everything related to pickling to golang_str_pickle.pyx · eec321e7
      Kirill Smelkov authored
      In the future we will be adding more functionality and tests related to
      pickling. So it makes sense to keep pickle-related functionality in its
      own unit.
      -> Move the code to golang_str_pickle* as a preparatory step for that.
    • golang_str: Fix pybstr/pyustr .tp_dealloc wrt upcoming str=bstr and unicode=ustr · b724d3ee
      Kirill Smelkov authored
      For pybstr/pyustr cython generates .tp_dealloc that refers to
      bytes/unicode types directly. That works ok in normal circumstances, but
      will lead to a crash when gpython starts patching builtin str and
      unicode types with bstr and ustr:
          (py39.venv) kirr@deca:~/src/tools/go/pygolang-master$ gpython
          Segmentation fault (core dumped)
          (py39.venv) kirr@deca:~/src/tools/go/pygolang-master$ gdb python core
          Core was generated by `/home/kirr/src/tools/go/py39.venv/bin/python3.9 /home/kirr/src/tools/go/py39.ve'.
          Program terminated with signal SIGSEGV, Segmentation fault.
          #0  0x00007f2edb247d5c in PyType_HasFeature (type=<error reading variable: Cannot access memory at address 0x7ffc6ca1bff8>,
              feature=<error reading variable: Cannot access memory at address 0x7ffc6ca1bff0>)
              at /home/kirr/local/py3.9/include/python3.9/object.h:622
          622     {
          (gdb) bt
          #0  0x00007f2edb247d5c in PyType_HasFeature (type=<error reading variable: Cannot access memory at address 0x7ffc6ca1bff8>,
              feature=<error reading variable: Cannot access memory at address 0x7ffc6ca1bff0>)
              at /home/kirr/local/py3.9/include/python3.9/object.h:622
          #1  0x00007f2edb2f4b28 in __pyx_tp_dealloc_6golang_7_golang__pyustr (o=0x7f2edae99030) at golang/_golang.cpp:88982
          #2  0x00007f2edb2f4bc8 in __pyx_tp_dealloc_6golang_7_golang__pyustr (o=0x7f2edae99030) at golang/_golang.cpp:88986
          #3  0x00007f2edb2f4bc8 in __pyx_tp_dealloc_6golang_7_golang__pyustr (o=0x7f2edae99030) at golang/_golang.cpp:88986
          #4  0x00007f2edb2f4bc8 in __pyx_tp_dealloc_6golang_7_golang__pyustr (o=0x7f2edae99030) at golang/_golang.cpp:88986
          #5  0x00007f2edb2f4bc8 in __pyx_tp_dealloc_6golang_7_golang__pyustr (o=0x7f2edae99030) at golang/_golang.cpp:88986
          #6  0x00007f2edb2f4bc8 in __pyx_tp_dealloc_6golang_7_golang__pyustr (o=0x7f2edae99030) at golang/_golang.cpp:88986
          #7  0x00007f2edb2f4bc8 in __pyx_tp_dealloc_6golang_7_golang__pyustr (o=0x7f2edae99030) at golang/_golang.cpp:88986
          #8  0x00007f2edb2f4bc8 in __pyx_tp_dealloc_6golang_7_golang__pyustr (o=0x7f2edae99030) at golang/_golang.cpp:88986
          #9  0x00007f2edb2f4bc8 in __pyx_tp_dealloc_6golang_7_golang__pyustr (o=0x7f2edae99030) at golang/_golang.cpp:88986
          #10 0x00007f2edb2f4bc8 in __pyx_tp_dealloc_6golang_7_golang__pyustr (o=0x7f2edae99030) at golang/_golang.cpp:88986
      -> Fix that crash by manually repointing .tp_dealloc of bstr/ustr to
      .tp_dealloc of original bytes and unicode.
    • golang_str: tests: Fix test_strings_methods wrt upcoming str=bstr and unicode=ustr · e6570490
      Kirill Smelkov authored
      When gpython starts patching builtin str and unicode types with bstr
      and ustr, the first argument to assertDeepEQ will have builtin str or
      unicode type and the existing
          assert not isinstance(a, (bstr, ustr))
      will break.
      -> Rewrite that assert as an equivalent check that does not
      break when str/unicode types are patched with bstr and ustr.
    • golang_str: tests: Fix test_strings_refcount wrt upcoming str=bstr and unicode=ustr · c596ab9c
      Kirill Smelkov authored
      When gpython starts patching builtin str and unicode types with bstr
      and ustr, it might be the case that b('abc') returns the same 'abc' object
      and so the logic in this test will become broken.
      -> Avoid that by keeping the original data in bytearray which for sure
         won't overlap with bytes/str nor unicode, regardless of whether those
         builtin types are patched or not.
    • golang_str: tests: Robustify xbytes/xunicode/xbytearray · 674b74e1
      Kirill Smelkov authored
      Assert that input belongs to the set of expected types.
      Assert that the output has exactly the type we promised.
      No change in functionality. We are now just more certain that those
      functions work as intended and could be relied upon.
    • golang_str: Fix bstr/ustr %-formatting wrt tuple subclass · fe5ab935
      Kirill Smelkov authored
      In 390fd810 (golang_str: bstr/ustr %-formatting) I implemented
      percent formatting but failed to handle tuple-subclass argv correctly.
      For example the following works with std string:
          In [1]: import collections as cc
          In [5]: Point = cc.namedtuple('Point', ['x', 'y'])
          In [9]: 'α %s %s π' % Point('β','γ')
          Out[9]: '\xce\xb1 \xce\xb2 \xce\xb3 \xcf\x80'
      while it fails with ustr:
          In [8]: ustr('α %s %s π') % Point('β','γ')
          TypeError                                 Traceback (most recent call last)
          <ipython-input-8-4f1a97267f2a> in <module>()
          ----> 1 ustr('α %s %s π') % Point('β','γ')
          /home/kirr/src/tools/go/pygolang/golang/_golang_str.pyx in golang._golang._pyustr.__mod__()
              850     # %-formatting
              851     def __mod__(a, b):
          --> 852         return pyu(pyb(a).__mod__(b))
              853     def __rmod__(b, a):
              854         # ("..." % x)  calls  "x.__rmod__()" for string subtypes
          /home/kirr/src/tools/go/pygolang/golang/_golang_str.pyx in golang._golang._pybstr.__mod__()
              473     # %-formatting
              474     def __mod__(a, b):
          --> 475         return _bprintf(a, b)
              476     def __rmod__(b, a):
              477         # ("..." % x)  calls  "x.__rmod__()" for string subtypes
          /home/kirr/src/tools/go/pygolang/golang/_golang_str.pyx in golang._golang._bprintf()
             1649     if isinstance(xarg, tuple):
          -> 1650         argv = xarg
             1651         xarg = _missing
          TypeError: Expected tuple, got Point
      -> Fix that.
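
The behaviour of the builtin string type that bstr/ustr need to match can be checked in pure Python on py3. The helper `format_args` below is hypothetical and only sketches one way to accept tuple subclasses; it is not the code from `_golang_str.pyx`, and it ignores mapping arguments used with `%(name)s`.

```python
import collections

Point = collections.namedtuple('Point', ['x', 'y'])


def format_args(xarg):
    """Hypothetical helper: build the argument vector for %-formatting."""
    if isinstance(xarg, tuple):
        return tuple(xarg)      # a plain tuple, also for tuple subclasses
    return (xarg,)              # a single non-tuple argument


p = Point('β', 'γ')

# The builtin str treats a namedtuple like any other tuple of arguments.
result = 'α %s %s π' % p
argv = format_args(p)
print(result, argv, type(argv))
```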
    • golang_str: tests: Make test_strings_mod_and_format more robust with upcoming unicode=ustr · 3f221568
      Kirill Smelkov authored
      Previously test_strings_mod_and_format was testing % and .format by
      comparing bstr and ustr results with the corresponding result for unicode. This
      works reasonably ok. However under gpython, when unicode is
      replaced with ustr, the test will no longer compare results of bstr/ustr
      methods with something good and external - indeed in that case e.g. the bstr/ustr
      result of % will be compared to the result of ustr %, which opens the
      door for bugs to stay unnoticed.
      -> Adjust the test, similarly to 9a075b17 (golang_str: tests: Make
      test_strings_methods more robust with upcoming unicode=ustr), to
      explicitly provide expected result for all entries in the test vector.
      We make sure those results are good and match std python because we also
      assert that unicode % and .format match it.
    • golang_str: Fix ustr.translate on sequence · d76d5e1a
      Kirill Smelkov authored
      NumPy uses s.translate(str) and under gpython/py3 with str patched to be
      ustr it breaks with:
            File ".../numpy-1.24.4-py3.9-linux-x86_64.egg/numpy/core/_string_helpers.py", line 40, in english_lower
              lowered = s.translate(LOWER_TABLE)
            File "golang/_golang_str.pyx", line 909, in golang._golang._pyustr.translate
          AttributeError: 'str' object has no attribute 'items'
      https://docs.python.org/3/library/stdtypes.html#str.translate documents
      translate to work on both mappings and sequences, so my usage of
      table.items() in ff24be3d (golang_str: bstr/ustr string methods) was not
      correct.
      -> Fix it by reworking ustr.translate to use our proxy mapping instead
      of going through all items of original table in the beginning.
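A minimal sketch of the proxy idea, using builtin str only. `ProxyTable` is a hypothetical name and not the class used in pygolang; it only shows that a table can be consulted lazily through `__getitem__`, which works for mappings and sequences alike:

```python
class ProxyTable:
    """Hypothetical lazy proxy over a translate table (mapping or sequence)."""

    def __init__(self, table):
        self.table = table

    def __getitem__(self, codepoint):
        # KeyError / IndexError propagate; str.translate treats LookupError
        # as "leave this character unchanged".
        value = self.table[codepoint]
        # A real implementation could convert `value` here if needed.
        return value


mapping_table = {ord("a"): "A", ord("b"): None}                 # mapping form
sequence_table = "".join(chr(i).upper() for i in range(128))   # sequence form

via_mapping = "abc".translate(ProxyTable(mapping_table))
via_sequence = "abc\u03b1".translate(ProxyTable(sequence_table))
```

Because nothing iterates over the original table, a plain str used as a table never needs an `.items()` method.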
    • golang_str: tests: Fix thinko wrt \u in tests · b31c5fa2
      Kirill Smelkov authored
      On py2 \u does not work in str literals - only in unicode ones.
      This corrects all tests that were doing x32 incorrectly due to the thinko.
    • golang_str: Fix bstr/ustr .__str__ to always return bstr/ustr even for subclasses · d4dcf5dd
      Kirill Smelkov authored
      This behaviour is provided by builtin str and we were not following it:
          $ python3
          Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
          Type "help", "copyright", "credits" or "license" for more information.
          >>> class SSS(str): pass
          >>> z = SSS('abc')
          >>> z
          >>> type(z)
          <class '__main__.SSS'>
          >>> q = str(z)
          >>> q
          >>> type(q)
          <class 'str'>
          >>> r = z.__str__()
          >>> r
          >>> type(r)
          <class 'str'>                       <-- NOTE str, not __main__.SSS
          $ gpython               # with str patched to be ustr
          >>> class SSS(str): pass
          >>> z = SSS('abc')
          >>> z
          >>> type(z)
          <class '__main__.SSS'>
          >>> q = str(z)
          >>> q
          >>> type(q)
          <class 'str'>
          >>> r = z.__str__()
          >>> r
          >>> type(r)
          <class '__main__.SSS'>              <-- NOTE not str
      which leads to crash during IPython startup on py3.11:
          $ gpython -m IPython    # with str patched to be ustr
          Traceback (most recent call last):
            File "/home/kirr/src/tools/go/py3.venv/bin/gpython", line 8, in <module>
            File "/home/kirr/src/tools/go/pygolang-master/gpython/__init__.py", line 478, in main
              pymain(argv, init)
            File "/home/kirr/src/tools/go/pygolang-master/gpython/__init__.py", line 291, in pymain
            File "/home/kirr/src/tools/go/pygolang-master/gpython/__init__.py", line 162, in run
            File "<frozen runpy>", line 198, in _run_module_as_main
            File "<frozen runpy>", line 88, in _run_code
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/IPython/__main__.py", line 15, in <module>
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/IPython/__init__.py", line 128, in start_ipython
              return launch_new_instance(argv=argv, **kwargs)
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/traitlets/config/application.py", line 1042, in launch_instance
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/traitlets/config/application.py", line 113, in inner
              return method(app, *args, **kwargs)
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/IPython/terminal/ipapp.py", line 279, in initialize
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/IPython/terminal/ipapp.py", line 293, in init_shell
              self.shell = self.interactive_shell_class.instance(parent=self,
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/traitlets/config/configurable.py", line 551, in instance
              inst = cls(*args, **kwargs)
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/IPython/terminal/interactiveshell.py", line 856, in __init__
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/IPython/terminal/interactiveshell.py", line 648, in init_prompt_toolkit_cli
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/IPython/terminal/interactiveshell.py", line 751, in _extra_prompt_options
              "lexer": IPythonPTLexer(),
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/IPython/terminal/ptutils.py", line 177, in __init__
              self.python_lexer = PygmentsLexer(l.Python3Lexer)
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/prompt_toolkit/lexers/pygments.py", line 198, in __init__
              self.pygments_lexer = pygments_lexer_cls(
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/pygments/lexer.py", line 647, in __call__
              cls._tokens = cls.process_tokendef('', cls.get_tokendefs())
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/pygments/lexer.py", line 586, in process_tokendef
              cls._process_state(tokendefs, processed, state)
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/pygments/lexer.py", line 549, in _process_state
              tokens.extend(cls._process_state(unprocessed, processed,
            File "/home/kirr/src/tools/go/py3.venv/lib/python3.11/site-packages/pygments/lexer.py", line 533, in _process_state
              assert type(state) is str, "wrong state name %r (%r)" % (state, type(state))
          AssertionError: wrong state name 'keywords' (<class 'pygments.lexer.include'>)
          If you suspect this is an IPython 8.12.0 bug, please report it at:
          or send an email to the mailing list at ipython-dev@python.org
          You can print a more detailed traceback right now with "%tb", or use "%debug"
          to interactively debug it.
          Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
      Here pygments defines
          class include(str):
      and wants `str(obj)` to return str, not include, if obj is an instance of include.
      -> Adjust bstr/ustr .__str__() to always return bstr/ustr even for
      subclasses.
      For consistency, do the same for .__unicode__ . In case a
      subclass wants its __str__, or __unicode__ to return self
      without casting to bstr/ustr, it can override those methods.
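The builtin behaviour being matched can be checked with a short sketch. `Include` and `KeepsType` are hypothetical stand-ins written for this illustration; the first mimics a subclass such as pygments' `include`, the second shows the opt-out mentioned above:

```python
class Include(str):
    """Stand-in for a plain str subclass."""


class KeepsType(str):
    """Subclass that deliberately overrides __str__ to return itself."""

    def __str__(self):
        return self


obj = Include("keywords")
via_str = str(obj)            # conversion through the type
via_dunder = obj.__str__()    # direct call of the inherited method

kept = KeepsType("state").__str__()
```

For the plain subclass both conversions give exactly `str`, while the overriding subclass keeps its own type.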
    • golang_str: Fix bstr/ustr __add__ and friends to return NotImplemented wrt unsupported types · aa5d2f91
      Kirill Smelkov authored
      In bbbb58f0 (golang_str: bstr/ustr support for + and *) I've added
      support for binary string operations, but, similarly to __eq__, did not
      correctly handle the case of arbitrary arguments that potentially
      define __radd__ and similar.
      As a result it breaks when running e.g. bstr + pyparsing.Regex
            File ".../pyparsing-2.4.7-py2.7.egg/pyparsing.py", line 6591, in pyparsing_common
              _full_ipv6_address = (_ipv6_part + (':' + _ipv6_part) * 7).setName("full IPv6 address")
            File "golang/_golang_str.pyx", line 469, in golang._golang._pybstr.__add__
              return pyb(zbytes.__add__(a, _pyb_coerce(b)))
            File "golang/_golang_str.pyx", line 243, in golang._golang._pyb_coerce
              raise TypeError("b: coerce: invalid type %s" % type(x))
          TypeError: b: coerce: invalid type <class 'pyparsing.Regex'>
      because pyparsing.Regex is a type that does not inherit from str, but defines
      its own __radd__ to handle str + Regex as Regex.
      -> Fix it by returning NotImplemented from __add__ and other operations
      where it is needed, so that bstr and ustr behave in the same way as builtin str
      wrt third types, while taking care to keep the bstr/ustr promise that
          only explicit conversion through `b` and `u` accept objects with buffer interface. Automatic coercion does not.
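The protocol involved can be illustrated with two toy classes. `ToyStr` and `ToyPattern` are hypothetical and stand in for bstr and for a third-party type such as pyparsing.Regex; this is a sketch of the Python mechanism, not the pygolang code:

```python
class ToyStr:
    """Hypothetical string wrapper used only for illustration."""

    def __init__(self, s):
        self.s = s

    def __add__(self, other):
        if isinstance(other, ToyStr):
            return ToyStr(self.s + other.s)
        if isinstance(other, str):
            return ToyStr(self.s + other)
        # Unknown type: let Python try other.__radd__(self) next.
        return NotImplemented


class ToyPattern:
    """Hypothetical third-party type that handles `string + pattern` itself."""

    def __init__(self, text):
        self.text = text

    def __radd__(self, other):
        prefix = other.s if isinstance(other, ToyStr) else str(other)
        return ToyPattern(prefix + self.text)


combined = ToyStr(":") + ToyPattern("[0-9a-f]+")

# With no __radd__ on the right-hand side Python itself raises TypeError.
try:
    ToyStr("a") + 1
    unsupported = None
except TypeError as exc:
    unsupported = exc
```

Raising TypeError directly inside `__add__` would skip the `__radd__` step, which is the failure mode shown in the traceback above.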
    • golang_str: Fix bstr/ustr __eq__ and friends to return NotImplemented wrt non-string types · 09694757
      Kirill Smelkov authored
      In 54c2a3cf (golang_str: Teach bstr/ustr to compare wrt any
      string with automatic coercion) I've added __eq__, __ne__, __lt__ etc
      methods to our strings, but made __lt__ and the other comparisons raise
      TypeError against any non-string type. My idea was to mimic user-visible
      py3 behaviour such as
          >>> "abc" > 1
          Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
          TypeError: '>' not supported between instances of 'str' and 'int'
      However it turned out that the implementation was not exactly matching
      what Python is doing internally, which led to incorrect behaviour when
      bstr or ustr is compared wrt another type with its own __cmp__. In the
      general case for `a op b` Python first queries a.__op__(b) and
      b.__op'__(a) and sometimes other methods before going to .__cmp__. This
      relies on the methods returning NotImplemented instead of raising an
      exception: if a trial raises TypeError everything is stopped and that
      TypeError is propagated to the caller.
      Jérome reports a real breakage due to this when bstr is compared wrt
      distutils.version.LooseVersion . LooseVersion is basically
          class LooseVersion(Version):
              def __cmp__ (self, other):
                  if isinstance(other, StringType):
                      other = LooseVersion(other)
                  return cmp(self.version, other.version)
      but due to my thinko on `LooseVersion < bstr` the control flow was not
      getting into that LooseVersion.__cmp__ because bstr.__gt__ was tried
      first and raised TypeError.
      -> Fix all comparison operations to return NotImplemented instead of
      raising TypeError and make sure in the tests that this behaviour exactly
      matches what native str type does.
      The fix is needed not only for py2 because added test_strings_cmp_wrt_distutils_LooseVersion
      was failing on py3 as well without the fix.
      /reported-by @jerome
      /reported-on nexedi/slapos!1575 (comment 206080)
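The dispatch order described in this message can be reproduced on py3 with rich comparisons only. All class names below are hypothetical; `ToyVersion` merely imitates a type that knows how to compare itself with strings:

```python
class ToyStr:
    """Hypothetical string type whose comparison defers on unknown types."""

    def __init__(self, s):
        self.s = s

    def __gt__(self, other):
        if not isinstance(other, ToyStr):
            return NotImplemented       # lets Python try other.__lt__(self)
        return self.s > other.s


class RaisingStr(ToyStr):
    """Variant showing the problem: raising stops the dispatch immediately."""

    def __gt__(self, other):
        if not isinstance(other, ToyStr):
            raise TypeError("unsupported comparison")
        return self.s > other.s


class ToyVersion:
    """Hypothetical version type that can compare itself with ToyStr."""

    def __init__(self, v):
        self.parts = tuple(int(p) for p in v.split("."))

    def __lt__(self, other):
        if isinstance(other, ToyStr):
            other = ToyVersion(other.s)
        if not isinstance(other, ToyVersion):
            return NotImplemented
        return self.parts < other.parts


# ToyStr.__gt__ returns NotImplemented, so ToyVersion.__lt__ gets its turn.
deferred = ToyStr("1.10") > ToyVersion("1.2")

try:
    RaisingStr("1.10") > ToyVersion("1.2")
    raised = None
except TypeError as exc:
    raised = exc
```

Here the version-aware comparison gives `(1, 2) < (1, 10)`, a result the raising variant never reaches.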
    • golang_str: Add ustr.decode for symmetry with bstr.decode and because gpy2 breaks without it · da4b857b
      Kirill Smelkov authored
      Without working unicode.decode gpython/py2 with unicode replaced by ustr
      fails when running ERP5 as follows:
          $ /srv/slapgrid/slappart49/t/ekg/i/5/bin/runTestSuite --help
          No handlers could be found for logger "SecurityInfo"
          Traceback (most recent call last):
            File "/srv/slapgrid/slappart49/t/ekg/soft/b5048b47894a7612651c7fe81c2c8636/bin/.runTestSuite.pyexe", line 296, in <module>
            File "/srv/slapgrid/slappart49/t/ekg/soft/b5048b47894a7612651c7fe81c2c8636/parts/pygolang/gpython/__init__.py", line 484, in main
              pymain(argv, init)
            File "/srv/slapgrid/slappart49/t/ekg/soft/b5048b47894a7612651c7fe81c2c8636/parts/pygolang/gpython/__init__.py", line 292, in pymain
            File "/srv/slapgrid/slappart49/t/ekg/soft/b5048b47894a7612651c7fe81c2c8636/parts/pygolang/gpython/__init__.py", line 192, in run
              _execfile(filepath, mmain.__dict__)
            File "/srv/slapgrid/slappart49/t/ekg/soft/b5048b47894a7612651c7fe81c2c8636/parts/pygolang/gpython/__init__.py", line 339, in _execfile
              six.exec_(code, globals, locals)
            File "/srv/slapgrid/slappart49/t/ekg/soft/b5048b47894a7612651c7fe81c2c8636/eggs/six-1.16.0-py2.7.egg/six.py", line 735, in exec_
              exec("""exec _code_ in _globs_, _locs_""")
            File "<string>", line 1, in <module>
            File "/srv/slapgrid/slappart49/t/ekg/soft/b5048b47894a7612651c7fe81c2c8636/bin/runTestSuite", line 10, in <module>
              from Products.ERP5Type.tests.runTestSuite import main; sys.exit(main())
            File "/srv/slapgrid/slappart49/t/ekg/soft/b5048b47894a7612651c7fe81c2c8636/parts/erp5/product/ERP5Type/__init__.py", line 96, in <module>
              from . import ZopePatch
            File "/srv/slapgrid/slappart49/t/ekg/soft/b5048b47894a7612651c7fe81c2c8636/parts/erp5/product/ERP5Type/ZopePatch.py", line 75, in <module>
              from Products.ERP5Type.patches import ZopePageTemplateUtils
            File "/srv/slapgrid/slappart49/t/ekg/soft/b5048b47894a7612651c7fe81c2c8636/parts/erp5/product/ERP5Type/patches/ZopePageTemplateUtils.py", line 58, in <module>
              convertToUnicode(u'', 'text/xml', ())
            File "/srv/slapgrid/slappart49/t/ekg/soft/b5048b47894a7612651c7fe81c2c8636/eggs/Zope-4.8.9+slapospatched002-py2.7.egg/Products/PageTemplates/utils.py", line 73, in convertToUnicode
              return source.decode(encoding), encoding
          AttributeError: unreadable attribute
      and in general, if we treat bstr and ustr as two different
      representations of the same entity, then if we have bstr.decode, having
      ustr.decode is also needed for symmetry, with both operations converting
      the bytes representation of the string into unicode.
      Now there is full symmetry in between bstr/ustr and encode/decode. Quoting updated encode/decode text:
          Encode encodes unicode representation of the string into bytes, leaving string domain.
          Decode decodes bytes   representation of the string into ustr, staying inside string domain.
          Both bstr and ustr are accepted by encode and decode treating them as two
          different representations of the same entity.
          On encoding, for bstr, the string representation is first converted to
          unicode and encoded to bytes from there. For ustr unicode representation
          of the string is directly encoded.
          On decoding, for ustr, the string representation is first converted to
          bytes and decoded to unicode from there. For bstr bytes representation of
          the string is directly decoded.
    • golang_str: Adjust bstr/ustr .encode() and .__bytes__ to leave string domain into bytes · 6f26b32c
      Kirill Smelkov authored
      Initially in 023907ee (golang_str: bstr/ustr encode/decode) I
      implemented things in such a way that (b|u)str.__bytes__ were giving
      bstr and ustr.encode() was giving bstr as well. My logic here was that
      bstr is based on bytes and it is ok to give that.
      However this logic did not pass backward compatibility test: for example
      when LXML is imported it does
          cdef bytes _FILENAME_ENCODING = (sys.getfilesystemencoding() or sys.getdefaultencoding() or 'ascii').encode("UTF-8")
      and under gpython/py3 with unicode patched to be ustr it breaks with
            File "/srv/slapgrid/slappart47/srv/runner/software/7f1663e8148f227ce3c6a38fc52796e2/bin/runwsgi", line 4, in <module>
              from Products.ERP5.bin.zopewsgi import runwsgi; sys.exit(runwsgi())
            File "/srv/slapgrid/slappart47/srv/runner/software/7f1663e8148f227ce3c6a38fc52796e2/parts/erp5/product/ERP5/__init__.py", line 36, in <module>
              from Products.ERP5Type.Utils import initializeProduct, updateGlobals
            File "/srv/slapgrid/slappart47/srv/runner/software/7f1663e8148f227ce3c6a38fc52796e2/parts/erp5/product/ERP5Type/__init__.py", line 42, in <module>
              from .patches import pylint
            File "/srv/slapgrid/slappart47/srv/runner/software/7f1663e8148f227ce3c6a38fc52796e2/parts/erp5/product/ERP5Type/patches/pylint.py", line 524, in <module>
              __import__(module_name, fromlist=[module_name], level=0))
            File "src/lxml/sax.py", line 18, in init lxml.sax
            File "src/lxml/etree.pyx", line 154, in init lxml.etree
          TypeError: Expected bytes, got golang.bstr
      The breakage highlights a thinko in my previous reasoning: yes bstr is based on
      bytes, but bstr has different semantics compared to bytes: even though e.g.
      __getitem__ works the same way for bytes on py2, it works differently compared
      to py3. This way if on py3 a program is doing bytes(x) or x.encode() it then
      expects the result to have bytes semantics of current python which is not the
      case if the result is bstr.
      -> Fix that by adjusting .encode() and .__bytes__() to produce bytes type of
         current python and leave string domain.
      For some time I was contemplating introducing a third type, e.g.
      bvec, also based on bytes but having bytes semantic, with bvec.decode
      returning back to pygolang strings domain. But due to the fact that bytes semantic
      is different in between py2 and py3, it would mean that bvec provided by
      pygolang would need to behave differently depending on current python
      version, which is undesirable.
      In the end, by leaving into native bytes, the "bytes inconsistency" problem is
      left to remain under std python, with pygolang targeting only to fix strings
      inconsistency in between py2 and py3 and providing the same semantic for
      bstr and ustr on all python versions.
      It also does not harm that bytes.decode() returns std unicode instead of ustr:
      for programs that run under unpatched python we have u() to convert the result
      to ustr, while under gpython std unicode is actually ustr which makes
      bytes.decode() behaviour still quite ok.
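The py3 bytes semantic that callers of `.encode()` rely on can be seen with builtin types alone. This sketch assumes Python 3 and does not use pygolang; variable names are illustrative:

```python
text = "\u03b1bc"                    # 'αbc'
encoded = text.encode("utf-8")       # native bytes on py3

# On py3 indexing and iterating bytes yield integers ...
first_byte = encoded[0]
byte_items = list(encoded)

# ... while slicing keeps the bytes type, and str indexing yields str.
first_slice = encoded[0:1]
first_char = text[0]
```

Code written against this behaviour would misbehave if `encoded` were a string-like type whose items are not integers.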
      P.S. we enable bstr.encode for consistency and because under py2, if not
      enabled, it will break when running pytest under gpython in
                File ".../_pytest/assertion/rewrite.py", line 352, in <module>
                  RN = "\r\n".encode("utf-8")
              AttributeError: unreadable attribute
    • golang_str: Fix iter(bstr) to yield byte instead of unicode character · 8d76276c
      Kirill Smelkov authored
      In a72c1c1a (golang_str: bstr/ustr iteration) things were initially
      implemented to follow Go semantic exactly with bytestring iteration
      yielding unicode characters as explained in
      https://blog.golang.org/strings. However this makes bstr not a 100%
      drop-in compatible replacement for std str under py2, and even though my
      initial testing was saying this change does not affect programs in
      practice it turned out to be not the case.
      For example with bstr.__iter__ yielding unicode characters running
      gpython on py2 with builtin str patched to be bstr will break sometimes
      when importing uuid:
      There uuid reads 16 bytes from /dev/random and then wants to iterate
      those 16 bytes as single bytes and then expects that the length
      of the resulting sequence is exactly 16:
           int = long(('%02x'*16) % tuple(map(ord, bytes)), 16)
           ( https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/uuid.py#L147 )
      which breaks if some of the read bytes are higher than 0x7f.
      Even though this particular problem could be worked-around with
      patching uuid, there is no evidence that there will be no similar
      problems later, which could be many.
      -> So adjust bstr semantic instead to follow semantic of str under py2
         and introduce uiter() primitive to still be able to iterate
         bytestrings as unicode characters.
      This hopefully makes bstr fully compatible with str on py2 while
      still providing a reasonably good approach for processing strings the
      Go-way when needed.
      Add biter as well for symmetry.
      See
          !21 (comment 170754)
          !21 (comment 170782)
          !21 (comment 206044)
      for discussion on iter(bstr) topic.
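Why character-wise iteration breaks the uuid pattern can be illustrated on py3 with builtin types. The sample data and variable names are assumptions made for this sketch, and pygolang's uiter/biter are not used here:

```python
# 16 bytes where the first two form the UTF-8 encoding of one character ('α').
raw = bytes([0xCE, 0xB1]) + bytes(range(14))

by_byte = list(raw)                     # 16 items, one per byte
by_char = list(raw.decode("utf-8"))     # 15 items, one per unicode character

fmt = "%02x" * 16                       # same shape as the uuid format string
hex_ok = fmt % tuple(by_byte)

# Feeding 15 code points into a 16-slot format fails.
try:
    fmt % tuple(map(ord, by_char))
    char_error = None
except TypeError as exc:
    char_error = exc
```

Any byte above 0x7f that starts a multi-byte sequence shrinks the character count, so code expecting one item per byte stops working.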
    • strconv: Optimize quoting lightly · a11cb5dc
      Kirill Smelkov authored
      Add type annotations and use C-level objects instead of py-ones where it
      is easy to do. We are not all-good yet, but this already brings some noticeable speedup:
          name                 old time/op  new time/op  delta
          quote[a]              786µs ± 1%    10µs ± 0%  -98.76%  (p=0.016 n=4+5)
          quote[\u03b1]        1.12ms ± 0%  0.41ms ± 0%  -63.37%  (p=0.008 n=5+5)
          quote[\u65e5]         738µs ± 2%   258µs ± 0%  -65.07%  (p=0.016 n=4+5)
          quote[\U0001f64f]     920µs ± 1%    78µs ± 0%  -91.46%  (p=0.016 n=5+4)
          stdquote             1.19µs ± 0%  1.19µs ± 0%     ~     (p=0.794 n=5+5)
          unquote[a]           1.08ms ± 0%  1.08ms ± 1%     ~     (p=0.548 n=5+5)
          unquote[\u03b1]       797µs ± 0%   807µs ± 1%   +1.23%  (p=0.008 n=5+5)
          unquote[\u65e5]       522µs ± 0%   520µs ± 1%     ~     (p=0.056 n=5+5)
          unquote[\U0001f64f]  3.21ms ± 0%  3.14ms ± 0%   -2.13%  (p=0.008 n=5+5)
          stdunquote            815ns ± 0%   836ns ± 0%   +2.63%  (p=0.008 n=5+5)
  7. 16 Dec, 2024 6 commits
    • golang, strconv: Switch them to cimport each other at pyx level · e5c513bf
      Kirill Smelkov authored
      Since 50b8cb7e (strconv: Move functionality related to UTF8
      encode/decode into _golang_str) both golang_str and strconv import each
      other.
      Before this patch that import was done at py level at runtime from
      outside to work around the import cycle. As a result strconv
      functionality is not available while golang is only being imported.
      So far it was not a problem, but when builtin string types will become
      patched with bstr and ustr, that will become a problem because string
      repr starts to be used at import time, which for pybstr is implemented
      via strconv.quote .
      -> Fix this by switching golang and strconv to cimport each other at pyx
      level. There, similarly to C, the cycle works just ok out of the box.
      This also automatically helps performance a bit:
          name                 old time/op  new time/op  delta
          quote[a]              805µs ± 0%   786µs ± 1%   -2.40%  (p=0.016 n=5+4)
          quote[\u03b1]        1.21ms ± 0%  1.12ms ± 0%   -7.47%  (p=0.008 n=5+5)
          quote[\u65e5]         785µs ± 0%   738µs ± 2%   -5.97%  (p=0.016 n=5+4)
          quote[\U0001f64f]    1.04ms ± 0%  0.92ms ± 1%  -11.73%  (p=0.008 n=5+5)
          stdquote             1.18µs ± 0%  1.19µs ± 0%   +0.54%  (p=0.008 n=5+5)
          unquote[a]           1.26ms ± 0%  1.08ms ± 0%  -14.66%  (p=0.008 n=5+5)
          unquote[\u03b1]       911µs ± 1%   797µs ± 0%  -12.55%  (p=0.008 n=5+5)
          unquote[\u65e5]       592µs ± 0%   522µs ± 0%  -11.81%  (p=0.008 n=5+5)
          unquote[\U0001f64f]  3.46ms ± 0%  3.21ms ± 0%   -7.34%  (p=0.008 n=5+5)
          stdunquote            812ns ± 1%   815ns ± 0%     ~     (p=0.183 n=5+5)
    • strconv: Move it to pyx · 2684dc94
      Kirill Smelkov authored
      So far this is plain code movement with no type annotations added and
      internal from-strconv imports still being done via py level.
      As expected this does not help practically for performance yet:
          name                 old time/op  new time/op  delta
          quote[a]              910µs ± 0%   805µs ± 0%  -11.54%  (p=0.008 n=5+5)
          quote[\u03b1]        1.23ms ± 0%  1.21ms ± 0%   -1.24%  (p=0.008 n=5+5)
          quote[\u65e5]         800µs ± 0%   785µs ± 0%   -1.86%  (p=0.016 n=4+5)
          quote[\U0001f64f]    1.06ms ± 1%  1.04ms ± 0%   -1.92%  (p=0.008 n=5+5)
          stdquote             1.17µs ± 0%  1.18µs ± 0%   +0.80%  (p=0.008 n=5+5)
          unquote[a]           1.33ms ± 1%  1.26ms ± 0%   -5.13%  (p=0.008 n=5+5)
          unquote[\u03b1]       952µs ± 2%   911µs ± 1%   -4.25%  (p=0.008 n=5+5)
          unquote[\u65e5]       613µs ± 2%   592µs ± 0%   -3.48%  (p=0.008 n=5+5)
          unquote[\U0001f64f]  3.62ms ± 1%  3.46ms ± 0%   -4.32%  (p=0.008 n=5+5)
          stdunquote            788ns ± 0%   812ns ± 1%   +3.07%  (p=0.016 n=4+5)
    • unicode/utf8: Start of the package (stub) · cd69a8ad
      Kirill Smelkov authored
      We will soon need to use error rune codepoint from both golang_str.pyx
      and strconv.pyx - so we need to move that definition into shared place.
      What fits best is unicode/utf8, so start that package and move the
      constant there.
    • *: uint8_t -> byte, unicode-codepoint -> rune · bd662e01
      Kirill Smelkov authored
      We added byte and rune types in the previous patch. Let's use them now
      throughout whole codebase where appropriate.
      Currently the only place where unicode-codepoint is used is
      _utf8_decode_rune. uint8_t was used in many places.
    • golang, libgolang: Add byte / rune types · 7505febc
      Kirill Smelkov authored
      Those types are the base when working with byte- and unicode strings.
      It will be clearer to use them explicitly instead of uint8_t and int32_t
      when processing strings.
    • strconv: Add benchmarks for quote and unquote · 23f0a47c
      Kirill Smelkov authored
      These functions are currently relatively slow. They were initially used
      in zodbdump and zodbrestore, where their speed did not matter much, but
      with bstr and ustr, since e.g. quote is used in repr, not having them to
      perform with speed similar to builtin string escaping starts to be an
      issue. Tatuya Kamada reports at nexedi/pygolang!21 (comment 170833) :
          ### 3. `u` seems slow with large arrays especially when `repr` it
          I have faced a slowness while testing `u`, `b` with python 2.7, especially with `repr`.
          >>> timeit.timeit("from golang import b,u; u('あ'*199998)", number=10)
          >>> timeit.timeit("from golang import b,u; repr(u('あ'*199998))", number=10)
          `bytes`(str) is very fast.
          >>> timeit.timeit("from golang import b,u; bytes('あ'*199998)", number=10)
          >>> timeit.timeit("from golang import b,u; repr(bytes('あ'*199998))", number=10)
          `b` is much faster than `u`, but still the repr seems slow.
          >>> timeit.timeit("from golang import b,u; b('あ'*199998)", number=10)
          >>> timeit.timeit("from golang import b,u; repr(b('あ'*199998))", number=10)
      The "repr" part of this problem is due to that both bstr.__repr__ and
      ustr.__repr__ use custom quoting routines which currently are implemented in
      pure python in strconv module:
      The fix would be to move strconv.py to Cython and to correspondingly rework it
      to avoid using python-level constructs during quoting internally.
      Working on that was not a priority, but soon I will need to move strconv to
      Cython for another reason: to be able to break import cycle in between _golang
      and strconv.
      So it makes sense to add strconv benchmark first - since we'll start moving it
      to Cython anyway - to see where we are and how further changes will help
      Currently we are at
          name                 time/op
          quote[a]              910µs ± 0%
          quote[\u03b1]        1.23ms ± 0%
          quote[\u65e5]         800µs ± 0%
          quote[\U0001f64f]    1.06ms ± 1%
          stdquote             1.17µs ± 0%
          unquote[a]           1.33ms ± 1%
          unquote[\u03b1]       952µs ± 2%
          unquote[\u65e5]       613µs ± 2%
          unquote[\U0001f64f]  3.62ms ± 1%
          stdunquote            788ns ± 0%
      i.e. on py2 quoting is ~ 1000x slower than builtin string escaping, and unquoting is
      even slower.
      On py3 the situation is better, but still not good:
          name                 time/op
          quote[a]              579µs ± 1%
          quote[\u03b1]         942µs ± 1%
          quote[\u65e5]         595µs ± 0%
          quote[\U0001f64f]     274µs ± 1%
          stdquote             2.70µs ± 0%
          unquote[a]            696µs ± 1%
          unquote[\u03b1]       763µs ± 0%
          unquote[\u65e5]       474µs ± 1%
          unquote[\U0001f64f]   187µs ± 0%
          stdunquote            808ns ± 0%
      δ(py2, py3) for the reference:
          name                 py2 time/op  py3 time/op  delta
          quote[a]              910µs ± 0%   579µs ± 1%   -36.42%  (p=0.008 n=5+5)
          quote[\u03b1]        1.23ms ± 0%  0.94ms ± 1%   -23.17%  (p=0.008 n=5+5)
          quote[\u65e5]         800µs ± 0%   595µs ± 0%   -25.63%  (p=0.016 n=4+5)
          quote[\U0001f64f]    1.06ms ± 1%  0.27ms ± 1%   -74.23%  (p=0.008 n=5+5)
          stdquote             1.17µs ± 0%  2.70µs ± 0%  +129.71%  (p=0.008 n=5+5)
          unquote[a]           1.33ms ± 1%  0.70ms ± 1%   -47.71%  (p=0.008 n=5+5)
          unquote[\u03b1]       952µs ± 2%   763µs ± 0%   -19.82%  (p=0.008 n=5+5)
          unquote[\u65e5]       613µs ± 2%   474µs ± 1%   -22.76%  (p=0.008 n=5+5)
          unquote[\U0001f64f]  3.62ms ± 1%  0.19ms ± 0%   -94.84%  (p=0.016 n=5+4)
          stdunquote            788ns ± 0%   808ns ± 0%    +2.59%  (p=0.016 n=4+5)
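The kind of comparison behind these tables can be approximated with timeit. `toy_quote` below is a hypothetical pure-Python quoting loop written for this sketch; it is not pygolang's strconv.quote and the timings it prints are not comparable with the figures above:

```python
import timeit


def toy_quote(s):
    """Hypothetical per-character quoting loop, for illustration only."""
    out = ['"']
    for ch in s:
        cp = ord(ch)
        if ch in '"\\':
            out.append("\\" + ch)           # escape quote and backslash
        elif ch.isprintable():
            out.append(ch)                  # printable characters pass through
        elif cp > 0xFFFF:
            out.append("\\U%08x" % cp)
        else:
            out.append("\\u%04x" % cp)
    out.append('"')
    return "".join(out)


text = "a" * 10000
t_toy = timeit.timeit(lambda: toy_quote(text), number=10)
t_std = timeit.timeit(lambda: repr(text), number=10)   # builtin escaping
print("toy_quote: %.6fs   repr: %.6fs" % (t_toy, t_std))
```

A per-character Python loop of this shape is the sort of construct that moving the code to Cython with typed C-level objects is meant to avoid.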