Commit 23f0a47c authored by Kirill Smelkov

strconv: Add benchmarks for quote and unquote

These functions are currently relatively slow. They were initially used
in zodbdump and zodbrestore, where their speed did not matter much, but
with bstr and ustr, where e.g. quote is used in repr, their being much
slower than builtin string escaping starts to be an issue.
Tatuya Kamada reports at nexedi/pygolang!21 (comment 170833):

    ### 3. `u` seems slow with large arrays especially when `repr` it

    I have faced a slowness while testing `u`, `b` with python 2.7, especially with `repr`.

    ```python
    >>> timeit.timeit("from golang import b,u; u('あ'*199998)", number=10)
    2.02020001411438
    >>> timeit.timeit("from golang import b,u; repr(u('あ'*199998))", number=10)
    54.60263395309448
    ```

    `bytes`(str) is very fast.

    ```python
    >>> timeit.timeit("from golang import b,u; bytes('あ'*199998)", number=10)
    0.000392913818359375
    >>> timeit.timeit("from golang import b,u; repr(bytes('あ'*199998))", number=10)
    0.4604980945587158
    ```

    `b` is much faster than `u`, but still the repr seems slow.

    ```
    >>> timeit.timeit("from golang import b,u; b('あ'*199998)", number=10)
    0.0009968280792236328
    >>> timeit.timeit("from golang import b,u; repr(b('あ'*199998))", number=10)
    25.498882055282593
    ```

The "repr" part of this problem arises because both bstr.__repr__ and
ustr.__repr__ use custom quoting routines, which are currently implemented in
pure python in the strconv module:

https://lab.nexedi.com/kirr/pygolang/blob/300d7dfa/golang/_golang_str.pyx#L282-291
https://lab.nexedi.com/kirr/pygolang/blob/300d7dfa/golang/_golang_str.pyx#L582-591
https://lab.nexedi.com/kirr/pygolang/blob/300d7dfa/golang/_golang_str.pyx#L941-970
https://lab.nexedi.com/kirr/pygolang/blob/300d7dfa/golang/strconv.py#L31-92
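
To illustrate why a pure-python quoting loop loses to builtin escaping, here
is a minimal sketch. The py_quote function below is hypothetical and is not
the actual strconv.quote code: it only shows the general shape of a
byte-at-a-time loop executed by the interpreter, compared against repr, which
does the whole job in C. Absolute timings depend on the machine and the
python version.

```python
import timeit

def py_quote(data):
    """Hypothetical byte-at-a-time quoting loop (NOT the real strconv.quote)."""
    out = ['"']
    for c in bytearray(data):       # iterate over integer byte values
        if c == 0x22:               # double quote
            out.append('\\"')
        elif c == 0x5c:             # backslash
            out.append('\\\\')
        elif c == 0x0a:             # newline
            out.append('\\n')
        elif 0x20 <= c < 0x7f:      # printable ascii is emitted as is
            out.append(chr(c))
        else:                       # everything else as \xNN
            out.append('\\x%02x' % c)
    out.append('"')
    return ''.join(out)

s = b'a' * 1000                     # same shape as the stdquote input
n = 200
t_py  = timeit.timeit(lambda: py_quote(s), number=n) / n
t_std = timeit.timeit(lambda: repr(s), number=n) / n
print('py_quote: %.2f us/op   repr: %.2f us/op' % (t_py * 1e6, t_std * 1e6))
```

Every iteration of the loop goes through interpreter dispatch and creates
python objects, which is the kind of overhead a Cython rewrite is meant to
remove.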

The fix would be to move strconv.py to Cython and to correspondingly rework it
to avoid using python-level constructs during quoting internally.

Working on that was not a priority, but soon I will need to move strconv to
Cython for another reason: to be able to break the import cycle between _golang
and strconv.

So it makes sense to add a strconv benchmark first, since we'll start moving it
to Cython anyway, to see where we are and how further changes will help
performance-wise.

Currently, on py2, we are at

    name                 time/op
    quote[a]              910µs ± 0%
    quote[\u03b1]        1.23ms ± 0%
    quote[\u65e5]         800µs ± 0%
    quote[\U0001f64f]    1.06ms ± 1%
    stdquote             1.17µs ± 0%
    unquote[a]           1.33ms ± 1%
    unquote[\u03b1]       952µs ± 2%
    unquote[\u65e5]       613µs ± 2%
    unquote[\U0001f64f]  3.62ms ± 1%
    stdunquote            788ns ± 0%

i.e. on py2 quoting is ~ 1000x slower than builtin string escaping, and unquoting is
even slower.
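
For reference, the ratios behind that estimate can be recomputed from the py2
table above. This is only a rough sketch: the stdquote and stdunquote
baselines were measured on ascii input, so using them for the non-ascii rows
is an approximation. With these numbers quoting comes out between about 680x
and 1050x slower, and unquoting between about 780x and 4600x slower.

```python
# time/op in microseconds, copied from the py2 table above
py2_us = {
    r'quote[a]':            910.0,
    r'quote[\u03b1]':       1230.0,
    r'quote[\u65e5]':       800.0,
    r'quote[\U0001f64f]':   1060.0,
    r'unquote[a]':          1330.0,
    r'unquote[\u03b1]':     952.0,
    r'unquote[\u65e5]':     613.0,
    r'unquote[\U0001f64f]': 3620.0,
}
stdquote_us   = 1.17    # builtin escaping baseline
stdunquote_us = 0.788   # builtin unescaping baseline

ratios = {}
for name, t in sorted(py2_us.items()):
    base = stdunquote_us if name.startswith('unquote') else stdquote_us
    ratios[name] = t / base
    print('%-22s %7.0fx slower than builtin' % (name, ratios[name]))
```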

On py3 the situation is better, but still not good:

    name                 time/op
    quote[a]              579µs ± 1%
    quote[\u03b1]         942µs ± 1%
    quote[\u65e5]         595µs ± 0%
    quote[\U0001f64f]     274µs ± 1%
    stdquote             2.70µs ± 0%
    unquote[a]            696µs ± 1%
    unquote[\u03b1]       763µs ± 0%
    unquote[\u65e5]       474µs ± 1%
    unquote[\U0001f64f]   187µs ± 0%
    stdunquote            808ns ± 0%

δ(py2, py3) for reference:

    name                 py2 time/op  py3 time/op  delta
    quote[a]              910µs ± 0%   579µs ± 1%   -36.42%  (p=0.008 n=5+5)
    quote[\u03b1]        1.23ms ± 0%  0.94ms ± 1%   -23.17%  (p=0.008 n=5+5)
    quote[\u65e5]         800µs ± 0%   595µs ± 0%   -25.63%  (p=0.016 n=4+5)
    quote[\U0001f64f]    1.06ms ± 1%  0.27ms ± 1%   -74.23%  (p=0.008 n=5+5)
    stdquote             1.17µs ± 0%  2.70µs ± 0%  +129.71%  (p=0.008 n=5+5)
    unquote[a]           1.33ms ± 1%  0.70ms ± 1%   -47.71%  (p=0.008 n=5+5)
    unquote[\u03b1]       952µs ± 2%   763µs ± 0%   -19.82%  (p=0.008 n=5+5)
    unquote[\u65e5]       613µs ± 2%   474µs ± 1%   -22.76%  (p=0.008 n=5+5)
    unquote[\U0001f64f]  3.62ms ± 1%  0.19ms ± 0%   -94.84%  (p=0.016 n=5+4)
    stdunquote            788ns ± 0%   808ns ± 0%    +2.59%  (p=0.016 n=4+5)
parent e27197ce
 # -*- coding: utf-8 -*-
-# Copyright (C) 2018-2022 Nexedi SA and Contributors.
+# Copyright (C) 2018-2023 Nexedi SA and Contributors.
 # Kirill Smelkov <kirr@nexedi.com>
 #
 # This program is free software: you can Use, Study, Modify and Redistribute
@@ -26,7 +26,10 @@ from golang.gcompat import qq

 from six import int2byte as bchr
 from six.moves import range as xrange
-from pytest import raises
+from pytest import raises, mark
+
+import codecs
+

 def byterange(start, stop):
     b = b""
@@ -138,3 +141,52 @@ def test_unquote_bad():
         with raises(ValueError) as exc:
             unquote(tin)
         assert exc.value.args == (err,)
+
+
+# ---- benchmarks ----
+
+# quoting + unquoting
+uchar_testv = ['a',             # ascii
+               u'α',            # 2-bytes utf8
+               u'\u65e5',       # 3-bytes utf8
+               u'\U0001f64f']   # 4-bytes utf8
+
+@mark.parametrize('ch', uchar_testv)
+def bench_quote(b, ch):
+    s = bstr_ch1000(ch)
+    q = quote
+    for i in xrange(b.N):
+        q(s)
+
+def bench_stdquote(b):
+    s = b'a'*1000
+    q = repr
+    for i in xrange(b.N):
+        q(s)
+
+
+@mark.parametrize('ch', uchar_testv)
+def bench_unquote(b, ch):
+    s = bstr_ch1000(ch)
+    s = quote(s)
+    unq = unquote
+    for i in xrange(b.N):
+        unq(s)
+
+def bench_stdunquote(b):
+    s = b'"' + b'a'*1000 + b'"'
+    escape_decode = codecs.escape_decode
+    def unq(s): return escape_decode(s[1:-1])[0]
+    for i in xrange(b.N):
+        unq(s)
+
+
+# bstr_ch1000 returns bstr with many repetitions of character ch occupying ~ 1000 bytes.
+def bstr_ch1000(ch): # -> bstr
+    assert len(ch) == 1
+    s = bstr(ch)
+    s = s * (1000 // len(s))
+    if len(s) % 3 == 0:
+        s += 'x'
+    assert len(s) == 1000
+    return s
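
The padding step in bstr_ch1000 exists because 1000 is not divisible by 3:
repeating a 3-byte UTF-8 character 333 times gives 999 bytes, so one 'x' is
appended, while the 1-, 2- and 4-byte cases land on 1000 exactly. The sketch
below reproduces that sizing with plain bytes instead of bstr. The name
utf8_ch1000 is hypothetical, and the padding is generalized to "fill whatever
remains" rather than the `% 3` check used in the diff.

```python
def utf8_ch1000(ch):
    """Plain-bytes analogue of bstr_ch1000 (hypothetical helper)."""
    enc = ch.encode('utf-8')            # 1..4 bytes per character
    s = enc * (1000 // len(enc))        # as many whole characters as fit
    s += b'x' * (1000 - len(s))         # pad the remainder, if any
    assert len(s) == 1000
    return s

for ch in [u'a', u'\u03b1', u'\u65e5', u'\U0001f64f']:
    s = utf8_ch1000(ch)
    nbytes = len(ch.encode('utf-8'))
    npad = len(s) - nbytes * (1000 // nbytes)
    print('%d-byte char: %d repetitions, %d padding byte(s)'
          % (nbytes, 1000 // nbytes, npad))
```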