strconv: Add benchmarks for quote and unquote

This functions are currently relatively slow. They were initially used in zodbdump and zodbrestore, where their speed did not matter much, but with bstr and ustr, since e.g. quote is used in repr, not having them to perform with speed similar to builtin string escaping starts to be an issue. Tatuya Kamada reports at nexedi/pygolang!21 (comment 170833) : ### 3. `u` seems slow with large arrays especially when `repr` it I have faced a slowness while testing `u`, `b` with python 2.7, especially with `repr`. ```python >>> timeit.timeit("from golang import b,u; u('あ'*199998)", number=10) 2.02020001411438 >>> timeit.timeit("from golang import b,u; repr(u('あ'*199998))", number=10) 54.60263395309448 ``` `bytes`(str) is very fast. ```python >>> timeit.timeit("from golang import b,u; bytes('あ'*199998)", number=10) 0.000392913818359375 >>> timeit.timeit("from golang import b,u; repr(bytes('あ'*199998))", number=10) 0.4604980945587158 ``` `b` is much faster than `u`, but still the repr seems slow. ``` >>> timeit.timeit("from golang import b,u; b('あ'*199998)", number=10) 0.0009968280792236328 >>> timeit.timeit("from golang import b,u; repr(b('あ'*199998))", number=10) 25.498882055282593 ``` The "repr" part of this problem is due to that both bstr.__repr__ and ustr.__repr__ use custom quoting routines which currently are implemented in pure python in strconv module: https://lab.nexedi.com/kirr/pygolang/blob/300d7dfa/golang/_golang_str.pyx#L282-291 https://lab.nexedi.com/kirr/pygolang/blob/300d7dfa/golang/_golang_str.pyx#L582-591 https://lab.nexedi.com/kirr/pygolang/blob/300d7dfa/golang/_golang_str.pyx#L941-970 https://lab.nexedi.com/kirr/pygolang/blob/300d7dfa/golang/strconv.py#L31-92 The fix would be to move strconv.py to Cython and to correspondingly rework it to avoid using python-level constructs during quoting internally. Working on that was not a priority, but soon I will need to move strconv to Cython for another reason: to be able to break import cycle in between _golang and strconv. So it makes sense to add strconv benchmark first - since we'll start moving it to Cython anyway - to see where we are and how further changes will help performance-wise. Currently we are at name time/op quote[a] 910µs ± 0% quote[\u03b1] 1.23ms ± 0% quote[\u65e5] 800µs ± 0% quote[\U0001f64f] 1.06ms ± 1% stdquote 1.17µs ± 0% unquote[a] 1.33ms ± 1% unquote[\u03b1] 952µs ± 2% unquote[\u65e5] 613µs ± 2% unquote[\U0001f64f] 3.62ms ± 1% stdunquote 788ns ± 0% i.e. on py2 quoting is ~ 1000x slower than builtin string escaping, and unquoting is even slower. on py3 the situation is better, but still not good: name time/op quote[a] 579µs ± 1% quote[\u03b1] 942µs ± 1% quote[\u65e5] 595µs ± 0% quote[\U0001f64f] 274µs ± 1% stdquote 2.70µs ± 0% unquote[a] 696µs ± 1% unquote[\u03b1] 763µs ± 0% unquote[\u65e5] 474µs ± 1% unquote[\U0001f64f] 187µs ± 0% stdunquote 808ns ± 0% δ(py2, py3) for the reference: name py2 time/op py3 time/op delta quote[a] 910µs ± 0% 579µs ± 1% -36.42% (p=0.008 n=5+5) quote[\u03b1] 1.23ms ± 0% 0.94ms ± 1% -23.17% (p=0.008 n=5+5) quote[\u65e5] 800µs ± 0% 595µs ± 0% -25.63% (p=0.016 n=4+5) quote[\U0001f64f] 1.06ms ± 1% 0.27ms ± 1% -74.23% (p=0.008 n=5+5) stdquote 1.17µs ± 0% 2.70µs ± 0% +129.71% (p=0.008 n=5+5) unquote[a] 1.33ms ± 1% 0.70ms ± 1% -47.71% (p=0.008 n=5+5) unquote[\u03b1] 952µs ± 2% 763µs ± 0% -19.82% (p=0.008 n=5+5) unquote[\u65e5] 613µs ± 2% 474µs ± 1% -22.76% (p=0.008 n=5+5) unquote[\U0001f64f] 3.62ms ± 1% 0.19ms ± 0% -94.84% (p=0.016 n=5+4) stdunquote 788ns ± 0% 808ns ± 0% +2.59% (p=0.016 n=4+5)

strconv: Add benchmarks for quote and unquote
This functions are currently relatively slow. They were initially used in zodbdump and zodbrestore, where their speed did not matter much, but with bstr and ustr, since e.g. quote is used in repr, not having them to perform with speed similar to builtin string escaping starts to be an issue. Tatuya Kamada reports at nexedi/pygolang!21 (comment 170833) : ### 3. `u` seems slow with large arrays especially when `repr` it I have faced a slowness while testing `u`, `b` with python 2.7, especially with `repr`. ```python >>> timeit.timeit("from golang import b,u; u('あ'*199998)", number=10) 2.02020001411438 >>> timeit.timeit("from golang import b,u; repr(u('あ'*199998))", number=10) 54.60263395309448 ``` `bytes`(str) is very fast. ```python >>> timeit.timeit("from golang import b,u; bytes('あ'*199998)", number=10) 0.000392913818359375 >>> timeit.timeit("from golang import b,u; repr(bytes('あ'*199998))", number=10) 0.4604980945587158 ``` `b` is much faster than `u`, but still the repr seems slow. ``` >>> timeit.timeit("from golang import b,u; b('あ'*199998)", number=10) 0.0009968280792236328 >>> timeit.timeit("from golang import b,u; repr(b('あ'*199998))", number=10) 25.498882055282593 ``` The "repr" part of this problem is due to that both bstr.__repr__ and ustr.__repr__ use custom quoting routines which currently are implemented in pure python in strconv module: https://lab.nexedi.com/kirr/pygolang/blob/300d7dfa/golang/_golang_str.pyx#L282-291 https://lab.nexedi.com/kirr/pygolang/blob/300d7dfa/golang/_golang_str.pyx#L582-591 https://lab.nexedi.com/kirr/pygolang/blob/300d7dfa/golang/_golang_str.pyx#L941-970 https://lab.nexedi.com/kirr/pygolang/blob/300d7dfa/golang/strconv.py#L31-92 The fix would be to move strconv.py to Cython and to correspondingly rework it to avoid using python-level constructs during quoting internally. Working on that was not a priority, but soon I will need to move strconv to Cython for another reason: to be able to break import cycle in between _golang and strconv. So it makes sense to add strconv benchmark first - since we'll start moving it to Cython anyway - to see where we are and how further changes will help performance-wise. Currently we are at name time/op quote[a] 910µs ± 0% quote[\u03b1] 1.23ms ± 0% quote[\u65e5] 800µs ± 0% quote[\U0001f64f] 1.06ms ± 1% stdquote 1.17µs ± 0% unquote[a] 1.33ms ± 1% unquote[\u03b1] 952µs ± 2% unquote[\u65e5] 613µs ± 2% unquote[\U0001f64f] 3.62ms ± 1% stdunquote 788ns ± 0% i.e. on py2 quoting is ~ 1000x slower than builtin string escaping, and unquoting is even slower. on py3 the situation is better, but still not good: name time/op quote[a] 579µs ± 1% quote[\u03b1] 942µs ± 1% quote[\u65e5] 595µs ± 0% quote[\U0001f64f] 274µs ± 1% stdquote 2.70µs ± 0% unquote[a] 696µs ± 1% unquote[\u03b1] 763µs ± 0% unquote[\u65e5] 474µs ± 1% unquote[\U0001f64f] 187µs ± 0% stdunquote 808ns ± 0% δ(py2, py3) for the reference: name py2 time/op py3 time/op delta quote[a] 910µs ± 0% 579µs ± 1% -36.42% (p=0.008 n=5+5) quote[\u03b1] 1.23ms ± 0% 0.94ms ± 1% -23.17% (p=0.008 n=5+5) quote[\u65e5] 800µs ± 0% 595µs ± 0% -25.63% (p=0.016 n=4+5) quote[\U0001f64f] 1.06ms ± 1% 0.27ms ± 1% -74.23% (p=0.008 n=5+5) stdquote 1.17µs ± 0% 2.70µs ± 0% +129.71% (p=0.008 n=5+5) unquote[a] 1.33ms ± 1% 0.70ms ± 1% -47.71% (p=0.008 n=5+5) unquote[\u03b1] 952µs ± 2% 763µs ± 0% -19.82% (p=0.008 n=5+5) unquote[\u65e5] 613µs ± 2% 474µs ± 1% -22.76% (p=0.008 n=5+5) unquote[\U0001f64f] 3.62ms ± 1% 0.19ms ± 0% -94.84% (p=0.016 n=5+4) stdunquote 788ns ± 0% 808ns ± 0% +2.59% (p=0.016 n=4+5)
90f0e0ff · Kirill Smelkov · baf84437 · 90f0e0ff
Commit 90f0e0ff authored Jun 23, 2023 by Kirill Smelkov
Show whitespace changes
Inline Side-by-side

Showing with 54 additions and 2 deletions

golang/strconv_test.py golang/strconv_test.py +54 -2

No files found.
--- a/golang/strconv_test.py
+++ b/golang/strconv_test.py
 # -*- coding: utf-8 -*-
-# Copyright (C) 2018-2022  Nexedi SA and Contributors.
+# Copyright (C) 2018-2023  Nexedi SA and Contributors.
 #                          Kirill Smelkov <kirr@nexedi.com>
 #
 # This program is free software: you can Use, Study, Modify and Redistribute
@@ -26,7 +26,10 @@ from golang.gcompat import qq

 from six import int2byte as bchr
 from six.moves import range as xrange
-from pytest import raises
+from pytest import raises, mark
+
+import codecs
+

 def byterange(start, stop):
    b = b""
@@ -138,3 +141,52 @@ def test_unquote_bad():
        with raises(ValueError) as exc:
            unquote(tin)
        assert exc.value.args == (err,)
+
+
+# ---- benchmarks ----
+
+# quoting + unquoting
+uchar_testv = ['a',               # ascii
+               u'α',              # 2-bytes utf8
+               u'\u65e5',         # 3-bytes utf8
+               u'\U0001f64f']     # 4-bytes utf8
+
+@mark.parametrize('ch', uchar_testv)
+def bench_quote(b, ch):
+    s = bstr_ch1000(ch)
+    q = quote
+    for i in xrange(b.N):
+        q(s)
+
+def bench_stdquote(b):
+    s = b'a'*1000
+    q = repr
+    for i in xrange(b.N):
+        q(s)
+
+
+@mark.parametrize('ch', uchar_testv)
+def bench_unquote(b, ch):
+    s = bstr_ch1000(ch)
+    s = quote(s)
+    unq = unquote
+    for i in xrange(b.N):
+        unq(s)
+
+def bench_stdunquote(b):
+    s = b'"' + b'a'*1000 + b'"'
+    escape_decode = codecs.escape_decode
+    def unq(s): return escape_decode(s[1:-1])[0]
+    for i in xrange(b.N):
+        unq(s)
+
+
+# bstr_ch1000 returns bstr with many repetitions of character ch occupying ~ 1000 bytes.
+def bstr_ch1000(ch): # -> bstr
+    assert len(ch) == 1
+    s = bstr(ch)
+    s = s * (1000 // len(s))
+    if len(s) % 3 == 0:
+        s += 'x'
+    assert len(s) == 1000
+    return s