Commit 598eb479 authored by Kirill Smelkov

golang_str,strconv: Fix decoding of rune-error

The error rune (U+FFFD) is returned by _utf8_decode_rune to indicate a
decoding error. But the error rune itself is a valid Unicode codepoint:

   >>> x = u"�"
   >>> x
   u'\ufffd'
   >>> x.encode('utf-8')
   '\xef\xbf\xbd'

Thus the caller should treat only (r=_rune_error, size=1) as a UTF-8
decoding error; a validly encoded U+FFFD decodes with size=3.
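
The distinction can be illustrated with a minimal Python 3 sketch. It is not
the pygolang implementation: `decode_rune` and `RUNE_ERROR` are hypothetical
stand-ins for `_utf8_decode_rune` and `_rune_error`, and the decoder simply
leans on Python's strict UTF-8 codec.

```python
RUNE_ERROR = 0xFFFD  # U+FFFD doubles as the "decoding failed" marker


def decode_rune(b: bytes):
    """Hypothetical stand-in for _utf8_decode_rune: return (rune, size)."""
    if not b:
        return RUNE_ERROR, 0
    # Try prefixes of 1..4 bytes; the first one that decodes is one character.
    for size in range(1, min(4, len(b)) + 1):
        try:
            ch = b[:size].decode("utf-8")
        except UnicodeDecodeError:
            continue
        return ord(ch), size
    # No prefix is valid UTF-8: report an error that consumes one byte.
    return RUNE_ERROR, 1


print(decode_rune(b"\xef\xbf\xbd"))  # (65533, 3): valid encoding of U+FFFD
print(decode_rune(b"\xef"))          # (65533, 1): truncated sequence, an error
```

Both calls return the same rune, so only the size tells the caller whether the
input was broken.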

But e.g. strconv.quote did not also inspect the size, and so it quoted �
as just "\xef" instead of "\xef\xbf\xbd".
_utf8_decode_surrogateescape had a similar bug.
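
For the surrogateescape side, the intended behaviour can be compared against
Python 3's built-in `surrogateescape` error handler. This is only a reference
point and makes no claim about how `_utf8_decode_surrogateescape` is written:
a valid U+FFFD must decode to itself, while a genuinely broken byte maps to
the lone surrogate `0xdc00 + byte`.

```python
# Valid 3-byte encoding of U+FFFD: decoded as the character itself.
valid = b"\xef\xbf\xbd".decode("utf-8", "surrogateescape")

# Truncated sequence: the stray byte 0xef is escaped as U+DCEF.
broken = b"\xef".decode("utf-8", "surrogateescape")

print(valid == "\ufffd")                             # True
print(broken == chr(0xDC00 + 0xEF))                  # True
print(broken.encode("utf-8", "surrogateescape"))     # b'\xef' (round-trips)
```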

-> Fix it.

Without the fix, e.g. the added test for strconv.quote fails with:

    >           assert quote(tin) == tquoted
    E           assert '"\xef"' == '"�"'
    E             - "\xef"
    E             + "�"
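
The failure can be reproduced with a self-contained Python 3 sketch. All names
here (`decode_rune`, `quote_bytes`, the `check_size` switch) are hypothetical,
and the quoting is deliberately simplified: it only escapes undecodable bytes
and does not handle quotes, backslashes or non-printable characters as the real
`_quote` does.

```python
RUNE_ERROR = 0xFFFD


def decode_rune(b: bytes):
    """Hypothetical decoder: (rune, size), or (RUNE_ERROR, 1) on bad input."""
    for size in range(1, min(4, len(b)) + 1):
        try:
            return ord(b[:size].decode("utf-8")), size
        except UnicodeDecodeError:
            continue
    return RUNE_ERROR, 1


def quote_bytes(s: bytes, check_size: bool = True) -> bytes:
    """Simplified quoting; check_size=False mimics the pre-fix condition."""
    out = bytearray(b'"')
    i = 0
    while i < len(s):
        r, size = decode_rune(s[i:])
        if check_size:
            is_error = (r == RUNE_ERROR and size == 1)   # fixed condition
        else:
            is_error = (r == RUNE_ERROR)                 # pre-fix condition
        if is_error:
            out += b"\\x%02x" % s[i]   # emit the raw byte as an escape
        else:
            out += s[i:i + size]       # valid character goes as is
        i += size
    out += b'"'
    return bytes(out)


tin = "\ufffd".encode("utf-8")
print(quote_bytes(tin, check_size=False))  # b'"\\xef"': two bytes are lost
print(quote_bytes(tin))                    # b'"\xef\xbf\xbd"': kept intact
```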

/reviewed-by @jerome
/reviewed-at nexedi/pygolang!18
parent ea5abe71
Pipeline #23895 passed with stage
in 0 seconds
@@ -242,7 +242,7 @@ def _utf8_decode_surrogateescape(s): # -> unicode
     while len(s) > 0:
         r, width = _utf8_decode_rune(s)
-        if r == _rune_error:
+        if r == _rune_error and width == 1:
             b = ord(s[0])
             assert 0x80 <= b <= 0xff
             emit(unichr(0xdc00 + b))

@@ -75,6 +75,9 @@ def test_strings():
         # some characters with U >= 0x10000
         (b'\xf0\x9f\x99\x8f', u'\U0001f64f'), # 🙏
         (b'\xf0\x9f\x9a\x80', u'\U0001f680'), # 🚀
+        # invalid rune
+        (b'\xef\xbf\xbd', u'�'),
     )
     for tbytes, tunicode in testv:

@@ -83,7 +83,7 @@ def _quote(s):
         isize = i + size
         # decode error - just emit raw byte as escaped
-        if r == _rune_error:
+        if r == _rune_error and size == 1:
             emit(br'\x%02x' % ord(c))
         # printable utf-8 characters go as is

 # -*- coding: utf-8 -*-
-# Copyright (C) 2018-2021 Nexedi SA and Contributors.
+# Copyright (C) 2018-2022 Nexedi SA and Contributors.
 # Kirill Smelkov <kirr@nexedi.com>
 #
 # This program is free software: you can Use, Study, Modify and Redistribute
@@ -67,6 +67,9 @@ def test_quote():
         # non-printable utf-8
         (u"\u007f\u0080\u0081\u0082\u0083\u0084\u0085\u0086\u0087", u"\\x7f\\xc2\\x80\\xc2\\x81\\xc2\\x82\\xc2\\x83\\xc2\\x84\\xc2\\x85\\xc2\\x86\\xc2\\x87"),
+        # invalid rune
+        (u'\ufffd', u'�'),
     )
     for tin, tquoted in testv: