Commit 07985ef3 authored Jan 25, 2015 by Serhiy Storchaka
Issue #22286: The "backslashreplace" error handlers now works with
decoding and translating.
parent 58f02019
Showing 10 changed files with 196 additions and 83 deletions
Doc/howto/unicode.rst +5 -2
Doc/library/codecs.rst +9 -5
Doc/library/functions.rst +2 -3
Doc/library/io.rst +6 -5
Doc/whatsnew/3.5.rst +3 -1
Lib/codecs.py +6 -3
Lib/test/test_codeccallbacks.py +15 -11
Lib/test/test_codecs.py +56 -0
Misc/NEWS +3 -0
Python/codecs.c +91 -53
Doc/howto/unicode.rst
@@ -280,8 +280,9 @@ and optionally an *errors* argument.
 The *errors* argument specifies the response when the input string can't be
 converted according to the encoding's rules. Legal values for this argument are
 ``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
-``U+FFFD``, ``REPLACEMENT CHARACTER``), or ``'ignore'`` (just leave the
-character out of the Unicode result).
+``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
+character out of the Unicode result), or ``'backslashreplace'`` (inserts a
+``\xNN`` escape sequence).
 The following examples show the differences::
 >>> b'\x80abc'.decode("utf-8", "strict") #doctest: +NORMALIZE_WHITESPACE
@@ -291,6 +292,8 @@ The following examples show the differences::
 invalid start byte
 >>> b'\x80abc'.decode("utf-8", "replace")
 '\ufffdabc'
+>>> b'\x80abc'.decode("utf-8", "backslashreplace")
+'\\x80abc'
 >>> b'\x80abc'.decode("utf-8", "ignore")
 'abc'
Doc/library/codecs.rst
@@ -314,8 +314,8 @@ The following error handlers are only applicable to
 |                         | reference (only for encoding). Implemented    |
 |                         | in :func:`xmlcharrefreplace_errors`.          |
 +-------------------------+-----------------------------------------------+
-| ``'backslashreplace'``  | Replace with backslashed escape sequences     |
-|                         | (only for encoding). Implemented in           |
+| ``'backslashreplace'``  | Replace with backslashed escape sequences.    |
+|                         | Implemented in                                |
 |                         | :func:`backslashreplace_errors`.              |
 +-------------------------+-----------------------------------------------+
 | ``'namereplace'``       | Replace with ``\N{...}`` escape sequences     |
@@ -350,6 +350,10 @@ In addition, the following error handler is specific to the given codecs:
 .. versionadded:: 3.5
 The ``'namereplace'`` error handler.
+.. versionchanged:: 3.5
+The ``'backslashreplace'`` error handlers now works with decoding and
+translating.
 The set of allowed values can be extended by registering a new named error
 handler:
@@ -417,9 +421,9 @@ functions:
 .. function:: backslashreplace_errors(exception)
-Implements the ``'backslashreplace'`` error handling (for encoding with
-:term:`text encodings <text encoding>` only): the unencodable character is
-replaced by a backslashed escape sequence.
+Implements the ``'backslashreplace'`` error handling (for
+:term:`text encodings <text encoding>` only): malformed data is
+replaced by a backslashed escape sequence.
 .. function:: namereplace_errors(exception)
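The documentation change above can be checked by calling the handler directly with each of the three exception types. This is a minimal sketch, assuming Python 3.5 or later; the variable names are illustrative, and the inputs mirror the assertions in Lib/test/test_codeccallbacks.py from this commit.

```python
import codecs

# Decoding: the offending byte is replaced by a \xNN escape.
dec = codecs.backslashreplace_errors(
    UnicodeDecodeError("ascii", bytearray(b"\xff"), 0, 1, "ouch"))

# Encoding: already supported before this commit.
enc = codecs.backslashreplace_errors(
    UnicodeEncodeError("ascii", "\u3042", 0, 1, "ouch"))

# Translating: the offending character is replaced by a \uNNNN escape.
tra = codecs.backslashreplace_errors(
    UnicodeTranslateError("\u3042", 0, 1, "ouch"))

# Each result is a (replacement, resume_position) tuple.
print(dec, enc, tra)
```

Before this commit, the first and third calls raised `TypeError`, which is what the removed assertions in the test file checked.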
Doc/library/functions.rst
@@ -973,9 +973,8 @@ are always available. They are listed here in alphabetical order.
 Characters not supported by the encoding are replaced with the
 appropriate XML character reference ``&#nnn;``.
-* ``'backslashreplace'`` (also only supported when writing)
-replaces unsupported characters with Python's backslashed escape
-sequences.
+* ``'backslashreplace'`` replaces malformed data by Python's backslashed
+escape sequences.
 * ``'namereplace'`` (also only supported when writing)
 replaces unsupported characters with ``\N{...}`` escape sequences.
Doc/library/io.rst
@@ -825,11 +825,12 @@ Text I/O
 exception if there is an encoding error (the default of ``None`` has the same
 effect), or pass ``'ignore'`` to ignore errors. (Note that ignoring encoding
 errors can lead to data loss.) ``'replace'`` causes a replacement marker
-(such as ``'?'``) to be inserted where there is malformed data. When
-writing, ``'xmlcharrefreplace'`` (replace with the appropriate XML character
-reference), ``'backslashreplace'`` (replace with backslashed escape
-sequences) or ``'namereplace'`` (replace with ``\N{...}`` escape sequences)
-can be used. Any other error handling name that has been registered with
+(such as ``'?'``) to be inserted where there is malformed data.
+``'backslashreplace'`` causes malformed data to be replaced by a
+backslashed escape sequence. When writing, ``'xmlcharrefreplace'``
+(replace with the appropriate XML character reference) or ``'namereplace'``
+(replace with ``\N{...}`` escape sequences) can be used. Any other error
+handling name that has been registered with
 :func:`codecs.register_error` is also valid.
 .. index::
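The reading side of the revised `io` text can be illustrated with an in-memory stream. This is a minimal sketch, assuming Python 3.5 or later; the sample bytes and the `raw`/`reader` names are made up for the example.

```python
import io

# 0xE9 is not a valid ASCII byte, so the decoder reports an error for it.
raw = io.BytesIO(b"caf\xe9 ok\n")

# With errors="backslashreplace" the bad byte is kept as a visible escape
# instead of raising UnicodeDecodeError or being silently dropped.
reader = io.TextIOWrapper(raw, encoding="ascii", errors="backslashreplace")
text = reader.read()

print(text)
```

Compared with `'ignore'` or `'replace'`, the original byte value stays recoverable from the escape sequence in the decoded text.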
Doc/whatsnew/3.5.rst
@@ -118,7 +118,9 @@ Other Language Changes
 Some smaller changes made to the core Python language are:
-* None yet.
+* Added the ``'namereplace'`` error handlers. The ``'backslashreplace'``
+error handlers now works with decoding and translating.
+(Contributed by Serhiy Storchaka in :issue:`19676` and :issue:`22286`.)
Lib/codecs.py
@@ -127,7 +127,8 @@ class Codec:
 'surrogateescape' - replace with private code points U+DCnn.
 'xmlcharrefreplace' - Replace with the appropriate XML
 character reference (only for encoding).
-'backslashreplace' - Replace with backslashed escape sequences
+'backslashreplace' - Replace with backslashed escape sequences.
+'namereplace' - Replace with \\N{...} escape sequences
 (only for encoding).
 The set of allowed values can be extended via register_error.
@@ -359,7 +360,8 @@ class StreamWriter(Codec):
 'xmlcharrefreplace' - Replace with the appropriate XML
 character reference.
 'backslashreplace' - Replace with backslashed escape
-sequences (only for encoding).
+sequences.
+'namereplace' - Replace with \\N{...} escape sequences.
 The set of allowed parameter values can be extended via
 register_error.
@@ -429,7 +431,8 @@ class StreamReader(Codec):
 'strict' - raise a ValueError (or a subclass)
 'ignore' - ignore the character and continue with the next
-'replace'- replace with a suitable replacement character;
+'replace'- replace with a suitable replacement character
+'backslashreplace' - Replace with backslashed escape sequences;
 The set of allowed parameter values can be extended via
 register_error.
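The `StreamReader` docstring now lists `'backslashreplace'`, so a stream reader can be built with it. This is a minimal sketch, assuming Python 3.5 or later; the input bytes are the same as in the Doc/howto/unicode.rst example, and the variable names are illustrative.

```python
import codecs
import io

stream = io.BytesIO(b"\x80abc")  # 0x80 is an invalid UTF-8 start byte

# codecs.getreader() returns the StreamReader class for the codec;
# the errors argument selects the error handler by name.
reader = codecs.getreader("utf-8")(stream, errors="backslashreplace")
decoded = reader.read()

print(decoded)
```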
Lib/test/test_codeccallbacks.py
@@ -246,6 +246,11 @@ class CodecCallbackTest(unittest.TestCase):
             "\u0000\ufffd"
         )
+        self.assertEqual(
+            b"\x00\x00\x00\x00\x00".decode("unicode-internal", "backslashreplace"),
+            "\u0000\\x00"
+        )
         codecs.register_error("test.hui", handler_unicodeinternal)
         self.assertEqual(
@@ -565,17 +570,6 @@ class CodecCallbackTest(unittest.TestCase):
             codecs.backslashreplace_errors,
             UnicodeError("ouch")
         )
-        # "backslashreplace" can only be used for encoding
-        self.assertRaises(
-            TypeError,
-            codecs.backslashreplace_errors,
-            UnicodeDecodeError("ascii", bytearray(b"\xff"), 0, 1, "ouch")
-        )
-        self.assertRaises(
-            TypeError,
-            codecs.backslashreplace_errors,
-            UnicodeTranslateError("\u3042", 0, 1, "ouch")
-        )
         # Use the correct exception
         self.assertEqual(
             codecs.backslashreplace_errors(
@@ -701,6 +695,16 @@ class CodecCallbackTest(unittest.TestCase):
                 UnicodeEncodeError("ascii", "\udfff", 0, 1, "ouch")),
             ("\\udfff", 1)
         )
+        self.assertEqual(
+            codecs.backslashreplace_errors(
+                UnicodeDecodeError("ascii", bytearray(b"\xff"), 0, 1, "ouch")),
+            ("\\xff", 1)
+        )
+        self.assertEqual(
+            codecs.backslashreplace_errors(
+                UnicodeTranslateError("\u3042", 0, 1, "ouch")),
+            ("\\u3042", 1)
+        )
     def test_badhandlerresults(self):
         results = ( 42, "foo", (1,2,3), ("foo", 1, 3), ("foo", None), ("foo",), ("foo", 1, 3), ("foo", None), ("foo",) )
Lib/test/test_codecs.py
@@ -378,6 +378,10 @@ class ReadTest(MixInCheckStateHandling):
                              before + after)
             self.assertEqual(test_sequence.decode(self.encoding, "replace"),
                              before + self.ill_formed_sequence_replace + after)
+            backslashreplace = ''.join('\\x%02x' % b
+                                       for b in self.ill_formed_sequence)
+            self.assertEqual(test_sequence.decode(self.encoding, "backslashreplace"),
+                             before + backslashreplace + after)
 class UTF32Test(ReadTest, unittest.TestCase):
     encoding = "utf-32"
@@ -1300,14 +1304,19 @@ class UnicodeInternalTest(unittest.TestCase):
                               "unicode_internal")
         if sys.byteorder == "little":
             invalid = b"\x00\x00\x11\x00"
+            invalid_backslashreplace = r"\x00\x00\x11\x00"
         else:
             invalid = b"\x00\x11\x00\x00"
+            invalid_backslashreplace = r"\x00\x11\x00\x00"
         with support.check_warnings():
             self.assertRaises(UnicodeDecodeError,
                               invalid.decode, "unicode_internal")
         with support.check_warnings():
             self.assertEqual(invalid.decode("unicode_internal", "replace"),
                              '\ufffd')
+        with support.check_warnings():
+            self.assertEqual(invalid.decode("unicode_internal", "backslashreplace"),
+                             invalid_backslashreplace)
     @unittest.skipUnless(SIZEOF_WCHAR_T == 4, 'specific to 32-bit wchar_t')
     def test_decode_error_attributes(self):
@@ -2042,6 +2051,16 @@ class CharmapTest(unittest.TestCase):
             ("ab\ufffd", 3)
         )
+        self.assertEqual(
+            codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace", "ab"),
+            ("ab\\x02", 3)
+        )
+        self.assertEqual(
+            codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace", "ab\ufffe"),
+            ("ab\\x02", 3)
+        )
         self.assertEqual(
             codecs.charmap_decode(b"\x00\x01\x02", "ignore", "ab"),
             ("ab", 3)
@@ -2118,6 +2137,25 @@ class CharmapTest(unittest.TestCase):
             ("ab\ufffd", 3)
         )
+        self.assertEqual(
+            codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
+                                  {0: 'a', 1: 'b'}),
+            ("ab\\x02", 3)
+        )
+        self.assertEqual(
+            codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
+                                  {0: 'a', 1: 'b', 2: None}),
+            ("ab\\x02", 3)
+        )
+        # Issue #14850
+        self.assertEqual(
+            codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
+                                  {0: 'a', 1: 'b', 2: '\ufffe'}),
+            ("ab\\x02", 3)
+        )
         self.assertEqual(
             codecs.charmap_decode(b"\x00\x01\x02", "ignore",
                                   {0: 'a', 1: 'b'}),
@@ -2194,6 +2232,18 @@ class CharmapTest(unittest.TestCase):
             ("ab\ufffd", 3)
         )
+        self.assertEqual(
+            codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
+                                  {0: a, 1: b}),
+            ("ab\\x02", 3)
+        )
+        self.assertEqual(
+            codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
+                                  {0: a, 1: b, 2: 0xFFFE}),
+            ("ab\\x02", 3)
+        )
         self.assertEqual(
             codecs.charmap_decode(b"\x00\x01\x02", "ignore",
                                   {0: a, 1: b}),
@@ -2253,9 +2303,13 @@ class TypesTest(unittest.TestCase):
         self.assertRaises(UnicodeDecodeError, codecs.unicode_escape_decode, br"\U00110000")
         self.assertEqual(codecs.unicode_escape_decode(r"\U00110000", "replace"), ("\ufffd", 10))
+        self.assertEqual(codecs.unicode_escape_decode(r"\U00110000", "backslashreplace"),
+                         (r"\x5c\x55\x30\x30\x31\x31\x30\x30\x30\x30", 10))
         self.assertRaises(UnicodeDecodeError, codecs.raw_unicode_escape_decode, br"\U00110000")
         self.assertEqual(codecs.raw_unicode_escape_decode(r"\U00110000", "replace"), ("\ufffd", 10))
+        self.assertEqual(codecs.raw_unicode_escape_decode(r"\U00110000", "backslashreplace"),
+                         (r"\x5c\x55\x30\x30\x31\x31\x30\x30\x30\x30", 10))
 class UnicodeEscapeTest(unittest.TestCase):
@@ -2894,11 +2948,13 @@ class CodePageTest(unittest.TestCase):
             (b'[\xff]', 'strict', None),
             (b'[\xff]', 'ignore', '[]'),
             (b'[\xff]', 'replace', '[\ufffd]'),
+            (b'[\xff]', 'backslashreplace', '[\\xff]'),
             (b'[\xff]', 'surrogateescape', '[\udcff]'),
             (b'[\xff]', 'surrogatepass', None),
             (b'\x81\x00abc', 'strict', None),
             (b'\x81\x00abc', 'ignore', '\x00abc'),
             (b'\x81\x00abc', 'replace', '\ufffd\x00abc'),
+            (b'\x81\x00abc', 'backslashreplace', '\\xff\x00abc'),
         ))
     def test_cp1252(self):
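The charmap cases added in this file can be reproduced outside the test suite. This is a minimal sketch, assuming Python 3.5 or later; `codecs.charmap_decode` is the same low-level function the tests call, and the mapping tables are copied from the new assertions.

```python
import codecs

data = b"\x00\x01\x02"

# String table: byte 0 -> 'a', byte 1 -> 'b'; byte 2 falls off the end.
from_str = codecs.charmap_decode(data, "backslashreplace", "ab")

# Dict table: a value of None marks byte 2 as unmapped.
from_dict = codecs.charmap_decode(
    data, "backslashreplace", {0: "a", 1: "b", 2: None})

# Both return (decoded_text, number_of_bytes_consumed).
print(from_str, from_dict)
```

In both cases the unmapped byte is kept as a `\x02` escape and all three input bytes are reported as consumed.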
Misc/NEWS
@@ -10,6 +10,9 @@ Release date: TBA
 Core and Builtins
 -----------------
+- Issue #22286: The "backslashreplace" error handlers now works with
+decoding and translating.
 - Issue #23253: Delay-load ShellExecute[AW] in os.startfile for reduced
 startup overhead on Windows.
Python/codecs.c
@@ -864,22 +864,66 @@ PyObject *PyCodec_XMLCharRefReplaceErrors(PyObject *exc)
 PyObject *PyCodec_BackslashReplaceErrors(PyObject *exc)
 {
-    if (PyObject_IsInstance(exc, PyExc_UnicodeEncodeError)) {
-        PyObject *restuple;
     PyObject *object;
     Py_ssize_t i;
     Py_ssize_t start;
     Py_ssize_t end;
     PyObject *res;
     unsigned char *outp;
-    Py_ssize_t ressize;
+    int ressize;
     Py_UCS4 c;
+    if (PyObject_IsInstance(exc, PyExc_UnicodeDecodeError)) {
+        unsigned char *p;
+        if (PyUnicodeDecodeError_GetStart(exc, &start))
+            return NULL;
+        if (PyUnicodeDecodeError_GetEnd(exc, &end))
+            return NULL;
+        if (!(object = PyUnicodeDecodeError_GetObject(exc)))
+            return NULL;
+        if (!(p = (unsigned char*)PyBytes_AsString(object))) {
+            Py_DECREF(object);
+            return NULL;
+        }
+        res = PyUnicode_New(4 * (end - start), 127);
+        if (res == NULL) {
+            Py_DECREF(object);
+            return NULL;
+        }
+        outp = PyUnicode_1BYTE_DATA(res);
+        for (i = start; i < end; i++, outp += 4) {
+            unsigned char c = p[i];
+            outp[0] = '\\';
+            outp[1] = 'x';
+            outp[2] = Py_hexdigits[(c>>4)&0xf];
+            outp[3] = Py_hexdigits[c&0xf];
+        }
+        assert(_PyUnicode_CheckConsistency(res, 1));
+        Py_DECREF(object);
+        return Py_BuildValue("(Nn)", res, end);
+    }
+    if (PyObject_IsInstance(exc, PyExc_UnicodeEncodeError)) {
         if (PyUnicodeEncodeError_GetStart(exc, &start))
             return NULL;
         if (PyUnicodeEncodeError_GetEnd(exc, &end))
             return NULL;
         if (!(object = PyUnicodeEncodeError_GetObject(exc)))
             return NULL;
+    }
+    else if (PyObject_IsInstance(exc, PyExc_UnicodeTranslateError)) {
+        if (PyUnicodeTranslateError_GetStart(exc, &start))
+            return NULL;
+        if (PyUnicodeTranslateError_GetEnd(exc, &end))
+            return NULL;
+        if (!(object = PyUnicodeTranslateError_GetObject(exc)))
+            return NULL;
+    }
+    else {
+        wrong_exception_type(exc);
+        return NULL;
+    }
     if (end - start > PY_SSIZE_T_MAX / (1+1+8))
         end = start + PY_SSIZE_T_MAX / (1+1+8);
     for (i = start, ressize = 0; i < end; ++i) {
@@ -899,8 +943,8 @@ PyObject *PyCodec_BackslashReplaceErrors(PyObject *exc)
         Py_DECREF(object);
         return NULL;
     }
-    for (i = start, outp = PyUnicode_1BYTE_DATA(res);
-         i < end; ++i) {
+    outp = PyUnicode_1BYTE_DATA(res);
+    for (i = start; i < end; ++i) {
         c = PyUnicode_READ_CHAR(object, i);
         *outp++ = '\\';
         if (c >= 0x00010000) {
@@ -924,14 +968,8 @@ PyObject *PyCodec_BackslashReplaceErrors(PyObject *exc)
     }
     assert(_PyUnicode_CheckConsistency(res, 1));
-    restuple = Py_BuildValue("(Nn)", res, end);
     Py_DECREF(object);
-    return restuple;
-    }
-    else {
-        wrong_exception_type(exc);
-        return NULL;
-    }
+    return Py_BuildValue("(Nn)", res, end);
 }
 static _PyUnicode_Name_CAPI *ucnhash_CAPI = NULL;
@@ -1444,8 +1482,8 @@ static int _PyCodecRegistry_Init(void)
             backslashreplace_errors,
             METH_O,
             PyDoc_STR("Implements the 'backslashreplace' error handling, "
-                      "which replaces an unencodable character with a "
-                      "backslashed escape sequence.")
+                      "which replaces malformed data with a backslashed "
+                      "escape sequence.")
         }
     },
     {
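The new `UnicodeDecodeError` branch in `PyCodec_BackslashReplaceErrors` emits four characters per offending byte: a backslash, `x`, and two hex digits. The following is a pure-Python model of that branch only, not the CPython implementation; the function name, the `HEX_DIGITS` constant and the registered handler name `sketch.backslashreplace` are hypothetical and exist only for this illustration.

```python
import codecs

HEX_DIGITS = "0123456789abcdef"  # stands in for Py_hexdigits in the C code


def backslashreplace_decode_sketch(exc):
    """Illustrative model of the decoding branch; other types are rejected."""
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("this sketch only models the decoding branch")
    pieces = []
    for byte in exc.object[exc.start:exc.end]:
        # High nibble, then low nibble, as in outp[2] and outp[3].
        pieces.append("\\x" + HEX_DIGITS[(byte >> 4) & 0xF] + HEX_DIGITS[byte & 0xF])
    # Return the replacement text and the position where decoding resumes.
    return "".join(pieces), exc.end


# Hypothetical handler name, registered only for this example.
codecs.register_error("sketch.backslashreplace", backslashreplace_decode_sketch)

modelled = b"[\xff\xfe]".decode("ascii", "sketch.backslashreplace")
builtin = b"[\xff\xfe]".decode("ascii", "backslashreplace")
print(modelled, builtin)
```

On Python 3.5 or later the two results should match, since the built-in handler applies the same per-byte formatting.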