Repository: Kirill Smelkov / cpython
Commit 30c78d6d
Authored May 11, 2008 by Georg Brandl

Update the codecs docs w.r.t. str/bytes.

Parent: 20a046cc
Showing 1 changed file with 63 additions and 78 deletions

Doc/library/codecs.rst (+63, -78)
@@ -207,15 +207,14 @@ utility functions:
 .. note::
-The wrapped version will only accept the object format defined by the codecs,
-i.e. Unicode objects for most built-in codecs. Output is also codec-dependent
-and will usually be Unicode as well.
+The wrapped version's methods will accept and return strings only. Bytes
+arguments will be rejected.
 .. note::
 Files are always opened in binary mode, even if no binary mode was
 specified. This is done to avoid data loss due to encodings using 8-bit
-values. This means that no automatic conversion of ``'\n'`` is done
+values. This means that no automatic conversion of ``b'\n'`` is done
 on reading and writing.
 *encoding* specifies the encoding which is to be used for the file.
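
The revised note can be checked with a short sketch. It assumes the hunk documents ``codecs.open`` (the function name is not visible in the hunk itself) and uses a hypothetical scratch file; it illustrates the described behaviour and is not part of the commit.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# The wrapped file accepts str and encodes it on the way out.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9\n")

# Bytes arguments are rejected by the wrapped writer.
rejected = False
with codecs.open(path, "a", encoding="utf-8") as f:
    try:
        f.write(b"raw bytes")
    except TypeError:
        rejected = True

# The underlying file is binary: b'\n' is stored untranslated.
with open(path, "rb") as raw:
    data = raw.read()

# Reading through the wrapper returns str again.
with codecs.open(path, "r", encoding="utf-8") as f:
    text = f.read()
```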
@@ -232,10 +231,9 @@ utility functions:
 Return a wrapped version of file which provides transparent encoding
 translation.
-Strings written to the wrapped file are interpreted according to the given
-*input* encoding and then written to the original file as strings using the
-*output* encoding. The intermediate encoding will usually be Unicode but depends
-on the specified codecs.
+Bytes written to the wrapped file are interpreted according to the given
+*input* encoding and then written to the original file as bytes using the
+*output* encoding.
 If *output* is not given, it defaults to *input*.
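
A minimal sketch of the recoding described in this hunk, assuming it documents ``codecs.EncodedFile`` (the function name lies outside the hunk). The two encodings are passed positionally as *input* and *output*; the in-memory ``io.BytesIO`` files are stand-ins chosen for the example.

```python
import codecs
import io

# Writing: bytes in the *input* encoding (latin-1) go in,
# bytes in the *output* encoding (utf-8) reach the original file.
raw_out = io.BytesIO()
writer = codecs.EncodedFile(raw_out, "latin-1", "utf-8")
writer.write("caf\u00e9".encode("latin-1"))
written = raw_out.getvalue()

# Reading goes the other way: utf-8 bytes in the original file
# come back as latin-1 bytes.
raw_in = io.BytesIO("caf\u00e9".encode("utf-8"))
reader = codecs.EncodedFile(raw_in, "latin-1", "utf-8")
read_back = reader.read()
```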
@@ -338,8 +336,7 @@ interfaces of the stateless encoder and decoder:
 .. method:: Codec.encode(input[, errors])
 Encodes the object *input* and returns a tuple (output object, length consumed).
-While codecs are not restricted to use with Unicode, in a Unicode context,
-encoding converts a Unicode object to a plain string using a particular
+Encoding converts a string object to a bytes object using a particular
 character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).
 *errors* defines the error handling to apply. It defaults to ``'strict'``
@@ -355,13 +352,12 @@ interfaces of the stateless encoder and decoder:
 .. method:: Codec.decode(input[, errors])
-Decodes the object *input* and returns a tuple (output object, length consumed).
-In a Unicode context, decoding converts a plain string encoded using a
-particular character set encoding to a Unicode object.
+Decodes the object *input* and returns a tuple (output object, length
+consumed). Decoding converts a bytes object encoded using a particular
+character set encoding to a string object.
-*input* must be an object which provides the ``bf_getreadbuf`` buffer slot.
-Python strings, buffer objects and memory mapped files are examples of objects
-providing this slot.
+*input* must be a bytes object or one which provides the read-only character
+buffer interface -- for example, buffer objects and memory mapped files.
 *errors* defines the error handling to apply. It defaults to ``'strict'``
 handling.
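
The stateless interface described in the two hunks above can be exercised through ``codecs.lookup``. This is a small sketch; the sample text and the choice of ``iso-8859-1`` are assumptions made for illustration.

```python
import codecs

info = codecs.lookup("iso-8859-1")

# encode: str -> (bytes, number of characters consumed)
encoded, chars_consumed = info.encode("caf\u00e9")

# decode: bytes -> (str, number of bytes consumed)
decoded, bytes_consumed = info.decode(encoded)

# decode also accepts other buffer-providing objects, e.g. memoryview.
from_view, _ = info.decode(memoryview(b"abc"))

# *errors* selects the error handling; 'replace' substitutes '?'
# for characters the encoding cannot represent.
replaced, _ = info.encode("\u1234", "replace")
```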
@@ -746,9 +742,7 @@ The design is such that one can use the factory functions returned by the
 :class:`StreamReader` and :class:`StreamWriter` interface respectively.
 *encode* and *decode* are needed for the frontend translation, *Reader* and
-*Writer* for the backend translation. The intermediate format used is
-determined by the two sets of codecs, e.g. the Unicode codecs will use Unicode
-as the intermediate encoding.
+*Writer* for the backend translation.
 Error handling is done in the same way as defined for the stream readers and
 writers.
@@ -764,32 +758,32 @@ methods and attributes from the underlying stream.
 Encodings and Unicode
 ---------------------
-Unicode strings are stored internally as sequences of codepoints (to be precise
+Strings are stored internally as sequences of codepoints (to be precise
 as :ctype:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
 via :option:`--without-wide-unicode` or :option:`--with-wide-unicode`, with the
 former being the default) :ctype:`Py_UNICODE` is either a 16-bit or 32-bit data
-type. Once a Unicode object is used outside of CPU and memory, CPU endianness
+type. Once a string object is used outside of CPU and memory, CPU endianness
 and how these arrays are stored as bytes become an issue. Transforming a
-unicode object into a sequence of bytes is called encoding and recreating the
-unicode object from the sequence of bytes is known as decoding. There are many
+string object into a sequence of bytes is called encoding and recreating the
+string object from the sequence of bytes is known as decoding. There are many
 different methods for how this transformation can be done (these methods are
 also called encodings). The simplest method is to map the codepoints 0-255 to
-the bytes ``0x0``-``0xff``. This means that a unicode object that contains
+the bytes ``0x0``-``0xff``. This means that a string object that contains
 codepoints above ``U+00FF`` can't be encoded with this method (which is called
-``'latin-1'`` or ``'iso-8859-1'``). :func:`unicode.encode` will raise a
-:exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1'
-codec can't encode character u'\u1234' in position 3: ordinal not in
+``'latin-1'`` or ``'iso-8859-1'``). :func:`str.encode` will raise a
+:exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1'
+codec can't encode character '\u1234' in position 3: ordinal not in
 range(256)``.
 There's another group of encodings (the so called charmap encodings) that choose
-a different subset of all unicode code points and how these codepoints are
+a different subset of all Unicode code points and how these codepoints are
 mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
 e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
 Windows). There's a string constant with 256 characters that shows you which
 character is mapped to which byte value.
 All of these encodings can only encode 256 of the 65536 (or 1114111) codepoints
-defined in unicode. A simple and straightforward way that can store each Unicode
+defined in Unicode. A simple and straightforward way that can store each Unicode
 code point, is to store each codepoint as two consecutive bytes. There are two
 possibilities: Store the bytes in big endian or in little endian order. These
 two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
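
The three families of encodings discussed in this hunk can be compared side by side. The sketch below uses sample characters chosen for illustration (``U+1234`` as in the quoted error message, and the euro sign ``U+20AC``).

```python
# latin-1 maps codepoints 0-255 straight to bytes; anything above
# U+00FF raises UnicodeEncodeError.
try:
    "abc\u1234".encode("latin-1")
except UnicodeEncodeError as exc:
    error_position = exc.start
    message = str(exc)

# A charmap encoding such as cp1252 picks a different subset of code
# points: it can encode the euro sign, which latin-1 cannot.
euro_cp1252 = "\u20ac".encode("cp1252")

# UTF-16 stores the code point as two bytes, in either byte order.
euro_be = "\u20ac".encode("utf-16-be")
euro_le = "\u20ac".encode("utf-16-le")
```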
@@ -810,7 +804,7 @@ With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
 deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
 Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
 it's a device to determine the storage layout of the encoded bytes, and vanishes
-once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
+once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
 NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
 There's another encoding that is able to encoding the full range of Unicode
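
Both roles of ``U+FEFF`` can be observed by decoding the same bytes with two codecs. This is a sketch with an arbitrary two-character payload: the generic ``utf-16`` codec consumes the BOM, while the explicit-endianness codec treats it as an ordinary character.

```python
import codecs

data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# As a BOM: it selects the byte order and vanishes from the result.
as_bom = data.decode("utf-16")

# As a normal character: with the byte order given explicitly,
# U+FEFF is decoded like any other code point.
as_character = data.decode("utf-16-le")
```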
@@ -841,11 +835,11 @@ Unicode character):
 The least significant bit of the Unicode character is the rightmost x bit.
 As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
-the decoded Unicode string (even if it's the first character) is treated as a
-``ZERO WIDTH NO-BREAK SPACE``.
+the decoded string (even if it's the first character) is treated as a ``ZERO
+WIDTH NO-BREAK SPACE``.
 Without external information it's impossible to reliably determine which
-encoding was used for encoding a Unicode string. Each charmap encoding can
+encoding was used for encoding a string. Each charmap encoding can
 decode any random byte sequence. However that's not possible with UTF-8, as
 UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
 sequences. To increase the reliability with which a UTF-8 encoding can be
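
A short sketch of the UTF-8 behaviour described here. ``latin-1`` stands in for "a charmap encoding" in the comparison, and the byte values are arbitrary examples.

```python
import codecs

data = codecs.BOM_UTF8 + "hi".encode("utf-8")

# Plain utf-8 keeps a leading U+FEFF as a character.
plain = data.decode("utf-8")

# The utf-8-sig codec (utf_8_sig in the encodings table) strips it.
without_signature = data.decode("utf-8-sig")

# latin-1 decodes any byte sequence, so it cannot be ruled out by
# inspection alone...
arbitrary = bytes([0xFF, 0xFE, 0x80])
as_latin1 = arbitrary.decode("latin-1")

# ...whereas UTF-8 rejects byte sequences without the right structure.
try:
    arbitrary.decode("utf-8")
    utf8_accepted = True
except UnicodeDecodeError:
    utf8_accepted = False
```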
@@ -1096,54 +1090,45 @@ particular, the following variants typically exist:
 | utf_8_sig       |                                | all languages                  |
 +-----------------+--------------------------------+--------------------------------+
-A number of codecs are specific to Python, so their codec names have no meaning
-outside Python. Some of them don't convert from Unicode strings to byte strings,
-but instead use the property of the Python codecs machinery that any bijective
-function with one argument can be considered as an encoding.
-For the codecs listed below, the result in the "encoding" direction is always a
-byte string. The result of the "decoding" direction is listed as operand type in
-the table.
 .. XXX fix here, should be in above table
-+--------------------+---------+----------------+---------------------------+
-| Codec              | Aliases | Operand type   | Purpose                   |
-+====================+=========+================+===========================+
-| idna               |         | Unicode string | Implements :rfc:`3490`,   |
-|                    |         |                | see also                  |
-|                    |         |                | :mod:`encodings.idna`     |
-+--------------------+---------+----------------+---------------------------+
-| mbcs               | dbcs    | Unicode string | Windows only: Encode      |
-|                    |         |                | operand according to the  |
-|                    |         |                | ANSI codepage (CP_ACP)    |
-+--------------------+---------+----------------+---------------------------+
-| palmos             |         | Unicode string | Encoding of PalmOS 3.5    |
-+--------------------+---------+----------------+---------------------------+
-| punycode           |         | Unicode string | Implements :rfc:`3492`    |
-+--------------------+---------+----------------+---------------------------+
-| raw_unicode_escape |         | Unicode string | Produce a string that is  |
-|                    |         |                | suitable as raw Unicode   |
-|                    |         |                | literal in Python source  |
-|                    |         |                | code                      |
-+--------------------+---------+----------------+---------------------------+
-| undefined          |         | any            | Raise an exception for    |
-|                    |         |                | all conversions. Can be   |
-|                    |         |                | used as the system        |
-|                    |         |                | encoding if no automatic  |
-|                    |         |                | coercion between byte and |
-|                    |         |                | Unicode strings is        |
-|                    |         |                | desired.                  |
-+--------------------+---------+----------------+---------------------------+
-| unicode_escape     |         | Unicode string | Produce a string that is  |
-|                    |         |                | suitable as Unicode       |
-|                    |         |                | literal in Python source  |
-|                    |         |                | code                      |
-+--------------------+---------+----------------+---------------------------+
-| unicode_internal   |         | Unicode string | Return the internal       |
-|                    |         |                | representation of the     |
-|                    |         |                | operand                   |
-+--------------------+---------+----------------+---------------------------+
++--------------------+---------+---------------------------+
+| Codec              | Aliases | Purpose                   |
++====================+=========+===========================+
+| idna               |         | Implements :rfc:`3490`,   |
+|                    |         | see also                  |
+|                    |         | :mod:`encodings.idna`     |
++--------------------+---------+---------------------------+
+| mbcs               | dbcs    | Windows only: Encode      |
+|                    |         | operand according to the  |
+|                    |         | ANSI codepage (CP_ACP)    |
++--------------------+---------+---------------------------+
+| palmos             |         | Encoding of PalmOS 3.5    |
++--------------------+---------+---------------------------+
+| punycode           |         | Implements :rfc:`3492`    |
++--------------------+---------+---------------------------+
+| raw_unicode_escape |         | Produce a string that is  |
+|                    |         | suitable as raw Unicode   |
+|                    |         | literal in Python source  |
+|                    |         | code                      |
++--------------------+---------+---------------------------+
+| undefined          |         | Raise an exception for    |
+|                    |         | all conversions. Can be   |
+|                    |         | used as the system        |
+|                    |         | encoding if no automatic  |
+|                    |         | coercion between byte and |
+|                    |         | Unicode strings is        |
+|                    |         | desired.                  |
++--------------------+---------+---------------------------+
+| unicode_escape     |         | Produce a string that is  |
+|                    |         | suitable as Unicode       |
+|                    |         | literal in Python source  |
+|                    |         | code                      |
++--------------------+---------+---------------------------+
+| unicode_internal   |         | Return the internal       |
+|                    |         | representation of the     |
+|                    |         | operand                   |
++--------------------+---------+---------------------------+
 :mod:`encodings.idna` --- Internationalized Domain Names in Applications
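
Several of the Python-specific codecs in the table above can be tried directly. This sketch uses an arbitrary sample word and the hypothetical domain ``bücher.example``; in each case the encoding direction yields bytes. It leaves out ``mbcs`` (Windows only) and ``unicode_internal``, which may not be available on a given interpreter.

```python
# punycode (RFC 3492) and idna (RFC 3490) produce ASCII-only bytes.
puny = "b\u00fccher".encode("punycode")
domain = "b\u00fccher.example".encode("idna")

# unicode_escape writes non-ASCII characters and control characters
# as backslash escapes; raw_unicode_escape leaves a newline as is.
escaped = "\u1234\n".encode("unicode_escape")
raw_escaped = "\u1234\n".encode("raw_unicode_escape")

# undefined raises for every conversion.
try:
    "abc".encode("undefined")
    undefined_failed = False
except UnicodeError:
    undefined_failed = True
```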