Kirill Smelkov / cpython
Commit fbb39815, authored Oct 25, 2011 by Ezio Melotti

Refactor a bit the codecs doc.

parent 963004d1
Showing 1 changed file with 21 additions and 19 deletions

Doc/library/codecs.rst (+21 -19)
...
...
@@ -810,27 +810,28 @@ e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
 Windows). There's a string constant with 256 characters that shows you which
 character is mapped to which byte value.

-All of these encodings can only encode 256 of the 65536 (or 1114111) codepoints
+All of these encodings can only encode 256 of the 1114112 codepoints
 defined in Unicode. A simple and straightforward way that can store each Unicode
-code point, is to store each codepoint as two consecutive bytes. There are two
-possibilities: Store the bytes in big endian or in little endian order. These
-two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
-disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
-will always have to swap bytes on encoding and decoding. UTF-16 avoids this
-problem: Bytes will always be in natural endianness. When these bytes are read
+code point, is to store each codepoint as four consecutive bytes. There are two
+possibilities: store the bytes in big endian or in little endian order. These
+two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
+disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
+will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
+problem: bytes will always be in natural endianness. When these bytes are read
 by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a UTF-16 byte sequence, there's the so
-called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
-This character will be prepended to every UTF-16 byte sequence. The byte swapped
-version of this character (``0xFFFE``) is an illegal character that may not
-appear in a Unicode text. So when the first character in an UTF-16 byte sequence
+be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
+there's the so called BOM ("Byte Order Mark"). This is the Unicode character
+``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
+byte sequence. The byte swapped version of this character (``0xFFFE``) is an
+illegal character that may not appear in a Unicode text. So when the
+first character in an ``UTF-16`` or ``UTF-32`` byte sequence
 appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
-Unfortunately upto Unicode 4.0 the character ``U+FEFF`` had a second purpose as
-a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
+Unfortunately the character ``U+FEFF`` had a second purpose as
+a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
 With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
 deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
-Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
+Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
 it's a device to determine the storage layout of the encoded bytes, and vanishes
 once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
 NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
...
...
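The byte-order behaviour described in the hunk above can be checked with a minimal sketch (not part of the patch). It assumes Python 3 and uses only the built-in ``utf-16``/``utf-32`` codecs and the ``codecs.BOM_*`` constants; the sample string is arbitrary.

```python
import codecs

text = "A\u20ac"  # LATIN CAPITAL LETTER A followed by EURO SIGN (arbitrary sample)

# Codecs with an explicit byte order never write a BOM.
be = text.encode("utf-32-be")
le = text.encode("utf-32-le")

# The generic codec writes a BOM first, then the data in the machine's native order.
generic = text.encode("utf-32")
has_known_bom = generic.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE))

# On decoding, the generic codec uses the BOM to pick the byte order and drops it.
from_be = (codecs.BOM_UTF32_BE + be).decode("utf-32")
from_le = (codecs.BOM_UTF32_LE + le).decode("utf-32")

# With an explicit byte order, a leading U+FEFF is kept as an ordinary character
# (the ZERO WIDTH NO-BREAK SPACE role).
kept = (codecs.BOM_UTF16_LE + "A".encode("utf-16-le")).decode("utf-16-le")

print(be, le, has_known_bom, from_be == text, from_le == text, repr(kept))
```

The last line of the sketch shows both roles of ``U+FEFF`` side by side: it vanishes when the generic codec treats it as a BOM, and survives as a character when the byte order is already fixed by the codec name.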
@@ -838,7 +839,7 @@ NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
 There's another encoding that is able to encoding the full range of Unicode
 characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
 with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
-parts: Marker bits (the most significant bits) and payload bits. The marker bits
+parts: marker bits (the most significant bits) and payload bits. The marker bits
 are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
 encoded like this (with x being payload bits, which when concatenated give the
 Unicode character):
...
...
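The marker/payload split described in the hunk above can be sketched by hand as follows (not part of the patch). The function name ``utf8_encode_code_point`` is hypothetical and introduced only for illustration; the sketch assumes Python 3, handles one code point at a time, and does not reject surrogates the way the real codec does.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode a single code point as UTF-8 by hand (illustrative sketch only)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        # Zero marker 1-bits followed by a 0 bit: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        n_cont = 1  # 110xxxxx 10xxxxxx
    elif cp < 0x10000:
        n_cont = 2  # 1110xxxx 10xxxxxx 10xxxxxx
    else:
        n_cont = 3  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    # Lead byte marker: (n_cont + 1) one bits followed by a zero bit.
    lead_marker = (0xFF << (7 - n_cont)) & 0xFF
    # The remaining low bits of the lead byte carry the topmost payload bits.
    out = [lead_marker | (cp >> (6 * n_cont))]
    # Each continuation byte has the marker 10 and carries six payload bits.
    for i in range(n_cont - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


# Compare against the built-in codec for a few boundary values.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0x20AC, 0xFFFF, 0x1F600, 0x10FFFF):
    assert utf8_encode_code_point(cp) == chr(cp).encode("utf-8")
```

Because every byte carries its own marker bits, the byte sequence reads the same on any machine, which is why no byte-order variants of UTF-8 are needed.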
@@ -877,13 +878,14 @@ map to
 | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
 | INVERTED QUESTION MARK

-in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
+in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
 correctly guessed from the byte sequence. So here the BOM is not used to be able
 to determine the byte order used for generating the byte sequence, but as a
 signature that helps in guessing the encoding. On encoding the utf-8-sig codec
 will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
-decoding utf-8-sig will skip those three bytes if they appear as the first three
-bytes in the file.
+decoding ``utf-8-sig`` will skip those three bytes if they appear as the first
+three bytes in the file. In UTF-8, the use of the BOM is discouraged and
+should generally be avoided.


 .. _standard-encodings:
...
...
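The ``utf-8-sig`` behaviour described in the last hunk can be observed with a short sketch (not part of the patch). It assumes Python 3 and uses only the built-in codecs and ``codecs.BOM_UTF8``; the sample string is arbitrary.

```python
import codecs

text = "\u00bfQu\u00e9?"  # arbitrary sample containing non-ASCII characters

plain = text.encode("utf-8")
signed = text.encode("utf-8-sig")

# On encoding, utf-8-sig prepends the three signature bytes 0xef 0xbb 0xbf.
starts_with_signature = signed == codecs.BOM_UTF8 + plain

# On decoding, utf-8-sig skips the signature if present and accepts its absence.
decoded_signed = signed.decode("utf-8-sig")
decoded_plain = plain.decode("utf-8-sig")

# The plain utf-8 codec keeps the signature as a leading U+FEFF character.
decoded_with_bom = signed.decode("utf-8")

# Misread as iso-8859-1, the signature shows up as three visible characters.
misread_prefix = signed.decode("iso-8859-1")[:3]

print(starts_with_signature, decoded_signed == text, decoded_plain == text)
print(repr(decoded_with_bom[0]), misread_prefix)
```

The stray ``U+FEFF`` left behind by the plain ``utf-8`` codec is one practical reason the added sentence in the patch discourages writing a BOM in UTF-8 data.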