Repository: Kirill Smelkov / cpython
Commit 131e4f71
Authored Jan 23, 2006 by Georg Brandl
Parent: 296152e6

Add markup to new section in codecs docs

Showing 1 changed file with 42 additions and 41 deletions

Doc/lib/libcodecs.tex (+42 -41)
@@ -525,9 +525,10 @@ all other methods and attribute from the underlying stream.
 \subsection{Encodings and Unicode\label{encodings-overview}}
 
 Unicode strings are stored internally as sequences of codepoints (to
-be precise as Py_UNICODE arrays). Depending on the way Python is
-compiled (either via --enable-unicode=ucs2 or --enable-unicode=ucs4,
-with the former being the default) Py_UNICODE is either a 16-bit or
+be precise as \ctype{Py_UNICODE} arrays). Depending on the way Python is
+compiled (either via \longprogramopt{enable-unicode=ucs2} or
+\longprogramopt{enable-unicode=ucs4}, with the former being the default)
+\ctype{Py_UNICODE} is either a 16-bit or
 32-bit data type. Once a Unicode object is used outside of CPU and
 memory, CPU endianness and how these arrays are stored as bytes become
 an issue. Transforming a unicode object into a sequence of bytes is
@@ -535,20 +536,20 @@ called encoding and recreating the unicode object from the sequence of
 bytes is known as decoding. There are many different methods how this
 transformation can be done (these methods are also called encodings).
 The simplest method is to map the codepoints 0-255 to the bytes
-0x0-0xff. This means that a unicode object that contains codepoints
-above U+00FF can't be encoded with this method (which is called
-'latin-1' or 'iso-8859-1'). unicode.encode() will raise a
-UnicodeEncodeError that looks like this: UnicodeEncodeError: 'latin-1'
-codec can't encode character u'\u1234' in position 3: ordinal not in
-range(256)
+\code{0x0}-\code{0xff}. This means that a unicode object that contains
+codepoints above \code{U+00FF} can't be encoded with this method (which
+is called \code{'latin-1'} or \code{'iso-8859-1'}). unicode.encode() will
+raise a UnicodeEncodeError that looks like this: \samp{UnicodeEncodeError:
+'latin-1' codec can't encode character u'\e u1234' in position 3: ordinal
+not in range(256)}.
 
 There's another group of encodings (the so called charmap encodings)
 that choose a different subset of all unicode code points and how
-these codepoints are mapped to the bytes 0x0-0xff. To see how this is
-done simply open e.g. encodings/cp1252.py (which is an encoding that
-is used primarily on Windows). There's a string constant with 256
-characters that shows you which character is mapped to which byte
-value.
+these codepoints are mapped to the bytes \code{0x0}-\code{0xff.}
+To see how this is done simply open e.g. \file{encodings/cp1252.py}
+(which is an encoding that is used primarily on Windows).
+There's a string constant with 256 characters that shows you which
+character is mapped to which byte value.
 
 All of these encodings can only encode 256 of the 65536 (or 1114111)
 codepoints defined in unicode. A simple and straightforward way that
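
Reader's note on the hunk above: the behaviour the documentation describes can be reproduced with a minimal sketch. It assumes a modern Python 3 interpreter, where `str` plays the role of the Python 2 `unicode` type named in the diff, so the error message differs slightly from the one quoted in the LaTeX source. The variable names are illustrative only.

```python
# Python 3 sketch (assumption: the diffed text targets Python 2's unicode type).
text = "abc\u1234"  # U+1234 lies outside the 0-255 range latin-1 can map

error = None
try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    # The offending character sits at index 3, as in the documented message.
    error = exc

# Charmap codecs choose a different 256-character subset:
# EURO SIGN exists in cp1252 but not in latin-1.
euro_cp1252 = "\u20ac".encode("cp1252")

# A codepoint below 256 maps to the byte with the same value in latin-1.
latin1_bytes = "\u00e9".encode("latin-1")
```

The exception object carries the position (`error.start`) and the reason string that the documented message is built from.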
@@ -562,20 +563,20 @@ Bytes will always be in natural endianness. When these bytes are read
 by a CPU with a different endianness, then bytes have to be swapped
 though. To be able to detect the endianness of a UTF-16 byte sequence,
 there's the so called BOM (the "Byte Order Mark"). This is the Unicode
-character U+FEFF. This character will be prepended to every UTF-16
-byte sequence. The byte swapped version of this character (0xFFFE) is
+character \code{U+FEFF}. This character will be prepended to every UTF-16
+byte sequence. The byte swapped version of this character (\code{0xFFFE}) is
 an illegal character that may not appear in a Unicode text. So when
-the first character in an UTF-16 byte sequence appears to be a U+FFFE
+the first character in an UTF-16 byte sequence appears to be a \code{U+FFFE}
 the bytes have to be swapped on decoding. Unfortunately upto Unicode
-4.0 the character U+FEFF had a second purpose as a "ZERO WIDTH
-NO-BREAK SPACE": A character that has no width and doesn't allow a
+4.0 the character \code{U+FEFF} had a second purpose as a \samp{ZERO WIDTH
+NO-BREAK SPACE}: A character that has no width and doesn't allow a
 word to be split. It can e.g. be used to give hints to a ligature
-algorithm. With Unicode 4.0 using U+FEFF as a ZERO WIDTH NO-BREAK
-SPACE has been deprecated (with U+2060 (WORD JOINER) assuming this
-role). Nevertheless Unicode software still must be able to handle
-U+FEFF in both roles: As a BOM it's a device to determine the storage
+algorithm. With Unicode 4.0 using \code{U+FEFF} as a \samp{ZERO WIDTH NO-BREAK
+SPACE} has been deprecated (with \code{U+2060} (\samp{WORD JOINER}) assuming
+this role). Nevertheless Unicode software still must be able to handle
+\code{U+FEFF} in both roles: As a BOM it's a device to determine the storage
 layout of the encoded bytes, and vanishes once the byte sequence has
-been decoded into a Unicode string; as a ZERO WIDTH NO-BREAK SPACE
+been decoded into a Unicode string; as a \samp{ZERO WIDTH NO-BREAK SPACE}
 it's a normal character that will be decoded like any other.
 
 There's another encoding that is able to encoding the full range of
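
Reader's note on the hunk above: the two roles of `U+FEFF` can be observed with the standard `codecs` constants. This is a sketch assuming Python 3; the variable names are illustrative and not part of the commit.

```python
import codecs

text = "hi"
little = text.encode("utf-16-le")  # explicit byte order, no BOM written
big = text.encode("utf-16-be")

# Prepend the BOM in the matching byte order to each payload.
le_stream = codecs.BOM_UTF16_LE + little
be_stream = codecs.BOM_UTF16_BE + big

# The generic "utf-16" codec reads the BOM, picks the byte order,
# and drops the BOM from the decoded string.
decoded_le = le_stream.decode("utf-16")
decoded_be = be_stream.decode("utf-16")

# A codec with a fixed byte order treats U+FEFF as an ordinary character.
kept = le_stream.decode("utf-16-le")
```

Both streams decode to the same string through `"utf-16"`, while `kept` still starts with `U+FEFF`, which matches the "BOM vanishes" versus "normal character" distinction in the text.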
@@ -588,20 +589,20 @@ bits) and payload bits. The marker bits are a sequence of zero to six
 character):
 
 \begin{tableii}{l|l}{textrm}{}{Range}{Encoding}
-\lineii{U-00000000 ... U-0000007F}{0xxxxxxx}
-\lineii{U-00000080 ... U-000007FF}{110xxxxx 10xxxxxx}
-\lineii{U-00000800 ... U-0000FFFF}{1110xxxx 10xxxxxx 10xxxxxx}
-\lineii{U-00010000 ... U-001FFFFF}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
-\lineii{U-00200000 ... U-03FFFFFF}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
-\lineii{U-04000000 ... U-7FFFFFFF}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
+\lineii{\code{U-00000000} ... \code{U-0000007F}}{0xxxxxxx}
+\lineii{\code{U-00000080} ... \code{U-000007FF}}{110xxxxx 10xxxxxx}
+\lineii{\code{U-00000800} ... \code{U-0000FFFF}}{1110xxxx 10xxxxxx 10xxxxxx}
+\lineii{\code{U-00010000} ... \code{U-001FFFFF}}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
+\lineii{\code{U-00200000} ... \code{U-03FFFFFF}}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
+\lineii{\code{U-04000000} ... \code{U-7FFFFFFF}}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
 \end{tableii}
 
 The least significant bit of the Unicode character is the rightmost x
 bit.
 
-As UTF-8 is an 8bit encoding no BOM is required and any U+FEFF
+As UTF-8 is an 8bit encoding no BOM is required and any \code{U+FEFF}
 character in the decoded Unicode string (even if it's the first
-character) is treated as a ZERO WIDTH NO-BREAK SPACE.
+character) is treated as a \samp{ZERO WIDTH NO-BREAK SPACE}.
 
 Without external information it's impossible to reliably determine
 which encoding was used for encoding a Unicode string. Each charmap
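
Reader's note on the hunk above: the first four rows of the table can be turned into code almost mechanically. The sketch below is a hypothetical helper (`utf8_encode_codepoint` is not a Python API) that fills the marker and payload bits by hand. It stops at `U+10FFFF` because current Unicode and Python do not use the five- and six-byte forms from the last two rows, and it makes no attempt to reject surrogates.

```python
def utf8_encode_codepoint(cp):
    """Encode one codepoint following the marker/payload layout of the table."""
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("codepoint out of range for this sketch")


# One sample codepoint per table row, compared with the built-in codec.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
matches = [utf8_encode_codepoint(cp) == chr(cp).encode("utf-8") for cp in samples]
```

The rightmost `x` in each pattern receives the least significant bit, which is why every branch ends with `cp & 0x3F`.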
@@ -609,14 +610,14 @@ encoding can decode any random byte sequence. However that's not
 possible with UTF-8, as UTF-8 byte sequences have a structure that
 doesn't allow arbitrary byte sequence. To increase the reliability
 with which a UTF-8 encoding can be detected, Microsoft invented a
-variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad
+variant of UTF-8 (that Python 2.5 calls \code{"utf-8-sig"}) for its Notepad
 program: Before any of the Unicode characters is written to the file,
-a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef,
-0xbb, 0xbf) is written. As it's rather improbably that any charmap
-encoded file starts with these byte values (which would e.g. map to
+a UTF-8 encoded BOM (which looks like this as a byte sequence: \code{0xef},
+\code{0xbb}, \code{0xbf}) is written. As it's rather improbably that any
+charmap encoded file starts with these byte values (which would e.g. map to
 
-LATIN SMALL LETTER I WITH DIAERESIS
-RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
+LATIN SMALL LETTER I WITH DIAERESIS \\
+RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK \\
 INVERTED QUESTION MARK
 
 in iso-8859-1), this increases the probability that a utf-8-sig
@@ -624,9 +625,9 @@ encoding can be correctly guessed from the byte sequence. So here the
 BOM is not used to be able to determine the byte order used for
 generating the byte sequence, but as a signature that helps in
 guessing the encoding. On encoding the utf-8-sig codec will write
-0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding
-utf-8-sig will skip those three bytes if they appear as the first
-three bytes in the file.
+\code{0xef}, \code{0xbb}, \code{0xbf} as the first three bytes to the file.
+On decoding utf-8-sig will skip those three bytes if they appear as the
+first three bytes in the file.
 
 
 \subsection{Standard Encodings\label{standard-encodings}}
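
Reader's note on the last two hunks: the signature behaviour of `utf-8-sig` is easy to check interactively. The sketch below assumes Python 3 and uses only the standard `utf-8-sig`, `utf-8` and `latin-1` codecs plus `unicodedata`. The variable names are illustrative.

```python
import unicodedata

# Encoding with utf-8-sig writes the three signature bytes first.
data = "abc".encode("utf-8-sig")
signature = data[:3]

# Decoding with utf-8-sig skips the signature if it is present ...
with_sig = data.decode("utf-8-sig")
# ... and is harmless if it is absent.
without_sig = b"abc".decode("utf-8-sig")

# Plain utf-8 keeps the signature as a U+FEFF character.
plain = data.decode("utf-8")

# Read as iso-8859-1, the same three bytes map to the characters
# listed in the documentation text.
names = [unicodedata.name(ch) for ch in signature.decode("latin-1")]
```

The contrast between `with_sig` and `plain` shows why a consumer has to pick `utf-8-sig` explicitly when a file may carry the signature.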