Commit 5c37a771, authored Dec 31, 2002 by Martin v. Löwis
Document standard encodings.
parent a8aed02f
1 changed file with 343 additions and 0 deletions
Doc/lib/libcodecs.tex
@@ -511,3 +511,346 @@ the \function{lookup()} function to construct the instance.
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
\subsection{Standard Encodings}
Python comes with a number of codecs built in, either implemented as C
functions or with dictionaries as mapping tables. The following table
lists the codecs by name, together with a few common aliases and the
languages for which the encoding is likely to be used. Neither the list
of aliases nor the list of languages is meant to be exhaustive. Note
that spelling alternatives that differ only in case, or that use a
hyphen instead of an underscore, are also valid aliases.
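The alias rule above can be checked with \function{codecs.lookup()}. The following is a minimal sketch, assuming a current Python 3 interpreter; the variable names are illustrative only, and the canonical name that the interpreter reports may be spelled differently from the first column of the table.

```python
import codecs

# Spellings that differ only in case, or in hyphen versus underscore,
# are expected to resolve to one and the same codec.
spellings = ["utf_8", "UTF-8", "utf-8", "Utf_8"]
resolved = {codecs.lookup(name).name for name in spellings}

# Aliases listed in the table resolve to the codec in the first column.
latin1_spellings = ["latin_1", "iso-8859-1", "L1", "cp819"]
latin1_names = {codecs.lookup(name).name for name in latin1_spellings}

print(resolved)       # a set with a single canonical name
print(latin1_names)   # likewise a single canonical name
```

Each set collapsing to a single element shows that every spelling is mapped to the same codec before it is looked up.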
Many of the character sets support the same languages. They vary in
individual characters (e.g. whether the EURO SIGN is supported or
not), and in the assignment of characters to code positions. For the
European languages in particular, the following variants typically
exist:
\begin{itemize}
\item an ISO 8859 codeset
\item a Microsoft Windows code page, which is typically derived from
      an 8859 codeset, but replaces control characters with additional
      graphic characters
\item an IBM EBCDIC code page
\item an IBM PC code page, which is ASCII compatible
\end{itemize}
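To make the differences between these variants concrete, the sketch below encodes the EURO SIGN and the letter A with one codec from each family. It assumes a current Python 3 interpreter; \code{try_encode} is a hypothetical helper introduced only for this illustration.

```python
euro = "\u20ac"  # EURO SIGN


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


# Support for an individual character varies between related character sets.
euro_bytes = {
    codec: try_encode(euro, codec)
    for codec in ["latin_1", "iso8859_15", "cp1252", "cp437"]
}

# The same character can also sit at different code positions:
# the ISO, Windows and IBM PC sets are ASCII compatible, EBCDIC is not.
letter_a = {
    codec: "A".encode(codec)
    for codec in ["latin_1", "cp1252", "cp437", "cp500"]
}

print(euro_bytes)
print(letter_a)
```

Here \code{latin_1} and \code{cp437} cannot encode the EURO SIGN at all, while \code{iso8859_15} and \code{cp1252} place it at different code positions.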
\begin{longtableiii}{l|l|l}{textrm}{Codec}{Aliases}{Languages}
\lineiii{ascii}{646, us-ascii}{English}
\lineiii{cp037}{IBM037, IBM039}{English}
\lineiii{cp424}{EBCDIC-CP-HE, IBM424}{Hebrew}
\lineiii{cp437}{437, IBM437}{English}
\lineiii{cp500}{EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500}{Western Europe}
\lineiii{cp737}{}{Greek}
\lineiii{cp775}{IBM775}{Baltic languages}
\lineiii{cp850}{850, IBM850}{Western Europe}
\lineiii{cp852}{852, IBM852}{Central and Eastern Europe}
\lineiii{cp855}{855, IBM855}{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
\lineiii{cp856}{}{Hebrew}
\lineiii{cp857}{857, IBM857}{Turkish}
\lineiii{cp860}{860, IBM860}{Portuguese}
\lineiii{cp861}{861, CP-IS, IBM861}{Icelandic}
\lineiii{cp862}{862, IBM862}{Hebrew}
\lineiii{cp863}{863, IBM863}{Canadian}
\lineiii{cp864}{IBM864}{Arabic}
\lineiii{cp865}{865, IBM865}{Danish, Norwegian}
\lineiii{cp869}{869, CP-GR, IBM869}{Greek}
\lineiii{cp874}{}{Thai}
\lineiii{cp875}{}{Greek}
\lineiii{cp1006}{}{Urdu}
\lineiii{cp1026}{ibm1026}{Turkish}
\lineiii{cp1140}{ibm1140}{Western Europe}
\lineiii{cp1250}{windows-1250}{Central and Eastern Europe}
\lineiii{cp1251}{windows-1251}{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
\lineiii{cp1252}{windows-1252}{Western Europe}
\lineiii{cp1253}{windows-1253}{Greek}
\lineiii{cp1254}{windows-1254}{Turkish}
\lineiii{cp1255}{windows-1255}{Hebrew}
\lineiii{cp1256}{windows-1256}{Arabic}
\lineiii{cp1257}{windows-1257}{Baltic languages}
\lineiii{cp1258}{windows-1258}{Vietnamese}
\lineiii{latin_1}{iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1}{Western Europe}
\lineiii{iso8859_2}{iso-8859-2, latin2, L2}{Central and Eastern Europe}
\lineiii{iso8859_3}{iso-8859-3, latin3, L3}{Esperanto, Maltese}
\lineiii{iso8859_4}{iso-8859-4, latin4, L4}{Baltic languages}
\lineiii{iso8859_5}{iso-8859-5, cyrillic}{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
\lineiii{iso8859_6}{iso-8859-6, arabic}{Arabic}
\lineiii{iso8859_7}{iso-8859-7, greek, greek8}{Greek}
\lineiii{iso8859_8}{iso-8859-8, hebrew}{Hebrew}
\lineiii{iso8859_9}{iso-8859-9, latin5, L5}{Turkish}
\lineiii{iso8859_10}{iso-8859-10, latin6, L6}{Nordic languages}
\lineiii{iso8859_13}{iso-8859-13}{Baltic languages}
\lineiii{iso8859_14}{iso-8859-14, latin8, L8}{Celtic languages}
\lineiii{iso8859_15}{iso-8859-15}{Western Europe}
\lineiii{koi8_r}{}{Russian}
\lineiii{koi8_u}{}{Ukrainian}
\lineiii{mac_cyrillic}{maccyrillic}{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
\lineiii{mac_greek}{macgreek}{Greek}
\lineiii{mac_iceland}{maciceland}{Icelandic}
\lineiii{mac_latin2}{maclatin2, maccentraleurope}{Central and Eastern Europe}
\lineiii{mac_roman}{macroman}{Western Europe}
\lineiii{mac_turkish}{macturkish}{Turkish}
\lineiii{utf_16}{U16, utf16}{all languages}
\lineiii{utf_16_be}{UTF-16BE}{all languages (BMP only)}
\lineiii{utf_16_le}{UTF-16LE}{all languages (BMP only)}
\lineiii{utf_7}{U7}{all languages}
\lineiii{utf_8}{U8, UTF, utf8}{all languages}
\end{longtableiii}
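The UTF rows at the end of the table can be compared directly. The sketch below assumes a current Python 3 interpreter. It relies on the \code{utf_16} codec prefixing its output with a byte order mark in native byte order, which is how present-day CPython behaves rather than something the table itself states.

```python
import codecs

text = "A"

with_bom = text.encode("utf_16")    # byte order mark, then native byte order
big = text.encode("utf_16_be")      # explicit big-endian, no byte order mark
little = text.encode("utf_16_le")   # explicit little-endian, no byte order mark

# The payload after the two-byte mark matches one of the explicit variants.
payload = with_bom[len(codecs.BOM_UTF16):]

# UTF-8 needs a single byte for ASCII and several bytes for other characters.
e_acute = "\u00e9".encode("utf_8")

print(with_bom, big, little, e_acute)
```

Decoding with \code{utf_16} consumes the mark again, so \code{with_bom.decode("utf_16")} gives back the original string.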
A number of codecs are specific to Python, so their codec names have
no meaning outside Python. Some of them don't convert from Unicode
strings to byte strings, but instead use the property of the Python
codecs machinery that any bijective function with one argument can be
considered as an encoding.
For the codecs listed below, the result in the ``encoding'' direction
is always a byte string. The result of the ``decoding'' direction is
listed as operand type in the table.
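As an illustration of codecs that map byte strings to byte strings, the sketch below round-trips a value through \code{hex_codec} and \code{base64_codec}. It assumes a current Python 3 interpreter, where such codecs are reached through \function{codecs.encode()} and \function{codecs.decode()} rather than through string methods, and where \code{rot_13} operates on text strings instead of byte strings; both points differ from the behaviour described in this section.

```python
import codecs

data = b"hello"

# Byte string in, byte string out: two hexadecimal digits per input byte.
hexed = codecs.encode(data, "hex_codec")
unhexed = codecs.decode(hexed, "hex_codec")

# MIME base64; the encoder appends a trailing newline.
b64 = codecs.encode(data, "base64_codec")
unb64 = codecs.decode(b64, "base64_codec")

# rot_13 is its own inverse, so encoding twice restores the input.
# Assumption: Python 3, where this codec maps str to str.
rotated = codecs.encode("Hello", "rot_13")
restored = codecs.encode(rotated, "rot_13")

print(hexed, b64, rotated)
```

In each case decoding undoes encoding exactly, which is the bijective property the text refers to.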
\begin{tableiv}{l|l|l|l}{textrm}{Codec}{Aliases}{Operand type}{Purpose}
\lineiv{base64_codec}{base64, base-64}{byte string}{Convert operand to MIME base64}
\lineiv{hex_codec}{hex}{byte string}{Convert operand to hexadecimal representation, with two digits per byte}
\lineiv{mbcs}{dbcs}{Unicode string}{Windows only: Encode operand according to the ANSI codepage (CP_ACP)}
\lineiv{palmos}{}{Unicode string}{Encoding of PalmOS 3.5}
\lineiv{quopri_codec}{quopri, quoted-printable, quotedprintable}{byte string}{Convert operand to MIME quoted printable}
\lineiv{raw_unicode_escape}{}{Unicode string}{Produce a string that is suitable as raw Unicode literal in Python source code}
\lineiv{rot_13}{rot13}{byte string}{Return the Caesar-cypher encryption of the operand}
\lineiv{string_escape}{}{byte string}{Produce a string that is suitable as string literal in Python source code}
\lineiv{undefined}{}{any}{Raise an exception for all conversions. Can be used as the system encoding if no automatic coercion between byte and Unicode strings is desired.}
\lineiv{unicode_escape}{}{Unicode string}{Produce a string that is suitable as Unicode literal in Python source code}
\lineiv{unicode_internal}{}{Unicode string}{Return the internal representation of the operand}
\lineiv{uu_codec}{uu}{byte string}{Convert the operand using uuencode}
\lineiv{zlib_codec}{zip, zlib}{byte string}{Compress the operand using gzip}
\end{tableiv}
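Two kinds of entries in this table lend themselves to a short demonstration: the escape codecs, which turn a Unicode string into bytes usable inside a source-code literal, and \code{zlib_codec}, which compresses a byte string. This is a minimal sketch assuming a current Python 3 interpreter; some other codecs in the table, such as \code{string_escape} and \code{unicode_internal}, are no longer available there and are left out.

```python
import codecs
import zlib

text = "caf\u00e9 \u20ac\n"

# unicode_escape writes every non-ASCII character and the newline
# as a backslash escape.
escaped = text.encode("unicode_escape")

# raw_unicode_escape keeps Latin-1 characters and the newline as single
# bytes and only escapes characters outside Latin-1 as \uXXXX.
raw_escaped = text.encode("raw_unicode_escape")

# zlib_codec is byte string to byte string, so it goes through
# codecs.encode() and codecs.decode().
payload = b"standard encodings " * 20
packed = codecs.encode(payload, "zlib_codec")
unpacked = codecs.decode(packed, "zlib_codec")

print(escaped)
print(raw_escaped)
print(len(payload), len(packed))
```

The output of \code{zlib_codec} can also be read back with \function{zlib.decompress()} from the standard library.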