Python comes with a number of built-in codecs, implemented either as C
functions or with dictionaries as mapping tables. The following table
lists the codecs by name, together with a few common aliases and the
languages for which each encoding is likely to be used. Neither the list
of aliases nor the list of languages is meant to be exhaustive. Note
that spelling alternatives that differ only in case, or that use a hyphen
instead of an underscore, are also valid aliases.
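As a quick check of the aliasing rule, the sketch below asks codecs.lookup() to resolve several spellings from the table and compares the canonical names it reports. It assumes a current Python 3 interpreter; the helper name same_codec is made up for this illustration.

```python
import codecs


def same_codec(*spellings):
    """Return True if every spelling resolves to the same codec.

    Hypothetical helper for this example; it compares the canonical
    ``name`` attribute of the CodecInfo objects returned by
    codecs.lookup().
    """
    names = {codecs.lookup(spelling).name for spelling in spellings}
    return len(names) == 1


# Case and hyphen/underscore differences do not matter.
print(same_codec("utf_8", "UTF-8", "Utf_8", "U8"))         # True
print(same_codec("ascii", "US-ASCII", "us_ascii", "646"))  # True

# Unknown names raise LookupError rather than returning None.
try:
    codecs.lookup("no_such_codec")
except LookupError as exc:
    print("lookup failed:", exc)
```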

Many of the character sets support the same languages. They vary in
individual characters (e.g. whether the EURO SIGN is supported or
not), and in the assignment of characters to code positions. For the
European languages in particular, the following variants typically
exist:

- an ISO 8859 codeset
- a Microsoft Windows code page, which is typically derived from
  an 8859 codeset, but replaces control characters with additional
  graphic characters
- an IBM EBCDIC code page
- an IBM PC code page, which is ASCII compatible
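To make these differences concrete, the following sketch encodes the same characters with a few codecs from each family. It assumes Python 3, and the helper try_encode is a name invented here; the byte values in the comments are those of the standard mapping tables.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec lacks a character.

    Hypothetical helper for this example.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


EURO = "\u20ac"  # EURO SIGN

# Support for individual characters differs between codecs.
print(try_encode(EURO, "iso8859_15"))  # b'\xa4'
print(try_encode(EURO, "cp1250"))      # b'\x80'
print(try_encode(EURO, "cp437"))       # None: no EURO SIGN in this code page

# The same character can sit at different code positions.
print(try_encode("\u00e9", "iso8859_15"))  # b'\xe9'
print(try_encode("\u00e9", "cp437"))       # b'\x82'

# A Windows code page puts graphic characters where ISO 8859 has controls.
print(ascii(b"\x80".decode("iso8859_15")))  # '\x80' (a control character)
print(ascii(b"\x80".decode("cp1250")))      # '\u20ac'

# The IBM PC code page is ASCII compatible; the EBCDIC one is not.
print("A".encode("cp437"))  # b'A'
print("A".encode("cp500"))  # b'\xc1'
```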

Codec           Aliases                                  Languages
--------------  ---------------------------------------  -----------------------------------------------------
ascii           646, us-ascii                            English
big5            big5-tw, csbig5                          Traditional Chinese
big5hkscs       big5-hkscs, hkscs                        Traditional Chinese
cp037           IBM037, IBM039                           English
cp424           EBCDIC-CP-HE, IBM424                     Hebrew
cp437           437, IBM437                              English
cp500           EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500       Western Europe
cp737                                                    Greek
cp775           IBM775                                   Baltic languages
cp850           850, IBM850                              Western Europe
cp852           852, IBM852                              Central and Eastern Europe
cp855           855, IBM855                              Bulgarian, Byelorussian, Macedonian, Russian, Serbian
cp856                                                    Hebrew
cp857           857, IBM857                              Turkish
cp860           860, IBM860                              Portuguese
cp861           861, CP-IS, IBM861                       Icelandic
cp862           862, IBM862                              Hebrew
cp863           863, IBM863                              Canadian
cp864           IBM864                                   Arabic
cp865           865, IBM865                              Danish, Norwegian
cp866           866, IBM866                              Russian
cp869           869, CP-GR, IBM869                       Greek
cp874                                                    Thai
cp875                                                    Greek
cp932           932, ms932, mskanji, ms-kanji            Japanese
cp949           949, ms949, uhc                          Korean
cp950           950, ms950                               Traditional Chinese
cp1006                                                   Urdu
cp1026          ibm1026                                  Turkish
cp1140          ibm1140                                  Western Europe
cp1250          windows-1250                             Central and Eastern Europe
cp1251          windows-1251                             Bulgarian, Byelorussian, Macedonian, Russian, Serbian
iso8859_6       iso-8859-6, arabic                       Arabic
iso8859_7       iso-8859-7, greek, greek8                Greek
iso8859_8       iso-8859-8, hebrew                       Hebrew
iso8859_9       iso-8859-9, latin5, L5                   Turkish
iso8859_10      iso-8859-10, latin6, L6                  Nordic languages
iso8859_13      iso-8859-13                              Baltic languages
iso8859_14      iso-8859-14, latin8, L8                  Celtic languages
iso8859_15      iso-8859-15                              Western Europe
johab           cp1361, ms1361                           Korean
koi8_r                                                   Russian
koi8_u                                                   Ukrainian
mac_cyrillic    maccyrillic                              Bulgarian, Byelorussian, Macedonian, Russian, Serbian
mac_greek       macgreek                                 Greek
mac_iceland     maciceland                               Icelandic
mac_latin2      maclatin2, maccentraleurope              Central and Eastern Europe
mac_roman       macroman                                 Western Europe
mac_turkish     macturkish                               Turkish
ptcp154         csptcp154, pt154, cp154, cyrillic-asian  Kazakh
shift_jis       csshiftjis, shiftjis, sjis, s_jis        Japanese
shift_jis_2004  shiftjis2004, sjis_2004, sjis2004        Japanese
shift_jisx0213  shiftjisx0213, sjisx0213, s_jisx0213     Japanese
utf_16          U16, utf16                               all languages
utf_16_be       UTF-16BE                                 all languages (BMP only)
utf_16_le       UTF-16LE                                 all languages (BMP only)
utf_7           U7, unicode-1-1-utf-7                    all languages
utf_8           U8, UTF, utf8                            all languages
utf_8_sig                                                all languages

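A short usage sketch for a few entries of the table, assuming Python 3: text is encoded with a codec name or one of its aliases and decoded back again. The byte order mark behaviour shown for utf_8_sig and utf_16 is not stated in the table; it is included here as a known property of those codecs.

```python
import codecs

greek = "\u0391\u03b8\u03ae\u03bd\u03b1"  # the Greek word for "Athens"

# An alias selects the same codec as the canonical name.
encoded = greek.encode("iso8859_7")
assert encoded == greek.encode("greek")
assert encoded.decode("iso8859_7") == greek

# Each character takes a single byte in this 8-bit codec.
print(len(greek), len(encoded))  # 5 5

# utf_8_sig writes a UTF-8 byte order mark and drops it when decoding.
signed = "abc".encode("utf_8_sig")
print(signed.startswith(codecs.BOM_UTF8))  # True
print(signed.decode("utf_8_sig"))          # abc

# utf_16 prefixes a byte order mark; utf_16_le and utf_16_be do not.
print("abc".encode("utf_16_le"))    # b'a\x00b\x00c\x00'
print("abc".encode("utf_16_be"))    # b'\x00a\x00b\x00c'
print(len("abc".encode("utf_16")))  # 8
```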
A number of codecs are specific to Python, so their codec names have
no meaning outside Python. Some of them don't convert from Unicode
strings to byte strings, but instead use the property of the Python
codecs machinery that any bijective function with one argument can be
considered as an encoding.
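A minimal sketch of such Python-specific codecs, assuming Python 3: there, str.encode() only accepts text encodings, so the byte-to-byte codecs are reached through codecs.encode() and codecs.decode(). The codec names used are hex_codec, base64_codec and bz2_codec.

```python
import codecs

data = b"hello"

# hex_codec: two hexadecimal digits per byte.
hexed = codecs.encode(data, "hex_codec")
print(hexed)                              # b'68656c6c6f'
print(codecs.decode(hexed, "hex_codec"))  # b'hello'

# base64_codec: MIME base64, including the trailing newline.
b64 = codecs.encode(data, "base64_codec")
print(b64)                                # b'aGVsbG8=\n'

# bz2_codec: compression; decoding restores the original bytes.
packed = codecs.encode(data * 100, "bz2_codec")
print(len(packed) < len(data * 100))                     # True
print(codecs.decode(packed, "bz2_codec") == data * 100)  # True
```

Because each of these functions is reversible, decoding the encoded value always returns the original operand.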
For the codecs listed below, the result in the "encoding" direction
is always a byte string. The result of the "decoding" direction is
listed as the operand type in the table.

Codec         Aliases          Operand type  Purpose
------------  ---------------  ------------  -----------------------------------------------------------------------
base64_codec  base64, base-64  byte string   Convert operand to MIME base64
bz2_codec     bz2              byte string   Compress the operand using bz2
hex_codec     hex              byte string   Convert operand to hexadecimal representation, with two digits per byte