Appendix E. Common Content Encodings

In an ideal world, the only character encodings (or, loosely, "character sets") that you'd ever see would be UTF-8 (utf-8) and, for legacy documents, Latin-1 (iso-8859-1). However, the encodings listed below also exist and can be found on the Web. They appear in order of their English names, with the lefthand side being the value returned by $response->content_charset. The complete list of registered character sets can be found at http://www.iana.org/assignments/character-sets.

Value         Encoding
us-ascii      ASCII plain (just characters 0x00-0x7F)
asmo-708      Arabic ASMO-708
iso-8859-6    Arabic ISO
dos-720       Arabic MSDOS
windows-1256  Arabic MSWindows
iso-8859-4    Baltic ISO
windows-1257  Baltic MSWindows
iso-8859-2    Central European ISO
ibm852        Central European MSDOS
windows-1250  Central European MSWindows
hz-gb-2312    Chinese Simplified (HZ)
gb2312        Chinese Simplified (GB2312)
euc-cn        Chinese Simplified EUC
big5          Chinese Traditional (Big5)
cp866         Cyrillic DOS
iso-8859-5    Cyrillic ISO
koi8-r        Cyrillic KOI8-R
koi8-u        Cyrillic KOI8-U
windows-1251  Cyrillic MSWindows
iso-8859-7    Greek ISO
windows-1253  Greek MSWindows
iso-8859-8-i  Hebrew ISO Logical
iso-8859-8    Hebrew ISO Visual
dos-862       Hebrew MSDOS
windows-1255  Hebrew MSWindows
euc-jp        Japanese EUC-JP
iso-2022-jp   Japanese JIS
shift_jis     Japanese Shift-JIS
iso-2022-kr   Korean ISO
euc-kr        Korean Standard
windows-874   Thai MSWindows
iso-8859-9    Turkish ISO
windows-1254  Turkish MSWindows
utf-8         Unicode expressed as UTF-8
utf-16        Unicode expressed as UTF-16
windows-1258  Vietnamese MSWindows
viscii        Vietnamese VISCII
iso-8859-1    Western European (Latin-1)
windows-1252  Western European (Latin-1) with extra characters in 0x80-0x9F
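
Once you have one of the values listed above, a common next step is to turn the raw bytes of the response body into Perl characters. The following is a minimal sketch of that step using the core Encode module. The helper name decode_with_charset is hypothetical, and the charset strings are written as literals here on the assumption that in real code they would come from $response->content_charset. The label is lowercased before lookup as a precaution, since the case of the returned value may not match the table exactly.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: decode raw bytes using a charset label such as
# those in the Value column. Returns undef if the label is missing or
# if Encode does not recognize it.
sub decode_with_charset {
    my ($charset, $bytes) = @_;
    return undef unless defined $charset;
    my $enc = find_encoding(lc $charset) or return undef;
    return decode($enc->name, $bytes);
}

# Six bytes that spell a Russian word in windows-1251.
my $bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";
my $text  = decode_with_charset('windows-1251', $bytes);

# windows-1252 assigns printable characters to 0x80-0x9F;
# here byte 0x93 becomes U+201C (left double quotation mark).
my $quote = decode_with_charset('windows-1252', "\x93");
```

Not every label in the list is necessarily known to a given Encode installation, which is why the helper returns undef instead of dying; the caller can then fall back to a default such as iso-8859-1 or skip the document.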