diff options
Diffstat (limited to 'doc/rfc/rfc1922.txt')
-rw-r--r-- | doc/rfc/rfc1922.txt | 1515 |
1 files changed, 1515 insertions, 0 deletions
diff --git a/doc/rfc/rfc1922.txt b/doc/rfc/rfc1922.txt new file mode 100644 index 0000000..981b91a --- /dev/null +++ b/doc/rfc/rfc1922.txt @@ -0,0 +1,1515 @@ + + + + + + +Network Working Group HF. Zhu +Request for Comments: 1922 Tsinghua U +Category: Informational DY. Hu + Tsinghua U + ZG. Wang + CITS + TC. Kao + III + WCH. Chang + III + M. Crispin + U Washington + March 1996 + + + Chinese Character Encoding for Internet Messages + +Status of this Memo + + This memo provides information for the Internet community. It does + not specify an Internet standard. Distribution of this memo is + unlimited. + +Abstract + + This memo describes methods of transporting Chinese characters in + Internet services which transport text, such as electronic mail + [RFC-822], network news [RFC-1036], telnet [RFC-854] and the World + Wide Web [RFC-1866]. + +Introduction + + As the use of Internet covers more and more Chinese people in the + world, the need has increased for the ability to send documents + containing Chinese characters on the Internet. The methods described + in this document provide means of transporting existing Chinese + character sets as well as leaving space for future extension. + + This document describes two encodings, ISO-2022-CN and + ISO-2022-CN-EXT. These are designed with interoperability in mind + and are encouraged in this document for current Chinese interchange; + they are 7-bit, support both simplified and traditional characters + using both GB and CNS/Big5, and do not impose any unusual quoting + requirements on ASCII characters. + + As important related issues, this document gives detailed + descriptions of the two encodings CN-GB and CN-Big5, and a brief + description of ISO/IEC 10646 [ISO-10646]. CN-GB and CN-Big5 are + + + +Zhu, et al Informational [Page 1] + +RFC 1922 Chinese Character Encoding March 1996 + + + currently used as the internal codes for Chinese documents. + ISO-10646 is the universal multi-octet character set defined by ISO; + we feel that in the future it may become the preferred technology for + Chinese documents and electronic mail when it is widely available. + +Specification + +1. 7-bit Chinese encodings: ISO-2022-CN and ISO-2022-CN-EXT + +1.1. Description + + ISO-2022-CN is based on ISO 2022 [ISO-2022], similar to earlier work + on ISO-2022-JP [RFC-1468] and ISO-2022-KR [RFC-1557] for the Japanese + and Korean languages respectively. It is 7-bit, and supports both + simplified Chinese characters using GB 2312-80 [GB-2312] and + traditional Chinese characters using the first two planes of CNS + 11643 [CNS-11643], as well as ASCII [ASCII] characters. + + ISO-2022-CN-EXT is a superset of ISO-2022-CN that additionally + supports other GB character sets and planes of CNS 11643. + + Since ISO-2022-CN and ISO-2022-CN-EXT are 7-bit encodings, they do + not require the 8-bit SMTP extensions. ISO-2022-CN supports all the + Chinese characters that appear in Big5 [BIG5]. + +1.2. ISO-2022-CN + + The starting code of ISO-2022-CN is ASCII. ASCII and Chinese + characters are distinguished by designations (ESC sequences) and + shift functions. + + Designations define the Chinese character sets used in the text. + There are three kinds of designations: SOdesignation, SS2designation + and SS3designation. + + The SOdesignation is in the form ESC $ ) <F>, where <F> is the "final + character" assigned to the character set by ISO (refer to the ISO + registry [ISOREG] for more details). The SS2designation is in the + form ESC $ * <F>, and the SS3designation is in the form ESC $ + <F>. + A designation overrides any previous designation for subsequent bytes + in the text. + + There are four kinds of shifts: SI, SO, SS2 and SS3. Shift functions + specify how to interpret the subsequent bytes. + + The shift SI (one byte with hexadecimal value 0F) declares that + subsequent bytes are interpreted in ASCII. + + + + +Zhu, et al Informational [Page 2] + +RFC 1922 Chinese Character Encoding March 1996 + + + The shift SO (one byte with hexadecimal value 0E) declares that + subsequent bytes are interpreted in the character set defined by + SOdesignation. + + The shift SS2 (two bytes with hexadecimal values 1B 4E) declares that + the subsequent TWO bytes are interpreted in the character set defined + by SS2designation, after which the previous interpretation (from SI + or SO) is restored. + + The shift SS3 (two bytes with hexadecimal values 1B 4F) declares that + the subsequent TWO bytes are interpreted in the character set defined + by SS3designation, after which the previous interpretation (from SI + or SO) is restored. + + The escape sequences, shift functions and character sets used in an + ISO-2022-CN text are as follows: + + Character sets Shift in with + -------------------------------------------------------------------- + ASCII SI + GB 2312, CNS 11643-plane-1 SO + CNS 11643-plane-2 SS2 + + ESC $ ) A Indicates the bytes following SO are Chinese + characters as defined in GB 2312-80, until + another SOdesignation appears + + ESC $ ) G Indicates the bytes following SO are as defined + in CNS 11643-plane-1, until another + SOdesignation appears + + ESC $ * H Indicates the two bytes immediately following + SS2 is a Chinese character as defined in CNS + 11643-plane-2, until another SS2designation + appears + + If there are any GB or CNS characters on a line, a designation for + the corresponding character set must be used so that each line has + its own character set information and the text can be displayed + correctly when scroll back in a window. Also, there must be a shift + to ASCII (SI) before the end of the line (i.e., before the CRLF). In + other words, each line starts in ASCII, and ends in ASCII. + + Example: the hex sequence + + 1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f + + represents the Chinese word for "Interchange" (jiao huan) twice; + + + +Zhu, et al Informational [Page 3] + +RFC 1922 Chinese Character Encoding March 1996 + + + the first time in simplified form using GB-2312 (the 3d 3b 3b 3b + sequence above), and the second time in traditional form using + CNS-11643 (the 47 28 5f 50 sequence above). The sequence 1b 24 29 + 41 is the SOdesignation for GB-2312, the 0e is SO to switch to + Chinese from ASCII, the 1b 24 29 47 is the SOdesignation for + CNS-11643 plane 1, and finally the 0f is the SI to return to ASCII + at the end of the line. + + The name given to this character encoding is "ISO-2022-CN". This name + is intended to be used as the "charset" parameter in MIME [MIME-1, + MIME-2] messages. + + Content-Type: text/plain; charset=iso-2022-cn + + The ISO-2022-CN encoding is already in 7-bit form, so it is not + necessary to use a Content-Transfer-Encoding header. + + Other restrictions are given in the "Formal Syntax of ISO-2022-CN" + (Section 7.1 of this document). + +1.3. ISO-2022-CN-EXT + + ISO-2022-CN-EXT supports all characters in existing GB, Big5 and CNS + 11643 character sets. + + The escape sequences, shift functions and character sets used in an + ISO-2022-CN-EXT text are as follows: + + Character sets Shift in with + -------------------------------------------------------------------- + ASCII SI + GB 2312, GB 12345, CNS 11643-plane-1, ISO-IR-165 SO + GB 7589, GB 13131, CNS 11643-plane-2 SS2 + GB 7590, GB 13132 or other new GBs,CNS 11643-plane-3 or SS3 + higher planes of CNS 11643 + + Note: Currently, there are some GB sets that have not been + registered in ISO. Here <X7589>, <X7590>, <X12345>, <X13131> and + <X13132> represent the final character that will be assigned by + ISO for those sets. These GB sets shall only be used once these + final characters are assigned. + + + + + + + + + + +Zhu, et al Informational [Page 4] + +RFC 1922 Chinese Character Encoding March 1996 + + + ESC $ ) A Indicates the bytes following SO are Chinese + characters as defined in GB 2312-80, until + another SOdesignation appears + + ESC $ * <X7589> Indicates the two bytes immediately following + SS2 is a Chinese character as defined in GB + 7589-87 [GB-7589], until another SS2designation + appears + + ESC $ + <X7590> Indicates the two bytes immediately following + SS3 is a Chinese character as defined in GB + 7590-87 [GB-7590], until another SS3designation + appears + + ESC $ ) <X12345> Indicates the bytes following SO are as defined + in GB 12345-90 [GB-12345], until another + SOdesignation appears + + ESC $ * <X13131> Indicates the two bytes immediately following + SS2 is a Chinese character as defined in GB + 13131-91 [GB-13131], until another + SS2designation appears + + ESC $ + <X13132> Indicates the two bytes immediately following + SS3 is a Chinese character as defined in GB + 13132-91 [GB-13131], until another + SS3designation appears + + ESC $ ) E Indicates the bytes following SO are as defined + in ISO-IR-165 (for details, see section 2.1), + until another SOdesignation appears + + ESC $ ) G Indicates the bytes following SO are as defined + in CNS 11643-plane-1, until another + SOdesignation appears + + ESC $ * H Indicates the two bytes immediately following + SS2 is a Chinese character as defined in CNS + 11643-plane-2, until another SS2designation + appears + + ESC $ + I Indicates the immediate two bytes following SS3 + is a Chinese character as defined in CNS + 11643-plane-3, until another SS3designation + appears + + + + + + +Zhu, et al Informational [Page 5] + +RFC 1922 Chinese Character Encoding March 1996 + + + ESC $ + J Indicates the immediate two bytes following SS3 + is a Chinese character as defined in CNS + 11643-plane-4, until another SS3designation + appears + + ESC $ + K Indicates the immediate two bytes following SS3 + is a Chinese character as defined in CNS + 11643-plane-5, until another SS3designation + appears + + ESC $ + L Indicates the immediate two bytes following SS3 + is a Chinese character as defined in CNS + 11643-plane-6, until another SS3designation + appears + + ESC $ + M Indicates the immediate two bytes following SS3 + is a Chinese character as defined in CNS + 11643-plane-7, until another SS3designation + appears + + As in ISO-2022-CN, each line starts in ASCII, and ends in ASCII, and + has its own designation information before any Chinese characters + appear. + + The name given to this character encoding is "ISO-2022-CN-EXT". This + name is intended to be used as the "charset" parameter in MIME + messages. + + Content-Type: text/plain; charset=ISO-2022-CN-EXT + + The ISO-2022-CN-EXT encoding is also in 7-bit form, so it is not + necessary to use a Content-Transfer-Encoding header. + + Other restrictions are given in the "Formal Syntax of + ISO-2022-CN-EXT" (Section 7.2 of this document). + +1.4. How to Support Big5 or other internal codesets with ISO-2022-CN + and ISO-2022-CN-EXT + + Since there are many different Chinese internal coding systems + [CJKINF], such as EUC GB, Big5, CCCII (an encoding for library + systems mainly used in Taiwan), GBK (the new standard specification + for Chinese internal code, also is the codepage for Microsoft + simplified Chinese Windows 95) etc., ISO-2022-CN and ISO-2022-CN-EXT, + which are 7-bit and will not lose information during communication + among different codesets, facilitate interchange between the various + Chinese coding systems in the Internet. + + + + +Zhu, et al Informational [Page 6] + +RFC 1922 Chinese Character Encoding March 1996 + + + For instance, ISO-2022-CN and ISO-2022-CN-EXT can be used to support + the popular Big5 codeset, because the first two planes of CNS-11643 + contain the same Chinese characters as Big5's "common part" except + two duplicate characters. By the "common part" we mean the part that + is not specific to any Big5 vendor, consisting of 5401 more + frequently used characters in Big5 range 0xA440-0xC67E, 7652 less + frequently used characters in Big5 range 0xC940-0xF9D5, and 441 other + symbols in Big5 range 0xA140-0xA3E0, as defined in Institute for + Information Industry's (III) technical report C-26 (see also [Big5]). + The appendix of this document presents a conversion table for + converting Big5 into CNS-11643, including specific extensions of some + popular vendors. For other extensions, vendors and implementors of + Big5 products are ENCOURAGED to create detailed conversion tables, in + order to increase interoperability between different coding systems. + + Public domain software (binary or C source code) for conversion + between Big5 and CNS-11643 is available on many Internet sites. At + the time of this writing, the following FTP sites and software are + advertised: + + 1) Beijing: + ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/convert/big5cns.zip + (IP address: 166.111.1.6) + + 2) Xi'an: + ftp://ftp.xanet.edu.cn + /pub/chinese-soft/unix/convert/BeTTY-1.534.tar.gz + (IP address: 202.112.11.131) + + 3) Taiwan: + ftp://ftp.seed.net.tw/Pub/Chinese/DOS/code-convert/chcode.zip + (IP address: 140.92.1.65) + + 4) US: + ftp://ftp.ifcss.org/pub/software/unix/convert/BeTTY-1.534.tar.gz + (IP address: 128.123.1.55) + + 5) Japan: + ftp://etlport.etl.go.jp/pub/iso-2022-cn/convert/big5cns.zip + (IP address: 192.31.197.99) + + + + + + + + + + + +Zhu, et al Informational [Page 7] + +RFC 1922 Chinese Character Encoding March 1996 + + +2. 8-bit Chinese encodings: CN-GB and CN-Big5 + + The CN-GB and CN-Big5 MIME charsets are defined below. + + Note: the use of 8-bit character sets requires the use of either + an 8-to-7 Content-Transfer-Encoding mechanism such as "BASE64" or + "QUOTED-PRINTABLE" if the network is not 8-bit clean, or the 8-bit + SMTP extensions [SMTPEXT] with the "8BIT" + Content-Transfer-Encoding on 8-bit clean networks. Otherwise, an + 8-bit message that passes through a 7-bit mailer is likely to have + the 8th bit truncated, resulting in an unreadable message. + Although "just send 8-bit data" has been common practice in the + past, it is incorrect according to the Internet standards and + causes interoperability problems. + +2.1. CN-GB + + E-mail using CN-GB characters is sent in this way: + + GB 2312-80 characters are used with ASCII characters, not GB 1988-89 + [GB-1988]. + + GB 2312-80 is also 7-bit, to avoid conflicting with ASCII. If the + character is from GB 2312-80, the MSB (bit-8) of each byte is set to + 1, and therefore becomes a 8-bit character. Otherwise, the byte is + interpreted as ASCII. This constructs a character set named "GB + Internal Code". + + This method is also adopted in the .gb files in the Internet. + + To use this character scheme with MIME, CN-GB is used as the value + for the charset parameter: + + Content-Type: text/plain; charset=cn-gb; charset-edition=1980 + + Note: The "charset-edition" is a new MIME parameter described in + section 4.1 of the "Specification" part of this document. + + GB 12345-90 is the traditional form of GB 2312, the charset name + given to this set is CN-GB-12345 with the charset-edition of 1990. + + There are also character sets that can only be used with other GB + sets. For example, GB 8565-88 [GB-8565] is used with GB 2312 and + some other characters to form the ISO-IR-165 set (also known as GB + 2312 + GB 8565.2). ISO-IR-165 contains all characters from GB + 2312-80 as revised by GB 6345.1-86 and GB 8565.2-88. Its MIME + charset name is CN-GB-ISOIR165 with the charset-edition of 1992. + + + + +Zhu, et al Informational [Page 8] + +RFC 1922 Chinese Character Encoding March 1996 + + + CN-GB-12345 and CN-GB-ISOIR165 support ASCII in a similar manner to + CN-GB; the MSB of Chinese characters is set to 1 to distinguish from + ASCII. + + Note: There are some supplementary character sets in GB, i.e. GB + 7589-87, GB 7590-87, GB 13131-91 and GB 13132-91. Normally, they + won't be used independently without using GB-2312 or GB-12345, so + they are not necessarily to be registered. Characters in these + standards could be supported with ISO-2022-CN and ISO-2022-CN-EXT. + If, in the future, they need to be used with "charset" names, it + is the responsibility of any interested third party (the + standardization organization or anybody else) to write the + necessary documents and register the charset with the IANA. It is + encouraged that the charset names take the form of CN-GB-<number>, + such as CN-GB-12345, where <number> is the GB standard number. A + charset-edition should also be given. All CN-GB-<number> sets + should be coded in 8-bit in a similar fashion to CN-GB. + + To ensure interoperability, the CN-GB charset should be used whenever + possible instead of a CN-GB-<number> charset. + +2.2. CN-Big5 + + Big5 is a two-byte character set of traditional Chinese characters, + widely used in Taiwan and overseas. E-mail of CN-Big5 is sent in + this way: + + Big5 is used with ASCII. The MSB of ASCII characters is always 0. + The MSB of the first byte of a Big5 character is always 1; this + distinguishes it from an ASCII character. The second byte has 8 + significant bits. Therefore, CN-Big5 is an 8-bit encoding with a + 15-bit codespace. + + To use this character scheme with MIME, CN-Big5 is used as the value + for the charset parameter: + + Content-Type: text/plain; charset=cn-big5; charset-edition=1984 + + Note: The "charset-edition" is a new MIME parameter described in + section 4.1 of the "Specification" part of this document. + +3. Universal Multilingual Character Set: ISO/IEC-10646/Unicode + + ISO/IEC 10646 defines a 32bit character space with the intent to + encode all characters in the world. Currently, only the lowest 16bit + plane of ISO 10646, the Basic Multilingual Plane (BMP), is defined. + The BMP is code-by-code identical to Unicode [Unicode 1.1]. it + contains a large repertoire of Chinese characters (it currently + + + +Zhu, et al Informational [Page 9] + +RFC 1922 Chinese Character Encoding March 1996 + + + includes all the characters of GB 2312-80, GB 12345-90, GB 8565-89, + CNS 11643's plane 1 and 2, and part of some other standards) and + therefore can be used to transport Chinese characters in the Internet + community. This document does not give any details on how to do + this, as this has been done elsewhere. For details of using Unicode + with MIME, refer to RFC 1641 [RFC-1641], RFC 1642 [RFC-1642]. For + assigned names for 10646 set, refer to STD 2--"Assigned Numbers", + which is RFC 1700 [RFC-1700] currently. For more up-to-date assigned + numbers, please check: + + ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets + +4. Two New MIME parameters + + Here we define two new MIME parameters to be used with "charset" + parameters. + +4.1. "charset-edition" + + This parameter is used after the MIME "charset" parameter, using four + digits (AD) to indicate what the year of edition is for the character + set standard shown in "charset". Its use is optional. + Implementations should ignore this parameter unless the + implementation has specific support for that particular character set + edition. + + The reason for defining this parameter is that there are often + differences in the defined characters between editions of a character + set standard. Sometimes, the difference can not be ignored, + otherwise implementations would have problems when processing it. + There are only two ways to indicate this difference, in the current + MIME syntax. One way is to indicate the edition in the charset name, + such as CN-GB-1988-80 (the 1980's edition of GB 1988). The other way + is to define a new optional parameter such as "charset-edition". The + latter way is better because receiving applications that can only + process an older edition can still recognize the character set and + offer to display the text in the older edition. This display may + have a few mistakes, but it is better than refusing to display any + text at all or defaulting to an inappropriate character set such as + US-ASCII or ISO-8859-1. + +4.2. "charset-extension" + + This parameter is also used after the MIME "charset" parameter. It + is case-insensitive and optional, and any value of this parameter + should be registered in IANA. Unregistered value should start with + "x-" as with any MIME extension-token. Implementations should ignore + this parameter unless the implementation has specific support for + + + +Zhu, et al Informational [Page 10] + +RFC 1922 Chinese Character Encoding March 1996 + + + that particular character set extension. + + A character set extension has displayed glyphs for code points that + are not assigned in the character set, for example, vendor-specific + extensions of standard character sets. This parameter provides the + option of using these extensions. Although character set extensions + may cause interoperability problems, we recognize the existence of + such extensions. + + For example: + Content-Type: text/plain; charset=CN-Big5; charset-edition=1984; + charset-extension=ETen-2.00.03-DOS + + This may indicate Eten company's extension of Big5: ETen 2.00.03 for + DOS, assuming that "ETen-2.00.03-DOS" is registered with the IANA.. + +4.3. Formal Syntax: + + The following changes and additions are made to the MIME syntax: + + charset-edition := "charset-edition" "=" 4DIGIT + ; year of edition in four digits + + charset-extension := "charset-extension" "=" extension-token + +5. Background Information + +5.1. Writing systems and their encodings in Chinese-speaking nations and + regions + + The mainland provinces of China use simplified Chinese character in + daily life. GB is the standard electronic character set. It is the + main means for communications between people who share simplified + Chinese characters in the world. + + Taiwan uses traditional Chinese characters in daily life. CNS-11643 + is the formal character set for information interchange in Taiwan; + however, Big5, a widely-used character set of traditional Chinese + characters, is the de-facto internal code standard in Taiwan. + + Hong Kong uses traditional Chinese characters in daily life, but uses + both GB and Big5 in electronic form, because Hong Kong people often + communicate with people in all of China's provinces. + + Singapore seldom uses Chinese characters, and uses the simplified + form when Chinese characters are used. In electronic form, Unicode + is more popular, however GB is also used. + + + + +Zhu, et al Informational [Page 11] + +RFC 1922 Chinese Character Encoding March 1996 + + +5.2. Miscellaneous information about Chinese character sets + + The GB 1988-89 character set is identical to ISO 646 [ISO-646] except + for currency symbol and tilde. The currency symbol and the tilde are + replaced by the Yuan sign and the overline. This set is GB's variant + of ISO 646. This character set and CNS 5205 [CNS-5205] are not + encouraged for use in the Internet, since ASCII combined with GB 2312 + or CNS 11643-plane 1 and plane 2 contains all the characters in them. + + The GB 2312-80 character set consists of simplified Chinese + characters, digits, and the Latin, Greek and Russian alphabets, and + some other symbols; in all, 7445 characters. Each character is + represented with two bytes. + + GB 13000-95 [GB-13000] is GB's variant of ISO 10646. However, for + interoperability in the Internet, assigned names for ISO 10646 are + encouraged instead. + + Currently both sides of the Taiwan Straits are cooperating closely in + promoting the use of ISO 10646's BMP and in continuing its + development together with other organizations under ISO. + +5.3. Miscellaneous implementation information + + For maximum interoperability, implementations SHOULD at least support + sending and receiving ISO-2022-CN. Supporting all registered + character sets in ISO-2022-CN-EXT is greatly encouraged. + + To meet the current usage, support of CN-GB (the status quo for + simplified Chinese e-mail ) or CN-Big5 (the status quo for + traditional Chinese e-mail) may be necessary. However, it is not + reliable to send documents directly with these internal codes, + therefore sending ISO-2022-CN message is always encouraged whenever + possible. + + To the maximum extent possible, implementations should be capable of + receiving messages in any of the encodings described in this + document, even if they only transmit messages in one form. + + Preferably the implementation should display the characters with + glyphs appropriate to the typographic tradition that is implied in + the encoding of the received text. Implementation may also translate + these encodings to the encoding that its platform supports. + + The human user (not implementor) should try to keep lines within 80 + display columns, or, preferably, within 75 (or so) columns, to allow + insertion of ">" at the beginning of each line in excerpts. Each + Chinese character takes up two columns, and the shift sequences do + + + +Zhu, et al Informational [Page 12] + +RFC 1922 Chinese Character Encoding March 1996 + + + not take up any columns. The implementor is reminded that Chinese + characters take up two bytes and should not be split in the middle to + break lines for displaying, etc. + + Freely available fonts of Chinese characters: + + Beijing: + ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/fonts/ + + Xi'an: + ftp://ftp.xanet.edu.cn/pub/chinese-soft/fonts/ + + Taiwan: + ftp://ftp.edu.tw/Chinese/ifcss/software/fonts/ + ftp://ftp.ntu.edu.tw/Chinese/ifcss/software/fonts/ + + Hong Kong: + ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/fonts/ + + Singapore: + ftp://ftp.technet.sg:/pub/chinese/fonts/ + + US: + ftp://ftp.ifcss.org/pub/software/fonts/ + http://ccic.ifcss.org/www/pub/software/fonts/ + +6. X.400 Considerations + + X.400 has the ability of carrying different character sets in a + message by using the body part "GeneralText" defined by + ISO/IEC-10021-7 [ISO-10021]. + + The X.400 ASN.1 definition of the GeneralText body part is: + + general-text-body-part EXTENDED-BODY-PART-TYPE + PARAMETERS GeneralTextParameters IDENTIFIED BY id-ep-general-text + DATA GeneralTextData + ::= id-et-general-text + + GeneralTextParameters ::= SET OF CharacterSetRegistration + + CharacterSetRegistration ::= INTEGER (1..32767) + + GeneralTextData ::= GeneralString + + Therefore, to use ISO-2022-CN, set the "CharacterSetRegistration" + part as { 6 58 171 172 }, and add an ESC sequence of ESC ( B (three + bytes, hexadecimal values: 1B 28 42) before the beginning of each + + + +Zhu, et al Informational [Page 13] + +RFC 1922 Chinese Character Encoding March 1996 + + + line of ISO-2022-CN text. + + Similarly, to use ISO-2022-CN-EXT, set the registered numbers of all + character sets in the "CharacterSetRegistration" part and add ESC ( B + at the beginning of each line. For the registered numbers, please + refer to ISO registry. In addition to the character sets supported + by ISO-2022-CN, currently registered numbers are: + + ISO IR 165 (GB 2312+GB 8565.2): 165 + CNS 11643-plane 3: 183 + CNS 11643-plane 4: 184 + CNS 11643-plane 5: 185 + CNS 11643-plane 6: 186 + CNS 11643-plane 7: 187 + + 176 is the registered number for the BASESET of ISO/IEC 10646-1:1993 + UCS-2 with implementation level 3, Escape sequence of ESC % / E (four + bytes, hexadecimal values 1B 25 2F 45) indicates starting of this + codeset. + + For CN-GB and CN-Big5 character sets, there are no formal methods + that could be used in X.400 yet. + + For detail about X.400 use of character sets, please refer to RFC + 1502 [RFC-1502]. + + + + + + + + + + + + + + + + + + + + + + + + + + +Zhu, et al Informational [Page 14] + +RFC 1922 Chinese Character Encoding March 1996 + + +7. Formal Syntax of ISO-2022-CN and ISO-2022-CN-EXT + + The notational conventions used here are identical to those used in + RFC 822. + +7.1. Formal Syntax of ISO-2022-CN + + body ::= * ( ascii_line / c_line ) + + ascii_line ::= *char CRLF + + c_line ::= *char 1*(1*designation 1*(*char 1*c_text *char)) CRLF + + designation ::= SOdesignation / SS2designation + + SOdesignation ::= ESC "$" ")" finalchar_for_SO + + SS2designation ::= ESC "$" "*" finalchar_for_SS2 + + finalchar_for_SO ::= "A" / "G" + + finalchar_for_SS2 ::= "H" + + c_text ::= 1* ( SO-SI-segment / SS2segment ) + + SO-SI-segment ::= SO 1*c_char *designation *c_segment SI + + c_segment ::= 1* ( c_char / SS2segment ) + + SS2segment ::= SS2 c_char + + c_char ::= one_of_94 one_of_94 + + ; ( Octal, Decimal.) + + ESC ::= <ISO-646 ESC, escape> ; ( 33, 27.) + + SI ::= <ASCII SI, shift in> ; ( 17, 15.) + + SO ::= <ASCII SO, shift out> ; ( 16, 14.) + + SS2 ::= <ISO 2022 Single_shift two> ; ( 33 116, 27 78.) + + one_of_94 ::= <any char in 94_char set> ; ( 41-176, 33-126. ) + + char ::= <any char in 96_char_set> ; ( 40-177, 30-127. ) + + + + + +Zhu, et al Informational [Page 15] + +RFC 1922 Chinese Character Encoding March 1996 + + +7.2. Formal Syntax of ISO-2022-CN-EXT + + body ::= * ( ascii_line / c_line ) + + ascii_line ::= *char CRLF + + c_line ::= *char 1*(1*designation 1*(*char 1*c_text *char)) CRLF + + designation ::= SOdesignation / SS2designation / SS3designation + + SOdesignation ::= ESC "$" ")" finalchar_for_SO + + SS2designation ::= ESC "$" "*" finalchar_for_SS2 + + SS3designation ::= ESC "$" "+" finalchar_for_SS3 + + finalchar_for_SO ::= "A" / <X12345> / "G" / "E" + + finalchar_for_SS2 ::= <X7589> / <X13131> / "H" + + finalchar_for_SS3 ::= <X7590> / <X13132> / "I" / "J" / "K" / "L" + / "M" + + c_text ::= 1* ( SO-SI-segment / SS2segment / SS3segment ) + + SO-SI-segment ::= SO 1*c_char *designation *c_segment SI + + c_segment ::= 1* ( c_char / SS2segment / SS3segment ) + + SS2segment ::= SS2 c_char + + SS3segment ::= SS3 c_char + + c_char ::= one_of_94 one_of_94 + + ; ( Octal, Decimal.) + + ESC ::= <ISO-646 ESC, escape> ; ( 33, 27.) + + SI ::= <ASCII SI, shift in> ; ( 17, 15.) + + SO ::= <ASCII SO, shift out> ; ( 16, 14.) + + SS2 ::= <ISO 2022 Single_shift two> ; ( 33 116, 27 78.) + + SS3 ::= <ISO 2022 Single_shift three>; ( 33 117, 27 79.) + + one_of_94 ::= <any char in 94_char set> ; ( 41-176, 33-126. + + + +Zhu, et al Informational [Page 16] + +RFC 1922 Chinese Character Encoding March 1996 + + + ) + + char ::= <any char in 96_char_set> ; ( 40-177, 30-127. + ) + + +8. Registration of New "charset"s and New MIME parameter + +8.1. This document defines the following MIME "charset" names for + Chinese text: + + ISO-2022-CN, ISO-2022-CN-EXT + CN-GB, CN-Big5 + CN-GB-12345 + CN-GB-ISOIR165 + +8.2. This document defines two new MIME parameters: + + charset-edition + charset-extension + +Acknowledgments + + This document is the result of cooperation in APNG-CC, the Chinese + Character sub-working group of the I18N/L10N (Internationalization + and Localization) working group of APNG (Asia-Pacific Networking + Group), coordinator Zhu Haifeng <zhf@net.tsinghua.edu.cn>. The + membership of APNG-CC consists of individuals from both sides of the + Taiwan Strait, HongKong, and from Singapore and other countries. We + wish to thank all members of APNG-CC. + + Prof. Yao Shiquan (Deputy chair of CITS--China Information Technology + Standardization Technical Committee), Ms. Lin Ning (Secretary-General + of CITS), Mr. Guo Chengzhong of the Office of the Joint Conference of + China Economic Information, and Prof. Zhao Jingrong, Prof. Wu + Jianping, Prof. Li Xing, and Mr. You Yue (Tsinghua University) and + other experts from CERNET Expert Committee, Prof. Meng Qingyu (China + Computer Software & Technology Services Corporation), Prof. Cao + Jinwen and Mr. Yu Jun (IBM Beijing) gave a lot of support and help in + many aspects. + + Special thanks for the supports towards APNG-CC from Prof. Yang + Tianxing (Chair of CITS). + + Prof. Ding ZyKaan from Academia Sinica of Taiwan, and Mr. C. J. + Cherng and Mr. C. K. Fan of III (Institute for Information Industry), + Mr. Chang JingShin from Tsinghua University in Hsinchu of Taiwan, Ms. + C. C. Hsu from IBM Taiwan and Ms. Tong-Lee Anita Lin from Microsoft + + + +Zhu, et al Informational [Page 17] + +RFC 1922 Chinese Character Encoding March 1996 + + + Taiwan gave a lot of support and contributions in APNG-CC's work. In + particular, Ms. C. C. Hsu put much effort towards completing the + Appendix of this document. + + We also wish to thank the following people who contributed in many + ways towards this document. + + Zhang Zhoucai Martin J. Duerst + Zhang Ling Kenichi Handa + Zhu Bin Lu Chin + Sun Yufang Nelson Chin + Chen Shuyi Mao Yonggang + Masataka Ohta Ken Lunde + Lua Kim Teng Victor Cheng + Stephen G. Simpson Yuan Jiang + Liu Huifang Harald T. Alvestrand + Qian Hualin Jiang Lin + Lu Ming Emily Hsu + Wu Jian Zhu Shuang + Zheng Long Zhang Hailin + Yonggang Zhang Feng Hui + Yao Jian + +Security Considerations + + Security issues are not discussed in this memo. + +Authors' Addresses + + Zhu Haifeng (HF. Zhu) + 216 Central Main Building + Tsinghua University + Beijing, 100084 + China + + Tel: +86-10-2561144 ext. 3492 + Fax: +86-10-2564173 + EMail: zhf@net.tsinghua.edu.cn, zhf@net.edu.cn + + + + + + + + + + + + + +Zhu, et al Informational [Page 18] + +RFC 1922 Chinese Character Encoding March 1996 + + + Hu Daoyuan (DY. Hu) + Tsinghua Networking Center + Tsinghua University + Beijing, 100084 + China + + Tel: +86-10-2594016 + Fax: +86-10-2564173 + EMail: hdy@tsinghua.edu.cn + + + Wang Zhiguan (ZG. Wang) + Beijing 1101 MailBox + SubCommitte 2 (SC2) + China Information Technology Standardization Technical Committee + (CITS) + Beijing, 100007 + China + + Tel: +86-10-4012392 + Fax: +86-10-4010601 + + + Kao Tien-cheu (TC. Kao) + I.T. Promotion Division + Institute for Information Industry (III) + Taipei + Taiwan + + Tel: +886-2-5631688 + Fax: +886-2-563-4209 + EMail: tckao@iiidns.iii.org.tw + + + Chang Wen-chung (WCH. Chang) + Institute for Information Industry (III) + Taipei + Taiwan + + Tel: +886-2-7327771 + Fax: +886-2-7370188 + EMail: chung@iiidns.iii.org.tw + + + + + + + + + +Zhu, et al Informational [Page 19] + +RFC 1922 Chinese Character Encoding March 1996 + + + Mark R. Crispin + Networks and Distributed Computing + University of Washington + 4545 15th Avenue NE + Seattle, WA 98105-4527 + USA + + Tel: +1 (206) 543-5762 + Fax: +1 (206) 685-4045 + EMail: MRC@CAC.Washington.EDU + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Zhu, et al Informational [Page 20] + +RFC 1922 Chinese Character Encoding March 1996 + + +Appendix -- Conversion Table for ISO-2022-CN (EXT) and Big5 + + This is a conversion table for the Chinese characters in Big5's + common part and ISO-2022-CN/-EXT, including all the vendor-specific + characters from Eten, Microsoft and IBM. For conversion source and + binary programs for Big5, III provides good on-line services (ftp + site listed in section 1.4), and [CJKINF] is also a good reference. + +A.1. Big5 (ETen, IBM, and Microsoft version) symbol set correspondence + to CNS 11643 Plane 1: + + 0xA140-0xA1F5 <-> 0x2121-0x2256 + 0xA1F6 <-> 0x2258 + 0xA1F7 <-> 0x2257 + 0xA1F8-0xA2AE <-> 0x2259-0x234E + 0xA2AF-0xA3BF <-> 0x2421-0x2570 + 0xA3C0-0xA3E0 <-> 0x4221-0x4241 (ETen and Microsoft + defined as reserved area) + +A.2. Big5 (ETen, IBM, and Microsoft version) Level 1 correspondence to + CNS 11643-1992 Plane 1: + + 0xA440-0xACFD <-> 0x4421-0x5322 + 0xACFE <-> 0x5753 + 0xAD40-0xAFCF <-> 0x5323-0x5752 + 0xAFD0-0xBBC7 <-> 0x5754-0x6B4F + 0xBBC8-0xBE51 <-> 0x6B51-0x6F5B + 0xBE52 <-> 0x6B50 + 0xBE53-0xC1AA <-> 0x6F5C-0x7534 + 0xC1AB-0xC2CA <-> 0x7536-0x7736 + 0xC2CB <-> 0x7535 + 0xC2CC-0xC360 <-> 0x7737-0x782C + 0xC361-0xC3B8 <-> 0x782E-0x7863 + 0xC3B9 <-> 0x7865 + 0xC3BA <-> 0x7864 + 0xC3BB-0xC455 <-> 0x7866-0x7961 + 0xC456 <-> 0x782D + 0xC457-0xC67E <-> 0x7962-0x7D4B + +A.3. Big5 (ETen, IBM, and Microsoft version) Level 2 correspondence to + CNS 11643-1992 Plane 2: + + 0xC940-0xC949 <-> 0x2121-0x212A + 0xC94A <-> 0x4442 # duplicate of Level 1's 0xA461 + 0xC94B-0xC96B <-> 0x212B-0x214B + 0xC96C-0xC9BD <-> 0x214D-0x217C + 0xC9BE <-> 0x214C + 0xC9BF-0xC9EC <-> 0x217D-0x224C + + + +Zhu, et al Informational [Page 21] + +RFC 1922 Chinese Character Encoding March 1996 + + + 0xC9ED-0xCAF6 <-> 0x224E-0x2438 + 0xCAF7 <-> 0x224D + 0xCAF8-0xD779 <-> 0x2439-0x387D + 0xD77A <-> 0x3F6A + 0xD77B-0xDBA6 <-> 0x387E-0x3F69 + 0xDBA7-0xDDFB <-> 0x3F6B-0x4423 + 0xDDFC <-> 0x4176 # duplicate of 0xDCD1 + 0xDDFD-0xE8A2 <-> 0x4424-0x554A + 0xE8A3-0xE975 <-> 0x554C-0x5721 + 0xE976-0xEB5A <-> 0x5723-0x5A27 + 0xEB5B-0xEBF0 <-> 0x5A29-0x5B3E + 0xEBF1 <-> 0x554B + 0xEBF2-0xECDD <-> 0x5B3F-0x5C69 + 0xECDE <-> 0x5722 + 0xECDF-0xEDA9 <-> 0x5C6A-0x5D73 + 0xEDAA-0xEEEA <-> 0x5D75-0x6038 + 0xEEEB <-> 0x642F + 0xEEEC-0xF055 <-> 0x6039-0x6242 + 0xF056 <-> 0x5D74 + 0xF057-0xF0CA <-> 0x6243-0x6336 + 0xF0CB <-> 0x5A28 + 0xF0CC-0xF162 <-> 0x6337-0x642E + 0xF163-0xF16A <-> 0x6430-0x6437 + 0xF16B <-> 0x6761 + 0xF16C-0xF267 <-> 0x6438-0x6572 + 0xF268 <-> 0x6934 + 0xF269-0xF2C2 <-> 0x6573-0x664C + 0xF2C3-0xF374 <-> 0x664E-0x6760 + 0xF375-0xF465 <-> 0x6762-0x6933 + 0xF466-0xF4B4 <-> 0x6935-0x6961 + 0xF4B5 <-> 0x664D + 0xF4B6-0xF4FC <-> 0x6962-0x6A4A + 0xF4FD-0xF662 <-> 0x6A4C-0x6C51 + 0xF663 <-> 0x6A4B + 0xF664-0xF976 <-> 0x6C52-0x7165 + 0xF977-0xF9C3 <-> 0x7167-0x7233 + 0xF9C4 <-> 0x7166 + 0xF9C5 <-> 0x7234 + 0xF9C6 <-> 0x7240 + 0xF9C7-0xF9D1 <-> 0x7235-0x723F + 0xF9D2-0xF9D5 <-> 0x7241-0x7244 + + +A.4. Big5 (ETen and IBM Version) specific numeric symbols + correspondence to CNS 11643 Plane 1: (Microsoft version defined + this area as UDC - User Defined Character) + + + + + +Zhu, et al Informational [Page 22] + +RFC 1922 Chinese Character Encoding March 1996 + + + 0xC6A1-0xC6BE <-> 0x2621 - 0x263E + +A.5. Big5 (ETen and IBM Version) specific KangXi radicals + correspondence to CNS 11643 Plane 1: (Microsoft version defined as + UDC - User Definable Character) + + 0xC6BF <-> 0x2723 + 0xC6C0 <-> 0x2724 + 0xC6C1 <-> 0x2726 + 0xC6C2 <-> 0x2728 + 0xC6C3 <-> 0x272D + 0xC6C4 <-> 0x272E + 0xC6C5 <-> 0x272F + 0xC6C6 <-> 0x2734 + 0xC6C7 <-> 0x2737 + 0xC6C8 <-> 0x273A + 0xC6C9 <-> 0x273C + 0xC6CA <-> 0x2742 + 0xC6CB <-> 0x2747 + 0xC6CC <-> 0x274E + 0xC6CD <-> 0x2753 + 0xC6CE <-> 0x2754 + 0xC6CF <-> 0x2755 + 0xC6D0 <-> 0x2759 + 0xC6D1 <-> 0x275A + 0xC6D2 <-> 0x2761 + 0xC6D3 <-> 0x2766 + 0xC6D4 <-> 0x2829 + 0xC6D5 <-> 0x282A + 0xC6D6 <-> 0x2863 + 0xC6D7 <-> 0x286C + +A.6. Big5 (ETen and Microsoft version) specific Ideographs + correspondence to CNS 11643 Plane 3: (IBM version defined as UDC) + + 0xF9D6 <-> 0x4337 + 0xF9D7 <-> 0x4F50 + 0xF9D8 <-> 0x444E + 0xF9D9 <-> 0x504A + 0xF9DA <-> 0x2C5D + 0xF9DB <-> 0x3D7E + 0xF9DC <-> 0x4B5C + + +A.7. Big5 (ETen version only) specific symbols correspondence to CNS + 11643 Plane 4: + + 0xC879 <-> 0x2123 + + + +Zhu, et al Informational [Page 23] + +RFC 1922 Chinese Character Encoding March 1996 + + + 0xC87B <-> 0x2124 + 0xC87D <-> 0x212A + 0xC8A2 <-> 0x2152 + +A.8. Other Big5 specific symbols which cannot mapping to CNS 11643: + + 0xC6D8-0xC878 <-> none (ETen and IBM Version) + 0xC87A <-> none (ETen version only) + 0xC87C <-> none (ETen version only) + 0xC87E-0xC8A1 <-> none (ETen version only) + 0xC8A3-0xC8CC <-> none (ETen version only) + 0xC8CD-0xC8D3 <-> none (ETen and IBM version) + 0xF9DD-0xF9FE <-> none (ETen and Microsoft version) + + Note: However, most of them can be mapped to GB-2312 too. For + example, Big5(ETen and IBM version) Hiragana, Katakana, and + Cyrillic symbols correspondence to GB-2312: + + 0xC6E7-0xC77A <-> 0x2421-0x2473 # Japanese Hiragana + 0xC77B-0xC7F2 <-> 0x2521-0x2576 # Japanese Katakana + 0xC7F3-0xC854 <-> 0xA7A1-0xA7C1 # Cyrillic uppercase + 0xC855-0xC875 <-> 0xA7D1-0xA7F1 # Cyrillic lowercase + + Please notice that there are also many symbols that could be + supported by GB-2312, for detail, please refer to the ftp sites in + section 1.4 of the "Specification" part of this document. + + + + + + + + + + + + + + + + + + + + + + + + + +Zhu, et al Informational [Page 24] + +RFC 1922 Chinese Character Encoding March 1996 + + +References + + [ASCII] American National Standards Institute, "Coded character set: + 7-bit American National Standard Code for Information Interchange", + ANSI X3.4-1986. + + [BIG5] Institute for Information Industry, "Chinese Coded Character + Set in Computer ", March, 1984 + + [CJKINF] Ken Lunde, On-line documentation of Chinese/Japanese/Korean + Information Processing, 1995, available at: + ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf + + [CNS-5205] "Information processing: 7-Bit Coded Character Set For + Information Interchange", CNS-5205. + + [CNS-11643] "Chinese Standard Interchange Code", CNS-11643 version + 1992; "Standard Interchange Code for Generally-Used Chinese + Characters", CNS 11643 version 1986. + + [GB-1988] "7-bit Coding Character Set for Information Interchange", + GB 1988-89. + + [GB-2312] "Coding of Chinese Ideogram Set for Information Interchange + Basic Set", GB 2312-80. + + [GB-7589] "Code of Chinese Ideograms Set for Information Interchange, + the 2nd Supplementary Set", UDC 681.3.048, GB 7589-87. + + [GB-7590] "Code of Chinese Ideogram Set for Information Interchange, + the 4th Supplementary Set", UDC 681.3.048, GB 7590-87. + + [GB-8565] "Information Processing Coded Character Sets for Text + Communication", UDC 681.3, GB 8565-88. + + [GB-12345] "Code of Chinese Ideogram Set for Information Interchange + Supplementary Set", GB/T 12345-90. + + [GB-13000] "Information Technology: Universal Multiple-Octet Coded + Character Set(UCS) Part 1: Architecture and Basic Multilingual + Plane", GB13000.1 + + [GB-13131] "Code of Chinese Ideogram Set for Information Interchange, + the 3rd Supplementary Set", GB 13131-91. + + [GB-13132] "Code of Chinese Ideogram Set for Information Interchange, + the 5th Supplementary Set", GB 13132-91. + + + + +Zhu, et al Informational [Page 25] + +RFC 1922 Chinese Character Encoding March 1996 + + + [ISO-646] International Organization for Standardization (ISO), + "Information Technology: ISO 7-bit Coded Character Set for + Information Interchange", International Standard, Ref. No. ISO/IEC + 646:1991. + + [ISO-2022] International Organization for Standardization (ISO), + "Information Processing: ISO 7-bit and 8-bit coded character sets: + Code extension techniques", International Standard, Ref. No. ISO + 2022-1986 (E). + + [ISO-10021] Information Technology: Text communication: + Message-Oriented Text Interchange Systems (MOTIS), ISO 10021, October + 1988. + + [ISO-10646] ISO/IEC 10646-1:1993(E) Information Technology: Universal + Multiple-octet Coded Character Set (UCS) Part 1: Architecture and + Basic Multilingual Plane" + + [ISOREG] International Organization for Standardization (ISO), + "International Register of Coded Character Sets To Be Used With + Escape Sequences". + + [MIME-1] Borenstein, N., and Freed, N., "MIME (Multipurpose Internet + Mail Extensions) Part One: Mechanisms for Specifying and Describing + the Format of Internet Message Bodies", RFC 1521, Bellcore, Innosoft, + September 1993. + + [MIME-2] Moore, K., "MIME (Multipurpose Internet Mail Extensions) + Part Two: Message Header Extensions for Non-ASCII Text", RFC 1522, + University of Tennessee, September 1993. + + [RFC-822] Crocker, D., "Standard for the Format of ARPA Internet Text + Messages", STD 11, RFC 822, University of Delaware, August 1982. + + [RFC-854] Postel, J., Reynolds J., Telnet Protocol Specification, RFC + 854, ISI, May 1983. + + [RFC-1036] Horton, M., and Adams, R., "Standard for Interchange of + USENET Messages", RFC 1036, AT&T Bell Laboratories, Center for + Seismic Studies, December 1987. + + [RFC-1468] Murai J., Crispin, M., and van der Poel, E., Japanese + Character Encoding for Internet Messages, June 1993. + + [RFC-1557] Choi U., Chon K., and Park H., Korean Character Encoding + for Internet Messages, December 1993. + + + + + +Zhu, et al Informational [Page 26] + +RFC 1922 Chinese Character Encoding March 1996 + + + [RFC-1641] Goldsmith D., and Davis M., "Using Unicode with MIME", RFC + 1641, Taligent Inc., July 1994 + + [RFC-1642] Goldsmith D., and Davis M.," UTF-7, A Mail-Safe + Transformation Format of Unicode", July 1994 + + [RFC-1700] Reynolds J., and Postel J., "Assigned Numbers",RFC 1700, + STD 2, ISI, October 1994 + + [SMTP] Postel, J. B. "Simple Mail Transfer Protocol", STD 10, RFC + 821, USC/Information Sciences Institute, August 1982. + + [SMTPEXT] Klensin J., Freed N., Rose M., Stefferud E., and Crocker + D., "SMTP Service Extensions", RFC 1651, July 1994. + + [Unicode 1.1] "The Unicode Standard, Version 1.1", Addison-Wesley, + Reading, MA (to be published; the contents of this standard is + currently available by combining [Unicode92], [Unicode93], and + [Unicode4]). + + [Unicode92] The Unicode Consortium, "The Unicode Standard: Worldwide + Character Encoding: Version 1.0", Volume 1, Addison-Wesley, Reading, + MA, 1992 (ISBN 0-201-56788-1). + + [Unicode93] The Unicode Consortium, "The Unicode Standard: Worldwide + Character Encoding: Version 1.0", Volume 2, Addison-Wesley, Reading, + MA, 1992 (ISBN 0-201-60845-6). + + [Unicode4] The Unicode Consortium, "The Unicode Standard: Version 1.1 + (Prepublication Edition)", Unicode Technical Report #4 (avaliable + from the Unicode Consortium). + + + + + + + + + + + + + + + + + + + + +Zhu, et al Informational [Page 27] + |