Diffstat (limited to 'doc/rfc/rfc4042.txt')
-rw-r--r-- | doc/rfc/rfc4042.txt | 507 |
1 files changed, 507 insertions, 0 deletions
diff --git a/doc/rfc/rfc4042.txt b/doc/rfc/rfc4042.txt
new file mode 100644
index 0000000..d0a5d0f
--- /dev/null
+++ b/doc/rfc/rfc4042.txt
@@ -0,0 +1,507 @@
+
+
+
+
+
+
+Network Working Group                                         M. Crispin
+Request for Comments: 4042                             Panda Programming
+Category: Informational                                     1 April 2005
+
+
+                            UTF-9 and UTF-18
+              Efficient Transformation Formats of Unicode
+
+Status of This Memo
+
+   This memo provides information for the Internet community. It does
+   not specify an Internet standard of any kind. Distribution of this
+   memo is unlimited.
+
+Copyright Notice
+
+   Copyright (C) The Internet Society (2005).
+
+Abstract
+
+   ISO-10646 defines a large character set called the Universal
+   Character Set (UCS), which encompasses most of the world's writing
+   systems. The same set of codepoints is defined by Unicode, which
+   further defines additional character properties and other
+   implementation details. By policy of the relevant standardization
+   committees, changes to Unicode and amendments and additions to
+   ISO/IEC 10646 track each other, so that the character repertoires
+   and code point assignments remain in synchronization.
+
+   The current representation formats for Unicode (UTF-7, UTF-8, UTF-16)
+   are not storage and computation efficient on platforms that utilize
+   the 9 bit nonet as a natural storage unit instead of the 8 bit octet.
+
+   This document describes a transformation format of Unicode that takes
+   advantage of the nonet so that the format will be storage and
+   computation efficient.
+
+1. Introduction
+
+   A number of Internet sites utilize platforms that are not based upon
+   the traditional 8-bit byte or octet. One such platform is the
+   PDP-10, which is based upon a 36-bit word. On these platforms, it is
+   wasteful to represent data in octets, since 4 bits are left unused in
+   each word. The 9-bit nonet is a much more sensible representation.
+
+   Although these platforms support IETF standards, many of these
+   platforms still utilize a text representation based upon the septet,
+
+
+
+
+Crispin                      Informational                      [Page 1]
+
+RFC 4042                    UTF-9 and UTF-18                1 April 2005
+
+
+   which is only suitable for [US-ASCII] (although it has been used for
+   various ISO 646 national variants).
+
+   To maximize international and multi-lingual interoperability, the IAB
+   has recommended ([IAB-CHARACTER]) that [ISO-10646] be the default
+   coded character set.
+
+   Although other transformation formats of [UNICODE] exist, and
+   conceivably can be used on nonet-oriented machines (most notably
+   [UTF-8]), they suffer significant disadvantages:
+
+      [UTF-8]
+         requires one to three octets to represent codepoints in the
+         Basic Multilingual Plane (BMP), four octets to represent
+         [UNICODE] codepoints outside the BMP, and up to six octets to
+         represent non-[UNICODE] codepoints. When stored in nonets,
+         this results in as many as four wasted bits per [UNICODE]
+         character.
+
+      [UTF-16]
+         requires a hexadecet to represent codepoints in the BMP, and
+         two hexadecets to represent [UNICODE] codepoints outside the
+         BMP. When stored in nonet pairs, this results in as many as
+         four wasted bits per [UNICODE] character. This transformation
+         format requires complex surrogates to represent codepoints
+         outside the BMP, and can not represent non-[UNICODE] codepoints
+         at all.
+
+      [UTF-7]
+         requires one to five septets to represent codepoints in the
+         BMP, and as many as eight septets to represent codepoints
+         outside the BMP. When stored in nonets, this results in as
+         many as sixteen wasted bits per character. This transformation
+         format requires very complex and computationally expensive
+         shifting and "modified BASE64" processing, and can not
+         represent non-[UNICODE] codepoints at all.
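The wasted-bit figures in the list above follow from simple arithmetic: each code unit is padded out to a whole number of nonets. The following is a minimal sketch in C of that arithmetic; the helper name wasted_bits is hypothetical and is not part of this memo.

```c
/* Hypothetical helper, not from the memo: bits left unused when `units`
 * code units of `unit_bits` bits each are stored in whole nonets (one
 * nonet per unit, or a group of nonets when a unit exceeds 9 bits). */
unsigned wasted_bits(unsigned unit_bits, unsigned units)
{
    unsigned nonets_per_unit = (unit_bits + 8) / 9;  /* round up */
    return units * (nonets_per_unit * 9 - unit_bits);
}
```

Under this model, wasted_bits(8, 4) gives 4 for a four-octet UTF-8 sequence, wasted_bits(16, 2) gives 4 for a pair of UTF-16 hexadecets, and wasted_bits(7, 8) gives 16 for an eight-septet UTF-7 sequence.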
+
+   By comparison, UTF-9 uses one to two nonets to represent codepoints
+   in the BMP, three nonets to represent [UNICODE] codepoints outside
+   the BMP, and three or four nonets to represent non-[UNICODE]
+   codepoints. There are no wasted bits, and as the examples in this
+   document demonstrate, the computational processing is minimal.
+
+   Transformation between [UTF-8] and UTF-9 is straightforward, with
+   most of the complexity in the handling of [UTF-8]. It is hoped that
+   future extensions to protocols such as SMTP will permit the use of
+   UTF-9 in these protocols between nonet platforms without the use of
+   [UTF-8] as an "on the wire" format.
+
+   Similarly, transformation between [UNICODE] codepoints and UTF-18 is
+   also quite simple. Although (like UCS-2) UTF-18 only represents a
+   subset of the available [UNICODE] codepoints, it encompasses the
+   non-private codepoints that are currently assigned in [UNICODE].
+
+1.1. Conventions Used in This Document
+
+   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+   document are to be interpreted as described in BCP 14, RFC 2119
+   [KEYWORDS].
+
+2. Overview
+
+   UTF-9 encodes [UNICODE] codepoints in the low order 8 bits of a
+   nonet, using the high order bit to indicate continuation. Surrogates
+   are not used.
+
+   [UNICODE] codepoints in the range U+0000 - U+00FF ([US-ASCII] and
+   Latin 1) are represented by a single nonet; codepoints in the range
+   U+0100 - U+FFFF (the remainder of the BMP) are represented by two
+   nonets; and codepoints in the range U+10000 - U+10FFFF (remainder of
+   [UNICODE]) are represented by three nonets.
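The range-to-length rule in the preceding paragraph can be sketched as a small C function. This is an illustration only; the name utf9_length and the use of uint32_t for a UCS-4 value are assumptions, not part of this memo.

```c
#include <stdint.h>

/* Hypothetical helper, not from the memo: number of nonets UTF-9 needs
 * for a codepoint, one nonet per octet counting down from the most
 * significant non-zero octet. */
int utf9_length(uint32_t cp)
{
    if (cp <= 0xff)     return 1;   /* US-ASCII and Latin 1 */
    if (cp <= 0xffff)   return 2;   /* remainder of the BMP */
    if (cp <= 0xffffff) return 3;   /* covers U+10000 - U+10FFFF */
    return 4;                       /* larger UCS-4 values */
}
```

The last two cases also account for the "three or four nonets" needed for UCS-4 values beyond U+10FFFF.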
+
+   Non-[UNICODE] codepoints in [ISO-10646] (that is, codepoints in the
+   range 0x110000 - 0x7fffffff) can also be represented in UTF-9 by
+   obvious extension, but this is not discussed further as these
+   codepoints have been removed from [ISO-10646] by ISO.
+
+   UTF-18 encodes [UNICODE] codepoints in the Basic Multilingual Plane
+   (BMP, plane 0), Supplementary Multilingual Plane (SMP, plane 1),
+   Supplementary Ideographic Plane (SIP, plane 2), and Supplementary
+   Special-purpose Plane (SSP, plane 14) in a single 18-bit value. It
+   does not encode planes 3 through 13, which are currently unused; nor
+   planes 15 or 16, which are private spaces.
+
+   Normally, UTF-9 and UTF-18 should only be used in the context of 9
+   bit storage and transport. Although some protocols, e.g., [FTP],
+   support transport of nonets, the current IETF protocol suite is quite
+   deficient in this area. The IETF is urged to take action to improve
+   IETF protocol support for nonets.
+
+3. UTF-9 Definition
+
+   A UTF-9 stream represents [ISO-10646] codepoints using 9 bit nonets.
+   The low order 8 bits of a nonet form an octet, and the high order bit
+   indicates continuation.
+
+   UTF-9 does not use surrogates; consequently a UTF-16 value must be
+   transformed into the UCS-4 equivalent, and U+D800 - U+DFFF are never
+   transmitted in UTF-9.
+
+   Octets of the [UNICODE] codepoint value are then copied into
+   successive UTF-9 nonets, starting with the most-significant non-zero
+   octet. All but the least significant octet have the continuation bit
+   set in the associated nonet.
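The copying rule just described can be sketched as a short loop in C. This is a minimal illustration under stated assumptions, not the memo's own routine (the sample routines appear in Section 5): the name utf9_encode is hypothetical, and each nonet is held in the low 9 bits of a uint16_t because C has no 9-bit type.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the memo: writes the UTF-9 form of cp
 * into out (room for 4 nonets) and returns the number of nonets
 * written. No validity checking is done. */
size_t utf9_encode(uint32_t cp, uint16_t out[4])
{
    size_t n = 0;
    int shift = 24;

    /* Skip leading zero octets, but always keep the last octet. */
    while (shift > 0 && ((cp >> shift) & 0xff) == 0)
        shift -= 8;
    /* Every octet except the last carries the continuation bit. */
    for (; shift > 0; shift -= 8)
        out[n++] = (uint16_t)(0x100 | ((cp >> shift) & 0xff));
    out[n++] = (uint16_t)(cp & 0xff);
    return n;
}
```

For U+0391 the most significant non-zero octet is 0x03, so the sketch emits 0x103 and 0x91, which are 403 and 221 in octal.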
+
+   Examples:
+
+   Character  Name                                UTF-9 (in octal)
+   ---------  ----                                ----------------
+   U+0041     LATIN CAPITAL LETTER A              101
+   U+00C0     LATIN CAPITAL LETTER A WITH GRAVE   300
+   U+0391     GREEK CAPITAL LETTER ALPHA          403 221
+   U+611B     <CJK ideograph meaning "love">      541 33
+   U+10330    GOTHIC LETTER AHSA                  401 403 60
+   U+E0041    TAG LATIN CAPITAL LETTER A          416 400 101
+   U+10FFFD   <Plane 16 Private Use, Last>        420 777 375
+   0x345ecf1b (UCS-4 value not in [UNICODE])      464 536 717 33
+
+4. UTF-18 Definition
+
+   A UTF-18 stream represents [ISO-10646] codepoints using a pair of 9
+   bit nonets to form an 18-bit value.
+
+   UTF-18 does not use surrogates; consequently a UTF-16 value must be
+   transformed into the UCS-4 equivalent, and U+D800 - U+DFFF are never
+   transmitted in UTF-18.
+
+   [UNICODE] codepoint values in the range U+0000 - U+2FFFF are copied
+   as the same value into a UTF-18 value. [UNICODE] codepoint values in
+   the range U+E0000 - U+EFFFF are copied as values 0x30000 - 0x3ffff;
+   that is, 0xb0000 is subtracted from these values. Other codepoint
+   values can not be represented in UTF-18.
+
+   Examples:
+
+   Character  Name                                UTF-18 (in octal)
+   ---------  ----                                -----------------
+   U+0041     LATIN CAPITAL LETTER A              000101
+   U+00C0     LATIN CAPITAL LETTER A WITH GRAVE   000300
+   U+0391     GREEK CAPITAL LETTER ALPHA          001621
+   U+611B     <CJK ideograph meaning "love">      060433
+   U+10330    GOTHIC LETTER AHSA                  201460
+   U+E0041    TAG LATIN CAPITAL LETTER A          600101
+
+5. Sample Routines
+
+5.1. UTF-9 to UCS-4 Conversion
+
+   The following routines demonstrate conversion from UTF-9 to UCS-4.
+   For simplicity, these routines do not do any validity checking.
+   Routines used in applications SHOULD reject invalid UTF-9 sequences;
+   that is, the first nonet with a value of 400 octal (0x100), or
+   sequences that result in an overflow (exceeding 0x10ffff for
+   [UNICODE]), or codepoints used for UTF-16 surrogates.
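The checks named in the paragraph above are left out of the memo's sample routines. As a rough sketch only, a decoder that applies them might look like the following in C. The name utf9_decode_checked, the explicit length argument, and the use of uint16_t to hold a nonet are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical validating decoder, not from the memo. Reads one
 * codepoint from in[0..len), stores it in *cp, and returns the number
 * of nonets consumed, or 0 if the sequence is rejected. */
size_t utf9_decode_checked(const uint16_t *in, size_t len, uint32_t *cp)
{
    uint32_t value = 0;
    size_t i = 0;

    if (len == 0 || in[0] == 0x100)       /* empty, or leading zero octet */
        return 0;
    for (;;) {
        if (in[i] > 0x1ff)                /* not a 9-bit value */
            return 0;
        value = (value << 8) | (in[i] & 0xff);
        if (value > 0x10ffff)             /* overflow beyond [UNICODE] */
            return 0;
        if (!(in[i++] & 0x100))           /* no continuation: last nonet */
            break;
        if (i == len)                     /* continuation bit, no data */
            return 0;
    }
    if (value >= 0xd800 && value <= 0xdfff)   /* UTF-16 surrogates */
        return 0;
    *cp = value;
    return i;
}
```

Checking for overflow after every octet keeps the accumulated value small enough that the next 8-bit shift cannot wrap a 32-bit integer.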
+
+   ; Return UCS-4 value from UTF-9 string (PDP-10 assembly version)
+   ; Accepts: P1/ 9-bit byte pointer to UTF-9 string
+   ; Returns +1: Always, T1/ UCS-4 value, P1/ updated byte pointer
+   ; Clobbers T2
+
+   UT92U4: TDZA T1,T1              ; start with zero
+   U92U41: XOR T1,T2               ; insert octet into UCS-4 value
+           LSH T1,^D8              ; shift UCS-4 value
+           ILDB T2,P1              ; get next nonet
+           TRZE T2,400             ; extract octet, any continuation?
+            JRST U92U41            ; yes, continue
+           XOR T1,T2               ; insert final octet
+           POPJ P,
+
+   /* Return UCS-4 value from UTF-9 string (C version)
+    * Accepts: pointer to pointer to UTF-9 string
+    * Returns: UCS-4 character, nonet pointer updated
+    * Note: UINT9 and UINT31 are not defined in this document; they
+    *       stand for unsigned types of at least 9 and 31 bits.
+    */
+
+   UINT31 UTF9_to_UCS4 (UINT9 **utf9PP)
+   {
+     UINT9 nonet;
+     UINT31 ucs4;
+     for (ucs4 = (nonet = *(*utf9PP)++) & 0xff;
+          nonet & 0x100;
+          ucs4 |= (nonet = *(*utf9PP)++) & 0xff)
+       ucs4 <<= 8;
+     return ucs4;
+   }
+
+5.2. UCS-4 to UTF-9 Conversion
+
+   The following routines demonstrate conversion from UCS-4 to UTF-9.
+   For simplicity, these routines do not do any validity checking.
+   Routines used in applications SHOULD reject invalid UCS-4 codepoints;
+   that is, codepoints used for UTF-16 surrogates or codepoints with
+   values exceeding 0x10ffff for [UNICODE].
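The rejection rule in the paragraph above amounts to a two-part range test that an application could run before encoding. A minimal sketch in C follows; the name ucs4_is_encodable is hypothetical and not part of this memo.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate, not from the memo: true if cp is a [UNICODE]
 * codepoint that an application should agree to encode. */
bool ucs4_is_encodable(uint32_t cp)
{
    if (cp > 0x10ffff)
        return false;                     /* beyond [UNICODE] */
    if (cp >= 0xd800 && cp <= 0xdfff)
        return false;                     /* UTF-16 surrogates */
    return true;
}
```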
+
+   ; Write UCS-4 character to UTF-9 string (PDP-10 assembly version)
+   ; Accepts: P1/ 9-bit byte pointer to UTF-9 string
+   ;          T1/ UCS-4 character to write
+   ; Returns +1: Always, P1/ updated byte pointer
+   ; Clobbers T1, T2; (T1, T2) must be an accumulator pair
+
+   U42UT9: SETO T2,                ; we'll need some of these 1-bits later
+           ASHC T1,-^D8            ; low octet becomes nonet with high 0-bit
+   U42U91: JUMPE T1,U42U9X         ; done if no more octets
+           LSHC T1,-^D8            ; shift next octet into T2
+           ROT T2,-1               ; turn it into nonet with high 1 bit
+           PUSHJ P,U42U91          ; recurse for remainder
+   U42U9X: LSHC T1,^D9             ; get next nonet back from T2
+           IDPB T1,P1              ; write nonet
+           POPJ P,
+
+   /* Write UCS-4 character to UTF-9 string (C version)
+    * Accepts: pointer to nonet string
+    *          UCS-4 character to write
+    * Returns: updated pointer
+    * Note: UINT9 and UINT31 are not defined in this document; they
+    *       stand for unsigned types of at least 9 and 31 bits.
+    */
+
+   UINT9 *UCS4_to_UTF9 (UINT9 *utf9P,UINT31 ucs4)
+   {
+     /* >= so that 0x100, 0x10000 and 0x1000000 get their high octet */
+     if (ucs4 >= 0x100) {
+       if (ucs4 >= 0x10000) {
+         if (ucs4 >= 0x1000000)
+           *utf9P++ = 0x100 | ((ucs4 >> 24) & 0xff);
+         *utf9P++ = 0x100 | ((ucs4 >> 16) & 0xff);
+       }
+       *utf9P++ = 0x100 | ((ucs4 >> 8) & 0xff);
+     }
+     *utf9P++ = ucs4 & 0xff;
+     return utf9P;
+   }
+
+6. Implementation Experience
+
+   As the sample routines demonstrate, it is quite simple to implement
+   UTF-9 and UTF-18 on a nonet-based architecture. More sophisticated
+   routines can be found in ftp://panda.com/tops-20/utools.mac.txt or
+   from lingling.panda.com via the file <UTF9>UTOOLS.MAC via ANONYMOUS
+   [FTP].
+
+   We are now in the process of implementing support for nonet-based
+   text files and automated transformation between septet, octet, and
+   nonet textual data.
+
+7. References
+
+7.1. Normative References
+
+   [FTP]           Postel, J. and J. Reynolds, "File Transfer Protocol",
+                   STD 9, RFC 959, October 1985.
+
+   [IAB-CHARACTER] Weider, C., Preston, C., Simonsen, K., Alvestrand,
+                   H., Atkinson, R., Crispin, M., and P. Svanberg, "The
+                   Report of the IAB Character Set Workshop held 29
+                   February - 1 March, 1996", RFC 2130, April 1997.
+
+   [ISO-10646]     International Organization for Standardization,
+                   "Information Technology - Universal Multiple-octet
+                   coded Character Set (UCS)", ISO/IEC Standard 10646,
+                   comprised of ISO/IEC 10646-1:2000, "Information
+                   technology - Universal Multiple-Octet Coded Character
+                   Set (UCS) - Part 1: Architecture and Basic
+                   Multilingual Plane", ISO/IEC 10646-2:2001,
+                   "Information technology - Universal Multiple-Octet
+                   Coded Character Set (UCS) - Part 2: Supplementary
+                   Planes" and ISO/IEC 10646-1:2000/Amd 1:2002,
+                   "Mathematical symbols and other characters".
+
+   [KEYWORDS]      Bradner, S., "Key words for use in RFCs to Indicate
+                   Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+   [UNICODE]       The Unicode Consortium, "The Unicode Standard -
+                   Version 3.2", defined by The Unicode Standard,
+                   Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN
+                   0-201-61633-5), as amended by the Unicode Standard
+                   Annex #27: Unicode 3.1 and by the Unicode Standard
+                   Annex #28: Unicode 3.2, March 2002.
+
+7.2. Informative References
+
+   [US-ASCII]      American National Standards Institute, "Coded
+                   Character Set - 7-bit American Standard Code for
+                   Information Interchange", ANSI X3.4, 1986.
+
+   [UTF-16]        Hoffman, P. and F. Yergeau, "UTF-16, an encoding of
+                   ISO 10646", RFC 2781, February 2000.
+
+   [UTF-7]         Goldsmith, D. and M. Davis, "UTF-7 A Mail-Safe
+                   Transformation Format of Unicode", RFC 2152, May
+                   1997.
+
+   [UTF-8]         Yergeau, F., "UTF-8, a transformation format of ISO
+                   10646", RFC 2279, January 1998.
+
+8. Security Considerations
+
+   As with UTF-8, UTF-9 can represent codepoints that are not in
+   [UNICODE].
+   Applications should validate UTF-9 strings to ensure that
+   no codepoint exceeds the [UNICODE] maximum of U+10FFFF.
+
+   The sample routines in this document are for example purposes, and
+   make no attempt to validate their arguments, e.g., test for overflow
+   ([UNICODE] values greater than 0x10ffff) or codepoints used for
+   surrogates. Besides resulting in invalid data, this can also create
+   covert channels.
+
+9. IANA Considerations
+
+   The IANA shall reserve the charset names "UTF-9" and "UTF-18" for
+   future assignment.
+
+Author's Address
+
+   Mark R. Crispin
+   Panda Programming
+   6158 NE Lariat Loop
+   Bainbridge Island, WA 98110-2098
+
+   Phone: (206) 842-2385
+   EMail: UTF9@Lingling.Panda.COM
+
+Full Copyright Statement
+
+   Copyright (C) The Internet Society (2005).
+
+   This document is subject to the rights, licenses and restrictions
+   contained in BCP 78 and at www.rfc-editor.org/copyright.html, and
+   except as set forth therein, the authors retain all their rights.
+
+   This document and the information contained herein are provided on an
+   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
+   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
+   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
+   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+   The IETF takes no position regarding the validity or scope of any
+   Intellectual Property Rights or other rights that might be claimed to
+   pertain to the implementation or use of the technology described in
+   this document or the extent to which any license under such rights
+   might or might not be available; nor does it represent that it has
+   made any independent effort to identify any such rights. Information
+   on the procedures with respect to rights in RFC documents can be
+   found in BCP 78 and BCP 79.
+
+   Copies of IPR disclosures made to the IETF Secretariat and any
+   assurances of licenses to be made available, or the result of an
+   attempt made to obtain a general license or permission for the use of
+   such proprietary rights by implementers or users of this
+   specification can be obtained from the IETF on-line IPR repository at
+   http://www.ietf.org/ipr.
+
+   The IETF invites any interested party to bring to its attention any
+   copyrights, patents or patent applications, or other proprietary
+   rights that may cover technology that may be required to implement
+   this standard. Please address the information to the IETF at
+   ietf-ipr@ietf.org.
+
+Acknowledgement
+
+   Funding for the RFC Editor function is currently provided by the
+   Internet Society.