summaryrefslogtreecommitdiff
path: root/doc/rfc/rfc4042.txt
diff options
context:
space:
mode:
authorThomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
committerThomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
commit4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
treee3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc4042.txt
parentea76e11061bda059ae9f9ad130a9895cc85607db (diff)
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc4042.txt')
-rw-r--r--doc/rfc/rfc4042.txt507
1 files changed, 507 insertions, 0 deletions
diff --git a/doc/rfc/rfc4042.txt b/doc/rfc/rfc4042.txt
new file mode 100644
index 0000000..d0a5d0f
--- /dev/null
+++ b/doc/rfc/rfc4042.txt
@@ -0,0 +1,507 @@
+
+
+
+
+
+
+Network Working Group M. Crispin
+Request for Comments: 4042 Panda Programming
+Category: Informational 1 April 2005
+
+
+ UTF-9 and UTF-18
+ Efficient Transformation Formats of Unicode
+
+Status of This Memo
+
+ This memo provides information for the Internet community. It does
+ not specify an Internet standard of any kind. Distribution of this
+ memo is unlimited.
+
+Copyright Notice
+
+ Copyright (C) The Internet Society (2005).
+
+Abstract
+
+ ISO-10646 defines a large character set called the Universal
+ Character Set (UCS), which encompasses most of the world's writing
+ systems. The same set of codepoints is defined by Unicode, which
+ further defines additional character properties and other
+ implementation details. By policy of the relevant standardization
+ committees, changes to Unicode and amendments and additions to
+ ISO/IEC 646 track each other, so that the character repertoires and
+ code point assignments remain in synchronization.
+
+ The current representation formats for Unicode (UTF-7, UTF-8, UTF-16)
+ are not storage and computation efficient on platforms that utilize
+ the 9 bit nonet as a natural storage unit instead of the 8 bit octet.
+
+ This document describes a transformation format of Unicode that takes
+ advantage of the nonet so that the format will be storage and
+ computation efficient.
+
+1. Introduction
+
+ A number of Internet sites utilize platforms that are not based upon
+ the traditional 8-bit byte or octet. One such platform is the PDP-
+ 10, which is based upon a 36-bit word. On these platforms, it is
+ wasteful to represent data in octets, since 4 bits are left unused in
+ each word. The 9-bit nonet is a much more sensible representation.
+
+ Although these platforms support IETF standards, many of these
+ platforms still utilize a text representation based upon the septet,
+
+
+
+
+Crispin Informational [Page 1]
+
+RFC 4042 UTF-9 and UTF-18 1 April 2005
+
+
+ which is only suitable for [US-ASCII] (although it has been used for
+ various ISO 10646 national variants).
+
+ To maximize international and multi-lingual interoperability, the IAB
+ has recommended ([IAB-CHARACTER]) that [ISO-10646] be the default
+ coded character set.
+
+ Although other transformation formats of [UNICODE] exist, and
+ conceivably can be used on nonet-oriented machines (most notably
+ [UTF-8]), they suffer significant disadvantages:
+
+ [UTF-8]
+ requires one to three octets to represent codepoints in the
+ Basic Multilingual Plane (BMP), four octets to represent
+ [UNICODE] codepoints outside the BMP, and six octets to
+ represent non-[UNICODE] codepoints. When stored in nonets,
+ this results in as many as four wasted bits per [UNICODE]
+ character.
+
+ [UTF-16]
+ requires a hexadecet to represent codepoints in the BMP, and
+ two hexadecets to represent [UNICODE] codepoints outside the
+ BMP. When stored in nonet pairs, this results in as many as
+ four wasted bits per [UNICODE] character. This transformation
+ format requires complex surrogates to represent codepoints
+ outside the BMP, and can not represent non-[UNICODE] codepoints
+ at all.
+
+ [UTF-7]
+ requires one to five septets to represent codepoints in the
+ BMP, and as many as eight septets to represent codepoints
+ outside the BMP. When stored in nonets, this results in as
+ many as sixteen wasted bits per character. This transformation
+ format requires very complex and computationally expensive
+ shifting and "modified BASE64" processing, and can not
+ represent non-[UNICODE] codepoints at all.
+
+ By comparison, UTF-9 uses one to two nonets to represent codepoints
+ in the BMP, three nonets to represent [UNICODE] codepoints outside
+ the BMP, and three or four nonets to represent non-[UNICODE]
+ codepoints. There are no wasted bits, and as the examples in this
+ document demonstrate, the computational processing is minimal.
+
+ Transformation between [UTF-8] and UTF-9 is straightforward, with
+ most of the complexity in the handling of [UTF-8]. It is hoped that
+ future extensions to protocols such as SMTP will permit the use of
+ UTF-9 in these protocols between nonet platforms without the use of
+ [UTF-8] as an "on the wire" format.
+
+
+
+Crispin Informational [Page 2]
+
+RFC 4042 UTF-9 and UTF-18 1 April 2005
+
+
+ Similarly, transformation between [UNICODE] codepoints and UTF-18 is
+ also quite simple. Although (like UCS-2) UTF-18 only represents a
+ subset of the available [UNICODE] codepoints, it encompasses the
+ non-private codepoints that are currently assigned in [UNICODE].
+
+1.1. Conventions Used in This Document
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+ document are to be interpreted as described in BCP 14, RFC 2119
+ [KEYWORDS].
+
+2. Overview
+
+ UTF-9 encodes [UNICODE] codepoints in the low order 8 bits of a
+ nonet, using the high order bit to indicate continuation. Surrogates
+ are not used.
+
+ [UNICODE] codepoints in the range U+0000 - U+00FF ([US-ASCII] and
+ Latin 1) are represented by a single nonet; codepoints in the range
+ U+0100 - U+FFFF (the remainder of the BMP) are represented by two
+ nonets; and codepoints in the range U+1000 - U+10FFFF (remainder of
+ [UNICODE]) are represented by three nonets.
+
+ Non-[UNICODE] codepoints in [ISO-10646] (that is, codepoints in the
+ range 0x110000 - 0x7fffffff) can also be represented in UTF-9 by
+ obvious extension, but this is not discussed further as these
+ codepoints have been removed from [ISO-10646] by ISO.
+
+ UTF-18 encodes [UNICODE] codepoints in the Basic Multilingual Plane
+ (BMP, plane 0), Supplementary Multilingual Plane (SMP, plane 1),
+ Supplementary Ideographic Plane (SIP, plane 2), and Supplementary
+ Special-purpose Plane (SSP, plane 14) in a single 18-bit value. It
+ does not encode planes 3 though 13, which are currently unused; nor
+ planes 15 or 16, which are private spaces.
+
+ Normally, UTF-9 and UTF-18 should only be used in the context of 9
+ bit storage and transport. Although some protocols, e.g., [FTP],
+ support transport of nonets, the current IETF protocol suite is quite
+ deficient in this area. The IETF is urged to take action to improve
+ IETF protocol support for nonets.
+
+3. UTF-9 Definition
+
+ A UTF-9 stream represents [ISO-10646] codepoints using 9 bit nonets.
+ The low order 8-bits of a nonet is an octet, and the high order bit
+ indicates continuation.
+
+
+
+
+Crispin Informational [Page 3]
+
+RFC 4042 UTF-9 and UTF-18 1 April 2005
+
+
+ UTF-9 does not use surrogates; consequently a UTF-16 value must be
+ transformed into the UCS-4 equivalent, and U+D800 - U+DBFF are never
+ transmitted in UTF-9.
+
+ Octets of the [UNICODE] codepoint value are then copied into
+ successive UTF-9 nonets, starting with the most-significant non-zero
+ octet. All but the least significant octet have the continuation bit
+ set in the associated nonet.
+
+ Examples:
+
+ Character Name UTF-9 (in octal)
+ --------- ---- ----------------
+ U+0041 LATIN CAPITAL LETTER A 101
+ U+00C0 LATIN CAPITAL LETTER A WITH GRAVE 300
+ U+0391 GREEK CAPITAL LETTER ALPHA 403 221
+ U+611B <CJK ideograph meaning "love"> 541 33
+ U+10330 GOTHIC LETTER AHSA 401 403 60
+ U+E0041 TAG LATIN CAPITAL LETTER A 416 400 101
+ U+10FFFD <Plane 16 Private Use, Last> 420 777 375
+ 0x345ecf1b (UCS-4 value not in [UNICODE]) 464 536 717 33
+
+4. UTF-18 Definition
+
+ A UTF-18 stream represents [ISO-10646] codepoints using a pair of 9
+ bit nonets to form an 18-bit value.
+
+ UTF-18 does not use surrogates; consequently a UTF-16 value must be
+ transformed into the UCS-4 equivalent, and U+D800 - U+DBFF are never
+ transmitted in UTF-18.
+
+ [UNICODE] codepoint values in the range U+0000 - U+2FFFF are copied
+ as the same value into a UTF-18 value. [UNICODE] codepoint values in
+ the range U+E0000 - U+EFFFF are copied as values 0x30000 - 0x3ffff;
+ that is, these values are shifted by 0x70000. Other codepoint values
+ can not be represented in UTF-18.
+
+ Examples:
+
+ Character Name UTF-18 (in octal)
+ --------- ---- ----------------
+ U+0041 LATIN CAPITAL LETTER A 000101
+ U+00C0 LATIN CAPITAL LETTER A WITH GRAVE 000300
+ U+0391 GREEK CAPITAL LETTER ALPHA 001621
+ U+611B <CJK ideograph meaning "love"> 060433
+ U+10330 GOTHIC LETTER AHSA 201460
+ U+E0041 TAG LATIN CAPITAL LETTER A 600101
+
+
+
+
+Crispin Informational [Page 4]
+
+RFC 4042 UTF-9 and UTF-18 1 April 2005
+
+
+5. Sample Routines
+
+5.1. [UNICODE] Codepoint to UTF-9 Conversion
+
+ The following routines demonstrate conversion from UCS-4 to UTF-9.
+ For simplicity, these routines do not do any validity checking.
+ Routines used in applications SHOULD reject invalid UTF-9 sequences;
+ that is, the first nonet with a value of 400 octal (0x100), or
+ sequences that result in an overflow (exceeding 0x10ffff for
+ [UNICODE]), or codepoints used for UTF-16 surrogates.
+
+ ; Return UCS-4 value from UTF-9 string (PDP-10 assembly version)
+ ; Accepts: P1/ 9-bit byte pointer to UTF-9 string
+ ; Returns +1: Always, T1/ UCS-4 value, P1/ updated byte pointer
+ ; Clobbers T2
+
+ UT92U4: TDZA T1,T1 ; start with zero
+ U92U41: XOR T1,T2 ; insert octet into UCS-4 value
+ LSH T1,^D8 ; shift UCS-4 value
+ ILDB T2,P1 ; get next nonet
+ TRZE T2,400 ; extract octet, any continuation?
+ JRST U92U41 ; yes, continue
+ XOR T1,T2 ; insert final octet
+ POPJ P,
+
+ /* Return UCS-4 value from UTF-9 string (C version)
+ * Accepts: pointer to pointer to UTF-9 string
+ * Returns: UCS-4 character, nonet pointer updated
+ */
+
+ UINT31 UTF9_to_UCS4 (UINT9 **utf9PP)
+ {
+ UINT9 nonet;
+ UINT31 ucs4;
+ for (ucs4 = (nonet = *(*utf9PP)++) & 0xff;
+ nonet & 0x100;
+ ucs4 |= (nonet = *(*utf9PP)++) & 0xff)
+ ucs4 <<= 8;
+ return ucs4;
+ }
+
+5.2. UTF-9 to UCS-4 Conversion
+
+ The following routines demonstrate conversion from UTF-9 to UCS-4.
+ For simplicity, these routines do not do any validity checking.
+ Routines used in applications SHOULD reject invalid UCS-4 codepoints;
+ that is, codepoints used for UTF-16 surrogates or codepoints with
+ values exceeding 0x10ffff for [UNICODE].
+
+
+
+Crispin Informational [Page 5]
+
+RFC 4042 UTF-9 and UTF-18 1 April 2005
+
+
+ ; Write UCS-4 character to UTF-9 string (PDP-10 assembly version)
+ ; Accepts: P1/ 9-bit byte pointer to UTF-9 string
+ ; T1/ UCS-4 character to write
+ ; Returns +1: Always, P1/ updated byte pointer
+ ; Clobbers T1, T2; (T1, T2) must be an accumulator pair
+
+ U42UT9: SETO T2, ; we'll need some of these 1-bits later
+ ASHC T1,-^D8 ; low octet becomes nonet with high 0-bit
+ U32U91: JUMPE T1,U42U9X ; done if no more octets
+ LSHC T1,-^D8 ; shift next octet into T2
+ ROT T2,-1 ; turn it into nonet with high 1 bit
+ PUSHJ P,U42U91 ; recurse for remainder
+ U42U9X: LSHC T1,^D9 ; get next nonet back from T2
+ IDPB T1,P1 ; write nonet
+ POPJ P,
+
+ /* Write UCS-4 character to UTF-9 string (C version)
+ * Accepts: pointer to nonet string
+ * UCS-4 character to write
+ * Returns: updated pointer
+ */
+
+ UINT9 *UCS4_to_UTF9 (UINT9 *utf9P,UINT31 ucs4)
+ {
+ if (ucs4 > 0x100) {
+ if (ucs4 > 0x10000) {
+ if (ucs4 > 0x1000000)
+ *utf9P++ = 0x100 | ((ucs4 >> 24) & 0xff);
+ *utf9P++ = 0x100 | ((ucs4 >> 16) & 0xff);
+ }
+ *utf9P++ = 0x100 | ((ucs4 >> 8) & 0xff);
+ }
+ *utf9P++ = ucs4 & 0xff;
+ return utf9P;
+ }
+
+6. Implementation Experience
+
+ As the sample routines demonstrate, it is quite simple to implement
+ UTF-9 and UTF-18 on a nonet-based architecture. More sophisticated
+ routines can be found in ftp://panda.com/tops-20/utools.mac.txt or
+ from lingling.panda.com via the file <UTF9>UTOOLS.MAC via ANONYMOUS
+ [FTP].
+
+
+
+
+
+
+
+
+Crispin Informational [Page 6]
+
+RFC 4042 UTF-9 and UTF-18 1 April 2005
+
+
+ We are now in the process of implementing support for nonet-based
+ text files and automated transformation between septet, octet, and
+ nonet textual data.
+
+7. References
+
+7.1. Normative References
+
+ [FTP] Postel, J. and J. Reynolds, "File Transfer Protocol",
+ STD 9, RFC 959, October 1985.
+
+ [IAB-CHARACTER] Weider, C., Preston, C., Simonsen, K., Alvestrand,
+ H., Atkinson, R., Crispin, M., and P. Svanberg, "The
+ Report of the IAB Character Set Workshop held 29
+ February - 1 March, 1996", RFC 2130, April 1997.
+
+ [ISO-10646] International Organization for Standardization,
+ "Information Technology - Universal Multiple-octet
+ coded Character Set (UCS)", ISO/IEC Standard 10646,
+ comprised of ISO/IEC 10646-1:2000, "Information
+ technology - Universal Multiple-Octet Coded Character
+ Set (UCS) - Part 1: Architecture and Basic
+ Multilingual Plane", ISO/IEC 10646-2:2001,
+ "Information technology - Universal Multiple-Octet
+ Coded Character Set (UCS) - Part 2: Supplementary
+ Planes" and ISO/IEC 10646-1:2000/Amd 1:2002,
+ "Mathematical symbols and other characters".
+
+ [KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+ [UNICODE] The Unicode Consortium, "The Unicode Standard -
+ Version 3.2", defined by The Unicode Standard,
+ Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN
+ 0-201-61633-5), as amended by the Unicode Standard
+ Annex #27: Unicode 3.1 and by the Unicode Standard
+ Annex #28: Unicode 3.2, March 2002.
+
+7.2. Informative References
+
+ [US-ASCII] American National Standards Institute, "Coded
+ Character Set - 7-bit American Standard Code for
+ Information Interchange", ANSI X3.4, 1986.
+
+ [UTF-16] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of
+ ISO 10646", RFC 2781, February 2000.
+
+
+
+
+
+Crispin Informational [Page 7]
+
+RFC 4042 UTF-9 and UTF-18 1 April 2005
+
+
+ [UTF-7] Goldsmith, D. and M. Davis, "UTF-7 A Mail-Safe
+ Transformation Format of Unicode", RFC 2152, May
+ 1997.
+
+ [UTF-8] Sollins, K., "Architectural Principles of Uniform
+ Resource Name Resolution", RFC 2276, January 1998.
+
+8. Security Considerations
+
+ As with UTF-8, UTF-9 can represent codepoints that are not in
+ [UNICODE]. Applications should validate UTF-9 strings to ensure that
+ all codepoints do not exceed the [UNICODE] maximum of U+10FFFF.
+
+ The sample routines in this document are for example purposes, and
+ make no attempt to validate their arguments, e.g., test for overflow
+ ([UNICODE] values great than 0x10ffff) or codepoints used for
+ surrogates. Besides resulting in invalid data, this can also create
+ covert channels.
+
+9. IANA Considerations
+
+ The IANA shall reserve the charset names "UTF-9" and "UTF-18" for
+ future assignment.
+
+Author's Address
+
+ Mark R. Crispin
+ Panda Programming
+ 6158 NE Lariat Loop
+ Bainbridge Island, WA 98110-2098
+
+ Phone: (206) 842-2385
+ EMail: UTF9@Lingling.Panda.COM
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Crispin Informational [Page 8]
+
+RFC 4042 UTF-9 and UTF-18 1 April 2005
+
+
+Full Copyright Statement
+
+ Copyright (C) The Internet Society (2005).
+
+ This document is subject to the rights, licenses and restrictions
+ contained in BCP 78 and at www.rfc-editor.org/copyright.html, and
+ except as set forth therein, the authors retain all their rights.
+
+ This document and the information contained herein are provided on an
+ "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+ OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
+ ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
+ INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
+ INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+ WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+ The IETF takes no position regarding the validity or scope of any
+ Intellectual Property Rights or other rights that might be claimed to
+ pertain to the implementation or use of the technology described in
+ this document or the extent to which any license under such rights
+ might or might not be available; nor does it represent that it has
+ made any independent effort to identify any such rights. Information
+ on the procedures with respect to rights in RFC documents can be
+ found in BCP 78 and BCP 79.
+
+ Copies of IPR disclosures made to the IETF Secretariat and any
+ assurances of licenses to be made available, or the result of an
+ attempt made to obtain a general license or permission for the use of
+ such proprietary rights by implementers or users of this
+ specification can be obtained from the IETF on-line IPR repository at
+ http://www.ietf.org/ipr.
+
+ The IETF invites any interested party to bring to its attention any
+ copyrights, patents or patent applications, or other proprietary
+ rights that may cover technology that may be required to implement
+ this standard. Please address the information to the IETF at ietf-
+ ipr@ietf.org.
+
+Acknowledgement
+
+ Funding for the RFC Editor function is currently provided by the
+ Internet Society.
+
+
+
+
+
+
+
+Crispin Informational [Page 9]
+