Diffstat (limited to 'doc/rfc/rfc5198.txt')
-rw-r--r-- | doc/rfc/rfc5198.txt | 1067 |
1 files changed, 1067 insertions, 0 deletions
diff --git a/doc/rfc/rfc5198.txt b/doc/rfc/rfc5198.txt new file mode 100644 index 0000000..aa55dcc --- /dev/null +++ b/doc/rfc/rfc5198.txt @@ -0,0 +1,1067 @@ + + + + + + +Network Working Group J. Klensin +Request for Comments: 5198 M. Padlipsky +Obsoletes: 698 March 2008 +Updates: 854 +Category: Standards Track + + + Unicode Format for Network Interchange + +Status of This Memo + + This document specifies an Internet standards track protocol for the + Internet community, and requests discussion and suggestions for + improvements. Please refer to the current edition of the "Internet + Official Protocol Standards" (STD 1) for the standardization state + and status of this protocol. Distribution of this memo is unlimited. + +Abstract + + The Internet today is in need of a standardized form for the + transmission of internationalized "text" information, paralleling the + specifications for the use of ASCII that date from the early days of + the ARPANET. This document specifies that format, using UTF-8 with + normalization and specific line-ending sequences. + +Table of Contents + + 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 2 + 1.1. Requirement for a Standardized Text Stream Format . . . . 2 + 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 3 + 2. Net-Unicode Definition . . . . . . . . . . . . . . . . . . . . 3 + 3. Normalization . . . . . . . . . . . . . . . . . . . . . . . . 5 + 4. Versions of Unicode . . . . . . . . . . . . . . . . . . . . . 5 + 5. Applicability and Stability of this Specification . . . . . . 7 + 5.1. Use in IETF Applications Specifications . . . . . . . . . 7 + 5.2. Unicode Versions and Applicability . . . . . . . . . . . . 7 + 6. Security Considerations . . . . . . . . . . . . . . . . . . . 9 + 7. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 10 + Appendix A. History and Context . . . . . . . . . . . . . . . . . 11 + Appendix B. The ASCII NVT Definition . . . . . . . . . . . . . . 
12 + Appendix C. The Line-Ending Problem . . . . . . . . . . . . . . . 14 + Appendix D. A Note about Related Future Work . . . . . . . . . . 14 + References . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 + Normative References . . . . . . . . . . . . . . . . . . . . . . 15 + Informative References . . . . . . . . . . . . . . . . . . . . . 16 + + + + + + +Klensin & Padlipsky Standards Track [Page 1] + +RFC 5198 Network Unicode March 2008 + + +1. Introduction + +1.1. Requirement for a Standardized Text Stream Format + + Historically, Internet protocols have been largely ASCII-based and + references to "text" in protocols have assumed ASCII text and + specifically text in Network Virtual Terminal ("NVT") or "Network + ASCII" form (see Appendix A and Appendix B). Protocols and formats + that have moved beyond ASCII have included arrangements to + specifically identify the character set and often the language being + used. + + In our more internationalized world, "text" clearly no longer equates + unambiguously to "network ASCII". Fortunately, however, we are + converging on Unicode [Unicode] [ISO10646] as a single international + interchange character coding and no longer need to deal with per- + script standards for character sets (e.g., one standard for each of + Arabic, Cyrillic, Devanagari, etc., or even standards keyed to + languages that are usually considered to share a script, such as + French, German, or Swedish). Unfortunately, though, while it is + certainly time to define a Unicode-based text type for use as a + common text interchange format, "use Unicode" involves even more + ambiguity than "use ASCII" did decades ago. + + Unicode identifies each character by an integer, called its "code + point", in the range 0-0x10ffff. 
These integers can be encoded into + byte sequences for transmission in at least three standard and + generally-recognized encoding forms, all of which are completely + defined in The Unicode Standard and the documents cited below: + + o UTF-8 [RFC3629] defines a variable-length encoding that may be + applied uniformly to all code points. + + o UTF-16 [RFC2781] encodes the range of Unicode characters whose + code points are less than 65536 straightforwardly as 16-bit + integers, and provides a "surrogate" mechanism for encoding larger + code points in 32 bits. + + o UTF-32 (also known as UCS-4) simply encodes each code point as a + 32-bit integer. + + Older forms and nomenclature, such as the 16-bit UCS-2, are now + strongly discouraged. + + As with ASCII, any of these forms may be used with different line- + ending conventions. That flexibility can be an additional source of + confusion with, e.g., index (offset) references into documents based + on character counts. + + + +Klensin & Padlipsky Standards Track [Page 2] + +RFC 5198 Network Unicode March 2008 + + + This document proposes to establish "Net-Unicode" as a new + standardized text transmission form for the Internet, to serve as an + internationalized alternative for NVT ASCII when specified in new -- + and, where appropriate, updated -- protocols. UTF-8 [RFC3629] is + chosen for the coding because it has good compatibility properties + with ASCII and for other reasons discussed in the existing IETF + character set policy [RFC2277]. "Net-Unicode" is specified in + Section 2; the subsequent sections of the document provide background + and explanation. + + Whenever there is a choice, Unicode SHOULD be used with the text + encoding specified here. This combination is preferred to the + double-byte encoding of "extended ASCII" [RFC0698] or the assorted + per-language or per-country character coding systems. + +1.2. 
Terminology + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this + document are to be interpreted as described in [RFC2119]. + +2. Net-Unicode Definition + + The Network Unicode format (Net-Unicode) is defined as follows. + Parts of this definition are deliberately informal, providing + guidance for specific profiles or rules in the protocols that + reference this one rather than firm rules that apply globally. + + 1. Characters MUST be encoded in UTF-8 as defined in [RFC3629]. + + 2. If the protocol has the concept of "lines", line-endings MUST be + indicated by the sequence Carriage-Return (CR, U+000D) followed + by Line-Feed (LF, U+000A), often known just as CRLF. CR SHOULD + NOT appear except when followed by LF. The only other allowed + context in which CR is permitted is in the combination CR NUL, + which is not recommended (see the note at the end of this + section). + + 3. The control characters in the ASCII range (U+0000 to U+001F and + U+007F to U+009F) SHOULD generally be avoided. Space (SP, + U+0020), CR, LF, and Form Feed (FF, U+000C) are exceptions to + this principle, but use of all but the first requires care as + discussed elsewhere in this document. The so-called "C1 + Controls" (U+0080 through U+009F), which did not appear in ASCII, + MUST NOT appear. + + FF should be used only with caution: it does not have a standard + and universal interpretation and, in particular, if its use + + + +Klensin & Padlipsky Standards Track [Page 3] + +RFC 5198 Network Unicode March 2008 + + + assumes a page length, such assumptions may not be appropriate in + international contexts (e.g., considering 8.5x11 inch paper + versus A4). Other control characters are used to affect display + format, control devices, or to structure files. None of those + uses is appropriate for streams of plain text. + + 4. 
Before transmission, all character sequences SHOULD be normalized + according to Unicode normalization form "NFC" (see Section 3). + + 5. As suggested in Section 6 of RFC 3629, the Byte Order Mark + ("BOM") signature MUST NOT appear at the beginning of these text + strings. + + 6. Systems conforming to this specification MUST NOT transmit any + string containing any code point that is unassigned in the + version of Unicode on which they are dependent. The version of + NFC and the version of Unicode used by that system MUST be + consistent. + + The use of LF without CR is questionable; see Appendix B for more + discussion. The newer control characters IND (U+0084) and NEL ("Next + Line", U+0085) might have been used to disambiguate the various line- + ending situations, but, because their use has not been established on + the Internet, because many protocols require CRLF, and because IND + and NEL fall within the "C1 Controls" group (see below), they MUST + NOT be used. Similar observations apply to the yet newer line and + paragraph separators at U+2028 and U+2029 and any future characters + that might be defined to serve these functions. For this + specification and protocols that depend on it, lines end in CRLF and + only in CRLF. Anything that does not end in CRLF is either not a + line or is severely malformed. + + The NVT specification contained a number of additional provisions, + e.g., for the optional use of backspacing and "bare CR" (sent as CR + NUL) to generate overstruck character sequences. The much greater + number of precomposed characters in Unicode, the availability of + combining characters, and the growing use of markup conventions of + various types to show, e.g., emphasis (rather than attempting to do + that via the use of special characters), should make such sequences + largely unnecessary. These sequences SHOULD be avoided if at all + possible. 
However, because they were optional in NVT applications + and this specification is an NVT superset, they cannot be prohibited + entirely. The most important of these rules is that CR MUST NOT + appear unless it is immediately followed by LF (indicating end of + line) or NUL. Because NUL (an octet whose value is all zeros, i.e., + %x00 in the notation of [RFC5234]) is hostile to programming + languages that use that character as a string delimiter, the CR NUL + sequence SHOULD be avoided for that reason as well. + + + +Klensin & Padlipsky Standards Track [Page 4] + +RFC 5198 Network Unicode March 2008 + + +3. Normalization + + There are cases where strings of Unicode are fundamentally + equivalent, essentially representing the same text. These are called + "canonical equivalents" in the Unicode Standard. For example, the + following pairs of strings are canonically equivalent: + + U+2126 OHM SIGN + U+03A9 GREEK CAPITAL LETTER OMEGA + + U+0061 LATIN SMALL LETTER A, U+0300 COMBINING GRAVE ACCENT + U+00E0 LATIN SMALL LETTER A WITH GRAVE + + Comparison of strings becomes much easier if any such cases are + always represented by a single unique form. The Unicode Consortium + specifies a normalization form, known as NFC [NFC], which provides + the necessary mappings and mechanisms to convert all canonically + equivalent sequences to a single unique form. Typically, this form + produces precomposed characters for any sequences that can be + represented in that fashion. It also reorders other combining marks + so that they have a unique and unambiguous order. + + Of the various normalization forms defined as part of Unicode, NFC is + closest to actual use in practice, minimizes side-effects due to + considering characters equivalent that may not be equivalent in all + situations, and typically requires the least work when converting + from non-Unicode encodings. 
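The canonical-equivalence pairs listed above can be checked mechanically. The following is a minimal, non-normative sketch in Python, assuming the standard library's unicodedata module as the NFC implementation. The helper names to_nfc, canonically_equal, and net_unicode_bytes are hypothetical and are not defined by this specification. The results depend on the version of Unicode built into the library, which can be inspected via unicodedata.unidata_version.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text converted to Unicode normalization form NFC."""
    return unicodedata.normalize("NFC", text)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return to_nfc(a) == to_nfc(b)


def net_unicode_bytes(text: str) -> bytes:
    """Hypothetical helper: normalize to NFC, then encode as UTF-8.

    Covers only the normalization and encoding steps; it does not
    check line endings, control characters, or unassigned code points.
    """
    return to_nfc(text).encode("utf-8")


if __name__ == "__main__":
    # The two pairs given as examples in the text above.
    print(canonically_equal("\u2126", "\u03a9"))    # OHM SIGN vs. OMEGA
    print(canonically_equal("a\u0300", "\u00e0"))   # a + grave vs. a-grave
    print(net_unicode_bytes("a\u0300").hex())
```

Comparing through canonically_equal rather than comparing raw strings is one way for a receiver to tolerate unnormalized input from senders whose libraries it does not control.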
+ + The section above requires that, except in very unusual + circumstances, all Net-Unicode strings be transmitted in normalized + form. Recognition of the fact that some implementations of + applications may rely on operating system libraries over which they + have little control and adherence to the robustness principle + suggests that receivers of such strings should be prepared to receive + unnormalized ones and to not react to that in excessive ways. + +4. Versions of Unicode + + Unicode changes and expands over time. Large blocks of space are + reserved for future expansion. New versions, which appear at regular + intervals, add new scripts and characters. Occasionally they also + change some property definitions. In retrospect, one of the + advantages of ASCII [ASCII] when it was chosen was that the code + space was full when the Standard was first published. There was no + practical way to add characters or change code point assignments + without being obviously incompatible. + + + + + +Klensin & Padlipsky Standards Track [Page 5] + +RFC 5198 Network Unicode March 2008 + + + While there are some security issues if people deliberately try to + trick the system (see Section 6), Unicode version changes should not + have a significant impact on the text stream specification of this + document for the following reasons: + + o The transformation between Unicode code table positions and the + corresponding UTF-8 code is algorithmic; it does not depend on + whether a code point has been assigned or not. + + o The normalization recommended here, NFC (see Section 3), performs + a very limited set of mappings, much more limited than those of + the more extensive NFKC used in, e.g., Nameprep [RFC3491]. + + The NFC tables may be updated over time as new characters are added, + but the Unicode Consortium has guaranteed the stability of all NFC + strings. 
That is, if a string does not contain any unassigned + characters, and it is normalized according to NFC, it will always be + normalized according to all future versions of the Unicode Standard. + The stability of the Net-Unicode format is thus guaranteed when any + implementation that converts text into Net-Unicode format does not + permit unassigned characters. + + Because Unicode code points that are reserved for private use do not + have standard definitions or normalization interpretations, they + SHOULD be avoided in strings intended for Internet interchange. + + Were Unicode to be changed in a way that violated these assumptions, + i.e., that either invalidated the byte string order specified in RFC + 3629 or that changed the stability of NFC as stated above, this + specification would not apply. Put differently, this specification + applies only to versions of Unicode starting with version 5.0 and + extending to, but not including, any version for which changes are + made in either the UTF-8 definition or to NFC stability. Such + changes would violate established Unicode policies and are hence + unlikely, but, should they occur, it would be necessary to evaluate + them for compatibility with this specification and other Internet + uses of NFC. + + If the specification of a protocol references this one, strings that + are received by that protocol and that appear to be UTF-8 and are not + otherwise identified (e.g., by charset labeling) SHOULD be treated as + using UTF-8 in conformance with this specification. + + + + + + + + + +Klensin & Padlipsky Standards Track [Page 6] + +RFC 5198 Network Unicode March 2008 + + +5. Applicability and Stability of this Specification + +5.1. 
Use in IETF Applications Specifications + + During the development of this specification, there was some + confusion about where it would be useful given that, e.g., the + individual MIME media types used in email and with HTTP have their + own rules about UTF-8 character types and normalization, and the + application transport protocols impose their own conventions about + line endings. There are three answers. The first is that, in + retrospect, it would have been better to have those protocols and + content types standardized in the way specified here, even though it + is certainly too late to change them at this time. The second is + that we have several protocols that are dependent on either the + original Telnet design or other arrangements requiring a standard, + interoperable, string definition without specific content-labels of + one sort or another. Whois [RFC3912] is an example member of this + group. As consideration is given to upgrading them for non-ASCII + use, this specification provides a normative reference that provides + the same stability that NVT has provided the ASCII forms. This + specification is intended for use by other specifications that have + not yet defined how to use Unicode. Having a preferred standard + Internet definition for Unicode text streams -- rather than just one + for transmission codings -- may help improve the specification and + interoperability of protocols to be developed in the future. This + specification is not intended for use with specifications that + already allow the use of UTF-8 and precisely define that use. + +5.2. Unicode Versions and Applicability + + The IETF faces a practical dilemma with regard to versions of + Unicode. Each new version brings with it new characters and + sometimes new combining characters. Version 5.0 introduces the new + concept of sequences of characters named as if they were individual + characters (see [NamedSequences]). 
The normalization represented by + NFC is stable if all strings are transmitted and stored in normalized + form, if corrections are never made to character definitions or + normalization tables, and if unassigned code points are never used. + The latter is important because an unassigned code point always + normalizes to itself. However, if the same code point is assigned to + a character in a future version, it may participate in some other + normalization mapping (some specific difficulties in this regard are + discussed in [RFC4690]). It is worth noting that transmission in + normalized form is not required either by the IETF's UTF-8 Standard + [RFC3629] or by standards dependent on the current version of + Stringprep [RFC3454]. + + + + + +Klensin & Padlipsky Standards Track [Page 7] + +RFC 5198 Network Unicode March 2008 + + + All would be well with this as described in Section 4 except for one + problem: Applications typically do not perform their own conversions + to Unicode and may not perform their own normalizations but instead + rely on operating system or language library functions -- functions + that may be upgraded or otherwise changed without changes to the + application code itself. Consequently, there may be no plausible way + for an application to know which version of Unicode, or which version + of the normalization procedures, it is utilizing, nor is there any + way by which it can guarantee that the two will be consistent. + + Because of per-version changes in definitions and tables, Stringprep + and documents depending on it are now tied to Unicode Version 3.2 + [Unicode32] and full interoperability of Internet Standard UTF-8 + [RFC3629], when used with normalization as specified here, is + dependent on normalization definitions and the definition of UTF-8 + itself not changing after Unicode Version 5.0. These assumptions + seem fairly safe, but they are still assumptions. 
Rather than being + linked to the latest available version of Unicode, version 5.0 + [Unicode] or broader concepts of version independence based on + specific assumptions and conditions, this specification could + reasonably have been tied, like Stringprep and Nameprep to Unicode + 3.2 [Unicode32] or some more recent intermediate version, but, in + addition to the obvious disadvantages of having different IETF + standards tied to different versions of Unicode, the library-based + application implementation behavior described above makes these + version linkages nearly meaningless in practice. + + In theory, one can get around this problem in four ways: + + 1. Freeze on a particular version of Unicode and try to insist that + applications enforce that version by, e.g., containing lists of + unassigned characters and prohibiting their use. Of course, this + would prohibit evolution to include newly-added scripts and the + tables of unassigned code points would be cumbersome. + + 2. Require that every Unicode "text" string or file start with a + version indication, somewhat akin to the "byte order mark" + indicator. It is unlikely that this provision would be + practical. More important, it would require that each + application implementation be prepared to either support multiple + normalization tables and versions or that it reject text from + Unicode versions with which it was not prepared to deal. + + 3. Devise a different set of normalization rules that would, e.g., + guarantee that no character assigned to a previously-unassigned + code point in Unicode was ever normalized to anything but itself + and use those rules instead of NFC. 
It is not clear whether or + not such a set of rules is possible or whether some other + + + +Klensin & Padlipsky Standards Track [Page 8] + +RFC 5198 Network Unicode March 2008 + + + completely stable set of rules could be devised, perhaps in + combination with restrictions on the ways in which characters + were added in future versions of Unicode. + + 4. Devise a normalization process that is otherwise equivalent to + NFC but that rejects code points that are unassigned in the + current version of Unicode, rather than mapping those code points + to themselves. This would still leave some risk of incompatible + corrections in Unicode and possibly a few edge cases, but it is + probably stable enough for Internet use in the overwhelming + number of cases. This process has been discussed in the Unicode + Consortium under the name "Stable NFC". + + None of these approaches seems ideal: the ideal procedure would be as + stable and predictable as ASCII has been. But that level is simply + not feasible as long as Unicode continues to evolve by the addition + of new code points and scripts. The fourth option listed above + appears to be a reasonable compromise. + +6. Security Considerations + + This specification provides a standard form for the use of Unicode as + "network text". Most of the same security issues that apply to + UTF-8, as discussed in [RFC3629], apply to it, although it should be + slightly less subject to some risks by virtue of requiring NFC + normalization and generally being somewhat more restrictive. + However, shifts in Unicode versions, as discussed in Section 5.2, may + introduce other security issues. + + Programs that receive these streams should use extreme caution about + assuming that incoming data are normalized, since it might be + possible to use unnormalized forms, as well as invalid UTF-8, as part + of an attack. 
In particular, firewalls and other systems that + interpret UTF-8 streams should be developed with the clear knowledge + that an attacker may deliberately send unnormalized text, for + instance, to avoid detection by naive text-matching systems. + + NVT contains a requirement, of necessity repeated here (see + Section 2), that the CR character be immediately followed by either + LF or ASCII NUL (an octet with all bits zero). NUL may be + problematic for some programming languages that use it as a string + terminator, and hence a trap for the unwary, unless caution is used. + This may be an additional reason to avoid the use of CR entirely, + except in sequence with LF, as suggested above. + + The discussion about Unicode versions above (see Section 4 and + Section 5.2) makes several assumptions about future versions of + Unicode, about NFC normalization being applied properly, and about + + + +Klensin & Padlipsky Standards Track [Page 9] + +RFC 5198 Network Unicode March 2008 + + + UTF-8 being processed and transmitted exactly as specified in RFC + 3629. If any of those assumptions are not correct, then there are + cases in which strings that would be considered equivalent do not + compare equal. Robust code should be prepared for those + possibilities. + +7. Acknowledgments + + Many thanks to Mark Davis, Martin Duerst, and Michel Suignard for + suggestions about Unicode normalization that led to the format + described here, and especially to Mark for providing the paragraphs + that describe the role of NFC. Thanks also to Mark, Doug Ewell, and + Asmus Freytag for corrected text describing Unicode transmission + forms, and to Tim Bray, Carsten Bormann, Stephane Bortzmeyer, Martin + Duerst, Frank Ellermann, Clive D.W. Feather, Ted Hardie, Bjoern + Hoehrmann, Alfred Hoenes, Kent Karlsson, Bill McQuillan, George + Michaelson, Chris Newman, and Marcos Sanz for a number of helpful + comments and clarification requests. 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Klensin & Padlipsky Standards Track [Page 10] + +RFC 5198 Network Unicode March 2008 + + +Appendix A. History and Context + + This subsection contains a review of prior work in the ARPANET and + Internet to establish a standard text type, work that establishes the + context and motivation for the approach taken in this document. The + text is explanatory rather than normative: nothing in this section is + intended to change or update any current specification. Those who + are uninterested in this review and analysis can safely skip this + section. + + One of the earlier application design decisions made in the + development of ARPANET, a decision that was carried forward into the + Internet, was the decision to standardize on a single and very + specific coding for "text" to be passed across the network [RFC0020]. + Hosts on the network were then responsible for translating or mapping + from whatever character coding conventions were used locally to that + common intermediate representation, with sending hosts mapping to it + and receiving ones mapping from it to their local forms as needed. + It is interesting to note that at the time the ARPANET was being + developed, participating host operating systems used at least three + different character coding standards: the antiquated BCD (Binary + Coded Decimal), the then-dominant major manufacturer-backed EBCDIC + (Extended BCD Interchange Code), and the then-still emerging ASCII + (American Standard Code for Information Interchange). Since the + ARPANET was an "open" project and EBCDIC was intimately linked to a + particular hardware vendor, the original Network Working Group agreed + that its standard should be ASCII. 
That ASCII form was precisely + "7-bit ASCII in an 8-bit field", which was in effect a compromise + between hosts that were natively 7-bit oriented (e.g., with five + seven-bit characters in a 36-bit word), those that were 8-bit + oriented (using eight-bit characters) and those that placed the + seven-bit ASCII characters in 9-bit fields with two leading zero bits + (four characters in a 36-bit word). + + More standardization was suggested in the first preliminary + description of the Telnet protocol [RFC0097]. With the iterations of + that protocol [RFC0137] [RFC0139] and the drawing together of an + essentially formal definition somewhat later [RFC0318], a standard + abstraction, the Network Virtual Terminal (NVT), was established. NVT + character-coding conventions (initially called "Telnet ASCII" and + later called "NVT ASCII", or, more casually, "network ASCII") + included the requirement that Carriage Return followed by Line Feed + (CRLF) be the common representation for ending lines of text (given + that some participating "Host" operating systems used the one + natively, some the other, at least one used both, and a few used + neither (preferring variable-length lines with counts or special + delimiters or markers instead)) and specified conventions for some + other characters. Also, since NVT ASCII was restricted to seven-bit + + + +Klensin & Padlipsky Standards Track [Page 11] + +RFC 5198 Network Unicode March 2008 + + + characters, use of the high-order bit in octets was reserved for the + transmission of control signaling information. + + At a very high level, the concept was that a system could use + whatever character coding and line representations were appropriate + locally, but text transmitted over the network as text must conform + to the single "network virtual terminal" convention. 
Virtually all + early Internet protocols that presume transfer of "text" assume this + virtual terminal model, although different ones assume or limit it in + different ways. Telnet, the command stream and ASCII Type in FTP + [RFC0542], the message stream in SMTP transfer [RFC2821], and the + strings passed to finger [RFC0742] and whois [RFC0954] are the + classic examples. More recently, HTTP [RFC1945] [RFC2616] follows + the same general model but permits 8-bit data and leaves the line end + sequence unspecified (the latter has been the source of a significant + number of problems). + +Appendix B. The ASCII NVT Definition + + The main body of this specification is intended as an update to, and + internationalized version of, the Net-ASCII definition. The + specification is self-contained in that parts of the Net-ASCII + definition that are no longer recommended are not included above. + Because Net-ASCII evolved somewhat over time and there has been + debate about which specification is the "official" Net-ASCII, it is + appropriate to review the key elements of that definition here. This + review is informal with regard to the contents of Net-ASCII and + should not be considered as a normative update or summary of the + earlier specifications (Section 2 does specify some normative updates + to those specifications and some comments below are consistent with + it). + + The first part of the section titled "THE NVT PRINTER AND KEYBOARD" + in RFC 854 [RFC0854] is generally, although not universally, + considered to be the normative definition of the (ASCII) Network + Virtual Terminal and hence of Net-ASCII. It includes not only the + graphic ASCII characters but a number of control characters. The + latter are given Internet-specific meanings that are often more + specific than the definitions in the ASCII specification. In today's + usage, and for the present specification, the following + clarifications and updates to that list should be noted. 
Each one is + accompanied by a brief explanation of the reason why the original + specification is no longer appropriate. + + 1. The "defined but not required" codes -- BEL (U+0007), BS + (U+0008), HT (U+0009), VT (U+000B), and FF (U+000C) -- and the + undefined control codes ("C0") SHOULD NOT be used unless required + by exceptional circumstances. Either their original "network + + + +Klensin & Padlipsky Standards Track [Page 12] + +RFC 5198 Network Unicode March 2008 + + + printer" definitions are no longer in general use, common + practice has evolved away from the formats specified there, or + their use to simulate characters that are better handled by + Unicode is no longer appropriate. While the appearance of some + of these characters on the list may seem surprising, BS now has + an ambiguous interpretation in practice (erasing in some systems + but not in others), the width associated with HT varies with the + environment, and VT and FF do not have a uniform effect with + regard to either vertical positioning or the associated + horizontal position result. Of course, telnet escapes are not + considered part of the data stream and hence are unaffected by + this provision. + + 2. In Net-ASCII, CR MUST NOT appear except when immediately followed + by either NUL or LF, with the latter (CR LF) designating the "new + line" function. Today and as specified above, CR should + generally appear only when followed by LF. Because page layout + is better done in other ways, because NUL has a special + interpretation in some programming languages, and to avoid other + types of confusion, CR NUL should preferably be avoided as + specified above. + + 3. LF CR SHOULD NOT appear except as a side-effect of multiple CR LF + sequences (e.g., CR LF CR LF). + + 4. The historical NVT documents do not call out either "bare LF" (LF + without CR) or HT for special treatment. Both have generally + been understood to be problematic. 
In the case of LF, there is a + difference in interpretation as to whether its semantics imply + "go to same position on the next line" or "go to the first + position on the next line" and interoperability considerations + suggest not depending on which interpretation the receiver + applies. At the same time, misinterpretation of LF is less + harmful than misinterpretation of "bare" CR: in the CR case, text + may be erased or made completely unreadable; in the LF one, the + worst consequence is a very funny-looking display. Obviously, HT + is problematic because there is no standard way to transmit + intended tab position or width information in running text. + Again, the harm is unlikely to be great if HT is simply + interpreted as one or more spaces, but, in general, it cannot be + relied upon to format information. + + It is worth noting that the telnet IAC character (an octet consisting + of all ones, i.e., %xFF) itself is not a problem for UTF-8 since that + particular octet cannot appear in a valid UTF-8 string. However, + while few of them have been used, telnet permits other command- + introducer characters whose bit sequences in an octet may be part of + valid UTF-8 characters. While it causes no ambiguity in UTF-8, + + + +Klensin & Padlipsky Standards Track [Page 13] + +RFC 5198 Network Unicode March 2008 + + + Unicode assigns a graphic character ("Latin Small Letter Y with + Diaeresis") to U+00FF (octets C3 BF in UTF-8). Some caution is + clearly in order in this area. + +Appendix C. The Line-Ending Problem + + The definition of how a line ending should be denoted in plain text + strings on the wire for the Internet has been controversial from even + before the introduction of NVT. Some have argued that recipients + should be required to interpret almost anything that a sender might + intend as a line ending as actually a line ending. 
Others have
+   pointed out that this would lead to some ambiguities of
+   interpretation and presentation and would violate the principle that
+   we should minimize the number of forms that are permitted on the wire
+   in order to promote interoperability and eliminate the "every
+   recipient needs to understand every sender format" problem. The
+   design of this specification, like that of NVT, takes the latter
+   approach. Its designers believe that there is little point in a
+   standard if it is to specify "anyone can do whatever they like and
+   the receiver just needs to cope".
+
+   A further discussion of the nature and evolution of the line-ending
+   problem appears in Section 5.8 of the Unicode Standard [Unicode] and
+   is suggested for additional reading. If we were starting with the
+   Internet today, it would probably be sensible to follow the
+   recommendation there and use LS (U+2028) exclusively, in preference
+   to CRLF. However, the installed base of use of CRLF and the
+   importance of forward compatibility with NVT and protocols that
+   assume it makes that impossible, so it is necessary to continue using
+   CRLF as the "New Line Function" ("NLF", see the terminology section
+   in that reference).
+
+Appendix D. A Note about Related Future Work
+
+   Consideration should be given to a Telnet (or SSH [RFC4251]) option
+   to specify this type of stream and an FTP extension [RFC0959] to
+   permit a new "Unicode text" data TYPE.
+
+References
+
+Normative References
+
+   [ISO10646]        International Organization for Standardization,
+                     "Information Technology - Universal Multiple-Octet
+                     Coded Character Set (UCS) - Part 1: Architecture
+                     and Basic Multilingual Plane",
+                     ISO/IEC 10646-1:2000, October 2000.
+
+   [NFC]             Davis, M. and M. Duerst, "Unicode Standard Annex
+                     #15: Unicode Normalization Forms", October 2006,
+                     <http://www.unicode.org/reports/tr15/>.
+
+   [RFC2119]         Bradner, S., "Key words for use in RFCs to Indicate
+                     Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+   [RFC3629]         Yergeau, F., "UTF-8, a transformation format of ISO
+                     10646", STD 63, RFC 3629, November 2003.
+
+   [RFC5234]         Crocker, D. and P. Overell, "Augmented BNF for
+                     Syntax Specifications: ABNF", STD 68, RFC 5234,
+                     January 2008.
+
+   [Unicode]         The Unicode Consortium, "The Unicode Standard,
+                     Version 5.0", 2007.
+
+                     Boston, MA, USA: Addison-Wesley. ISBN
+                     0-321-48091-0
+
+   [Unicode32]       The Unicode Consortium, "The Unicode Standard,
+                     Version 3.0", 2000.
+
+                     (Reading, MA, Addison-Wesley, 2000. ISBN
+                     0-201-61633-5). Version 3.2 consists of the
+                     definition in that book as amended by the Unicode
+                     Standard Annex #27: Unicode 3.1
+                     (http://www.unicode.org/reports/tr27/) and by the
+                     Unicode Standard Annex #28: Unicode 3.2
+                     (http://www.unicode.org/reports/tr28/).
+
+Informative References
+
+   [ASCII]           American National Standards Institute (formerly
+                     United States of America Standards Institute), "USA
+                     Code for Information Interchange", ANSI X3.4-1968,
+                     1968.
+
+                     ANSI X3.4-1968 has been replaced by newer versions
+                     with slight modifications, but the 1968 version
+                     remains definitive for the Internet. ISO 646
+                     International Reference Version (IRV)
+                     [ISO.646.1991] is usually considered equivalent to
+                     ASCII.
+
+   [ISO.646.1991]    International Organization for Standardization,
+                     "Information technology - ISO 7-bit coded character
+                     set for information interchange", ISO Standard 646,
+                     1991.
+
+   [NamedSequences]  The Unicode Consortium, "NamedSequences-4.1.0.txt",
+                     2005,
+                     <http://www.unicode.org/Public/UNIDATA/NamedSequences.txt>.
+
+   [RFC0020]         Cerf, V., "ASCII format for network interchange",
+                     RFC 20, October 1969.
+
+   [RFC0097]         Melvin, J. and R. Watson, "First Cut at a Proposed
+                     Telnet Protocol", RFC 97, February 1971.
+
+   [RFC0137]         O'Sullivan, T., "Telnet Protocol - a proposed
+                     document", RFC 137, April 1971.
+
+   [RFC0139]         O'Sullivan, T., "Discussion of Telnet Protocol",
+                     RFC 139, May 1971.
+
+   [RFC0318]         Postel, J., "Telnet Protocols", RFC 318,
+                     April 1972.
+
+   [RFC0542]         Neigus, N., "File Transfer Protocol", RFC 542,
+                     August 1973.
+
+   [RFC0698]         Mock, T., "Telnet extended ASCII option", RFC 698,
+                     July 1975.
+
+   [RFC0742]         Harrenstien, K., "NAME/FINGER Protocol", RFC 742,
+                     December 1977.
+
+   [RFC0854]         Postel, J. and J. Reynolds, "Telnet Protocol
+                     Specification", STD 8, RFC 854, May 1983.
+
+   [RFC0954]         Harrenstien, K., Stahl, M., and E. Feinler,
+                     "NICNAME/WHOIS", RFC 954, October 1985.
+
+   [RFC0959]         Postel, J. and J. Reynolds, "File Transfer
+                     Protocol", STD 9, RFC 959, October 1985.
+
+   [RFC1945]         Berners-Lee, T., Fielding, R., and H. Nielsen,
+                     "Hypertext Transfer Protocol -- HTTP/1.0",
+                     RFC 1945, May 1996.
+
+   [RFC2277]         Alvestrand, H., "IETF Policy on Character Sets and
+                     Languages", BCP 18, RFC 2277, January 1998.
+
+   [RFC2616]         Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
+                     Masinter, L., Leach, P., and T. Berners-Lee,
+                     "Hypertext Transfer Protocol -- HTTP/1.1",
+                     RFC 2616, June 1999.
+
+   [RFC2781]         Hoffman, P. and F. Yergeau, "UTF-16, an encoding of
+                     ISO 10646", RFC 2781, February 2000.
+
+   [RFC2821]         Klensin, J., "Simple Mail Transfer Protocol",
+                     RFC 2821, April 2001.
+
+   [RFC3454]         Hoffman, P. and M. Blanchet, "Preparation of
+                     Internationalized Strings ("stringprep")",
+                     RFC 3454, December 2002.
+
+   [RFC3491]         Hoffman, P. and M. Blanchet, "Nameprep: A
+                     Stringprep Profile for Internationalized Domain
+                     Names (IDN)", RFC 3491, March 2003.
+
+   [RFC3912]         Daigle, L., "WHOIS Protocol Specification",
+                     RFC 3912, September 2004.
+
+   [RFC4251]         Ylonen, T. and C. Lonvick, "The Secure Shell (SSH)
+                     Protocol Architecture", RFC 4251, January 2006.
+
+   [RFC4690]         Klensin, J., Faltstrom, P., Karp, C., and IAB,
+                     "Review and Recommendations for Internationalized
+                     Domain Names (IDNs)", RFC 4690, September 2006.
+
+Authors' Addresses
+
+   John C Klensin
+   1770 Massachusetts Ave, #322
+   Cambridge, MA 02140
+   USA
+
+   Phone: +1 617 491 5735
+   EMail: john-ietf@jck.com
+
+
+   Michael A. Padlipsky
+   8011 Stewart Ave.
+   Los Angeles, CA 90045
+   USA
+
+   Phone: +1 310-670-4288
+   EMail: the.map@alum.mit.edu
+
+Full Copyright Statement
+
+   Copyright (C) The IETF Trust (2008).
+
+   This document is subject to the rights, licenses and restrictions
+   contained in BCP 78, and except as set forth therein, the authors
+   retain all their rights.
+
+   This document and the information contained herein are provided on an
+   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
+   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
+   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
+   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+   The IETF takes no position regarding the validity or scope of any
+   Intellectual Property Rights or other rights that might be claimed to
+   pertain to the implementation or use of the technology described in
+   this document or the extent to which any license under such rights
+   might or might not be available; nor does it represent that it has
+   made any independent effort to identify any such rights. Information
+   on the procedures with respect to rights in RFC documents can be
+   found in BCP 78 and BCP 79.
+
+   Copies of IPR disclosures made to the IETF Secretariat and any
+   assurances of licenses to be made available, or the result of an
+   attempt made to obtain a general license or permission for the use of
+   such proprietary rights by implementers or users of this
+   specification can be obtained from the IETF on-line IPR repository at
+   http://www.ietf.org/ipr.
+
+   The IETF invites any interested party to bring to its attention any
+   copyrights, patents or patent applications, or other proprietary
+   rights that may cover technology that may be required to implement
+   this standard. Please address the information to the IETF at
+   ietf-ipr@ietf.org.