1 files changed, 1067 insertions, 0 deletions
diff --git a/doc/rfc/rfc5198.txt b/doc/rfc/rfc5198.txt
new file mode 100644
index 0000000..aa55dcc
--- /dev/null
+++ b/doc/rfc/rfc5198.txt
@@ -0,0 +1,1067 @@
+
+
+
+
+
+
+Network Working Group                                         J. Klensin
+Request for Comments: 5198                                  M. Padlipsky
+Obsoletes: 698                                                March 2008
+Updates: 854
+Category: Standards Track
+
+
+                 Unicode Format for Network Interchange
+
+Status of This Memo
+
+   This document specifies an Internet standards track protocol for the
+   Internet community, and requests discussion and suggestions for
+   improvements.  Please refer to the current edition of the "Internet
+   Official Protocol Standards" (STD 1) for the standardization state
+   and status of this protocol.  Distribution of this memo is unlimited.
+
+Abstract
+
+   The Internet today is in need of a standardized form for the
+   transmission of internationalized "text" information, paralleling the
+   specifications for the use of ASCII that date from the early days of
+   the ARPANET.  This document specifies that format, using UTF-8 with
+   normalization and specific line-ending sequences.
+
+Table of Contents
+
+   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  2
+     1.1.  Requirement for a Standardized Text Stream Format  . . . .  2
+     1.2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . .  3
+   2.  Net-Unicode Definition . . . . . . . . . . . . . . . . . . . .  3
+   3.  Normalization  . . . . . . . . . . . . . . . . . . . . . . . .  5
+   4.  Versions of Unicode  . . . . . . . . . . . . . . . . . . . . .  5
+   5.  Applicability and Stability of this Specification  . . . . . .  7
+     5.1.  Use in IETF Applications Specifications  . . . . . . . . .  7
+     5.2.  Unicode Versions and Applicability . . . . . . . . . . . .  7
+   6.  Security Considerations  . . . . . . . . . . . . . . . . . . .  9
+   7.  Acknowledgments  . . . . . . . . . . . . . . . . . . . . . . . 10
+   Appendix A.  History and Context . . . . . . . . . . . . . . . . . 11
+   Appendix B.  The ASCII NVT Definition  . . . . . . . . . . . . . . 12
+   Appendix C.  The Line-Ending Problem . . . . . . . . . . . . . . . 14
+   Appendix D.  A Note about Related Future Work  . . . . . . . . . . 14
+   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
+     Normative References . . . . . . . . . . . . . . . . . . . . . . 15
+     Informative References . . . . . . . . . . . . . . . . . . . . . 16
+
+
+
+
+
+
+Klensin & Padlipsky         Standards Track                     [Page 1]
+
+RFC 5198                    Network Unicode                   March 2008
+
+
+1.  Introduction
+
+1.1.  Requirement for a Standardized Text Stream Format
+
+   Historically, Internet protocols have been largely ASCII-based and
+   references to "text" in protocols have assumed ASCII text and
+   specifically text in Network Virtual Terminal ("NVT") or "Network
+   ASCII" form (see Appendix A and Appendix B).  Protocols and formats
+   that have moved beyond ASCII have included arrangements to
+   specifically identify the character set and often the language being
+   used.
+
+   In our more internationalized world, "text" clearly no longer equates
+   unambiguously to "network ASCII".  Fortunately, however, we are
+   converging on Unicode [Unicode] [ISO10646] as a single international
+   interchange character coding and no longer need to deal with per-
+   script standards for character sets (e.g., one standard for each of
+   Arabic, Cyrillic, Devanagari, etc., or even standards keyed to
+   languages that are usually considered to share a script, such as
+   French, German, or Swedish).  Unfortunately, though, while it is
+   certainly time to define a Unicode-based text type for use as a
+   common text interchange format, "use Unicode" involves even more
+   ambiguity than "use ASCII" did decades ago.
+
+   Unicode identifies each character by an integer, called its "code
+   point", in the range 0-0x10ffff.  These integers can be encoded into
+   byte sequences for transmission in at least three standard and
+   generally-recognized encoding forms, all of which are completely
+   defined in The Unicode Standard and the documents cited below:
+
+   o  UTF-8 [RFC3629] defines a variable-length encoding that may be
+      applied uniformly to all code points.
+
+   o  UTF-16 [RFC2781] encodes the range of Unicode characters whose
+      code points are less than 65536 straightforwardly as 16-bit
+      integers, and provides a "surrogate" mechanism for encoding larger
+      code points in 32 bits.
+
+   o  UTF-32 (also known as UCS-4) simply encodes each code point as a
+      32-bit integer.
+
+   Older forms and nomenclature, such as the 16-bit UCS-2, are now
+   strongly discouraged.
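+
+   As a rough illustration of how the three forms differ in size, the
+   following Python sketch reports how many octets a single code point
+   occupies in each of them.  The function name encoded_lengths is
+   hypothetical and the sample code points are arbitrary; the big-endian
+   codec names are used only so that Python does not prepend a byte
+   order mark.
+
```python
def encoded_lengths(code_point: int) -> dict:
    """Return the number of octets one code point needs in each form."""
    ch = chr(code_point)
    return {
        "utf-8": len(ch.encode("utf-8")),       # 1 to 4 octets
        "utf-16": len(ch.encode("utf-16-be")),  # 2, or 4 via a surrogate pair
        "utf-32": len(ch.encode("utf-32-be")),  # always 4
    }


if __name__ == "__main__":
    # Sample code points chosen only for illustration.
    for cp in (0x41, 0xE0, 0x20AC, 0x1F600):
        print(f"U+{cp:04X}", encoded_lengths(cp))
```
+
+   Only UTF-8 leaves ASCII characters as single octets, which is one of
+   the compatibility properties that motivates its choice below.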
+
+   As with ASCII, any of these forms may be used with different line-
+   ending conventions.  That flexibility can be an additional source of
+   confusion with, e.g., index (offset) references into documents based
+   on character counts.
+
+
+
+Klensin & Padlipsky         Standards Track                     [Page 2]
+
+RFC 5198                    Network Unicode                   March 2008
+
+
+   This document proposes to establish "Net-Unicode" as a new
+   standardized text transmission form for the Internet, to serve as an
+   internationalized alternative for NVT ASCII when specified in new --
+   and, where appropriate, updated -- protocols.  UTF-8 [RFC3629] is
+   chosen for the coding because it has good compatibility properties
+   with ASCII and for other reasons discussed in the existing IETF
+   character set policy [RFC2277].  "Net-Unicode" is specified in
+   Section 2; the subsequent sections of the document provide background
+   and explanation.
+
+   Whenever there is a choice, Unicode SHOULD be used with the text
+   encoding specified here.  This combination is preferred to the
+   double-byte encoding of "extended ASCII" [RFC0698] or the assorted
+   per-language or per-country character coding systems.
+
+1.2.  Terminology
+
+   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+   document are to be interpreted as described in [RFC2119].
+
+2.  Net-Unicode Definition
+
+   The Network Unicode format (Net-Unicode) is defined as follows.
+   Parts of this definition are deliberately informal, providing
+   guidance for specific profiles or rules in the protocols that
+   reference this one rather than firm rules that apply globally.
+
+   1.  Characters MUST be encoded in UTF-8 as defined in [RFC3629].
+
+   2.  If the protocol has the concept of "lines", line-endings MUST be
+       indicated by the sequence Carriage-Return (CR, U+000D) followed
+       by Line-Feed (LF, U+000A), often known just as CRLF.  CR SHOULD
+       NOT appear except when followed by LF.  The only other context
+       in which CR is permitted is the combination CR NUL, which is not
+       recommended (see the note at the end of this section).
+
+   3.  The control characters in the ASCII range (U+0000 to U+001F and
+       U+007F) SHOULD generally be avoided.  Space (SP, U+0020), CR,
+       LF, and Form Feed (FF, U+000C) are exceptions to this principle,
+       but use of all but the first requires care as discussed
+       elsewhere in this document.  The so-called "C1 Controls" (U+0080
+       through U+009F), which did not appear in ASCII, MUST NOT appear.
+
+       FF should be used only with caution: it does not have a standard
+       and universal interpretation and, in particular, if its use
+
+
+
+Klensin & Padlipsky         Standards Track                     [Page 3]
+
+RFC 5198                    Network Unicode                   March 2008
+
+
+       assumes a page length, such assumptions may not be appropriate in
+       international contexts (e.g., considering 8.5x11 inch paper
+       versus A4).  Other control characters are used to affect display
+       format, control devices, or to structure files.  None of those
+       uses is appropriate for streams of plain text.
+
+   4.  Before transmission, all character sequences SHOULD be normalized
+       according to Unicode normalization form "NFC" (see Section 3).
+
+   5.  As suggested in Section 6 of RFC 3629, the Byte Order Mark
+       ("BOM") signature MUST NOT appear at the beginning of these text
+       strings.
+
+   6.  Systems conforming to this specification MUST NOT transmit any
+       string containing any code point that is unassigned in the
+       version of Unicode on which they are dependent.  The version of
+       NFC and the version of Unicode used by that system MUST be
+       consistent.
+
+   The use of LF without CR is questionable; see Appendix B for more
+   discussion.  The newer control characters IND (U+0084) and NEL ("Next
+   Line", U+0085) might have been used to disambiguate the various line-
+   ending situations, but, because their use has not been established on
+   the Internet, because many protocols require CRLF, and because IND
+   and NEL fall within the "C1 Controls" group (see above), they MUST
+   NOT be used.  Similar observations apply to the yet newer line and
+   paragraph separators at U+2028 and U+2029 and any future characters
+   that might be defined to serve these functions.  For this
+   specification and protocols that depend on it, lines end in CRLF and
+   only in CRLF.  Anything that does not end in CRLF is either not a
+   line or is severely malformed.
+
+   The NVT specification contained a number of additional provisions,
+   e.g., for the optional use of backspacing and "bare CR" (sent as CR
+   NUL) to generate overstruck character sequences.  The much greater
+   number of precomposed characters in Unicode, the availability of
+   combining characters, and the growing use of markup conventions of
+   various types to show, e.g., emphasis (rather than attempting to do
+   that via the use of special characters), should make such sequences
+   largely unnecessary.  These sequences SHOULD be avoided if at all
+   possible.  However, because they were optional in NVT applications
+   and this specification is an NVT superset, they cannot be prohibited
+   entirely.  The most important of these rules is that CR MUST NOT
+   appear unless it is immediately followed by LF (indicating end of
+   line) or NUL.  Because NUL (an octet whose value is all zeros, i.e.,
+   %x00 in the notation of [RFC5234]) is hostile to programming
+   languages that use that character as a string delimiter, the CR NUL
+   sequence SHOULD be avoided for that reason as well.
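+
+   The rules of this section can be sketched as a simple checker.  The
+   Python below is a minimal illustration, not a conformance test: the
+   function name net_unicode_problems is hypothetical, the Unicode
+   version is whatever the local unicodedata module ships with (an
+   assumption standing in for "the version of Unicode on which they are
+   dependent"), and the SHOULD-level advice about C0 controls and bare
+   LF is deliberately not checked.
+
```python
import unicodedata

BOM = "\ufeff"


def net_unicode_problems(data: bytes) -> list:
    """Return human-readable rule violations found in data (empty if none)."""
    try:
        text = data.decode("utf-8")  # strict decoding rejects invalid UTF-8 (rule 1)
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc.reason}"]

    problems = []
    if text.startswith(BOM):
        problems.append("starts with a BOM (rule 5)")
    for i, ch in enumerate(text):
        cp = ord(ch)
        # CR is tolerated only when immediately followed by LF or NUL.
        if ch == "\r" and text[i + 1:i + 2] not in ("\n", "\x00"):
            problems.append(f"CR at index {i} not followed by LF or NUL (rule 2)")
        if 0x80 <= cp <= 0x9F:
            problems.append(f"C1 control U+{cp:04X} at index {i} (rule 3)")
        # General category "Cn" means unassigned in the local Unicode tables.
        if unicodedata.category(ch) == "Cn":
            problems.append(f"unassigned U+{cp:04X} at index {i} (rule 6)")
    if not unicodedata.is_normalized("NFC", text):  # requires Python 3.8+
        problems.append("not in NFC (rule 4, a SHOULD)")
    return problems


if __name__ == "__main__":
    print(net_unicode_problems("first line\r\nsecond line\r\n".encode("utf-8")))
    print(net_unicode_problems(b"bare\rCR"))
```
+
+   A sender would run such a check before transmission; a receiver, per
+   Section 3, would more plausibly log the NFC finding than reject the
+   text.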
+
+
+
+Klensin & Padlipsky         Standards Track                     [Page 4]
+
+RFC 5198                    Network Unicode                   March 2008
+
+
+3.  Normalization
+
+   There are cases where strings of Unicode are fundamentally
+   equivalent, essentially representing the same text.  These are called
+   "canonical equivalents" in the Unicode Standard.  For example, the
+   following pairs of strings are canonically equivalent:
+
+   U+2126 OHM SIGN
+   U+03A9 GREEK CAPITAL LETTER OMEGA
+
+   U+0061 LATIN SMALL LETTER A, U+0300 COMBINING GRAVE ACCENT
+   U+00E0 LATIN SMALL LETTER A WITH GRAVE
+
+   Comparison of strings becomes much easier if any such cases are
+   always represented by a single unique form.  The Unicode Consortium
+   specifies a normalization form, known as NFC [NFC], which provides
+   the necessary mappings and mechanisms to convert all canonically
+   equivalent sequences to a single unique form.  Typically, this form
+   produces precomposed characters for any sequences that can be
+   represented in that fashion.  It also reorders other combining marks
+   so that they have a unique and unambiguous order.
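+
+   The two canonically equivalent pairs listed above can be checked
+   with any library that implements NFC.  The sketch below uses
+   Python's standard unicodedata module; the helper name nfc is
+   hypothetical, and the third check (two orderings of the same
+   combining marks) is an added illustration of the reordering
+   behavior, not an example taken from this document.
+
```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize text to Unicode normalization form NFC."""
    return unicodedata.normalize("NFC", text)


# OHM SIGN normalizes to GREEK CAPITAL LETTER OMEGA.
assert nfc("\u2126") == "\u03a9"

# "a" followed by COMBINING GRAVE ACCENT composes to the precomposed letter.
assert nfc("a\u0300") == "\u00e0"

# Combining marks of different classes (acute above, dot below) end up in
# one canonical order, whichever order they were typed in.
assert nfc("a\u0301\u0323") == nfc("a\u0323\u0301")
```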
+
+   Of the various normalization forms defined as part of Unicode, NFC is
+   closest to actual use in practice, minimizes side-effects due to
+   considering characters equivalent that may not be equivalent in all
+   situations, and typically requires the least work when converting
+   from non-Unicode encodings.
+
+   Section 2 requires that, except in very unusual circumstances, all
+   Net-Unicode strings be transmitted in normalized form.  Because some
+   application implementations rely on operating system libraries over
+   which they have little control, and in keeping with the robustness
+   principle, receivers of such strings should be prepared to receive
+   unnormalized ones and should not react to them in excessive ways.
+
+4.  Versions of Unicode
+
+   Unicode changes and expands over time.  Large blocks of space are
+   reserved for future expansion.  New versions, which appear at regular
+   intervals, add new scripts and characters.  Occasionally they also
+   change some property definitions.  In retrospect, one of the
+   advantages of ASCII [ASCII] when it was chosen was that the code
+   space was full when the Standard was first published.  There was no
+   practical way to add characters or change code point assignments
+   without being obviously incompatible.
+
+
+
+
+
+Klensin & Padlipsky         Standards Track                     [Page 5]
+
+RFC 5198                    Network Unicode                   March 2008
+
+
+   While there are some security issues if people deliberately try to
+   trick the system (see Section 6), Unicode version changes should not
+   have a significant impact on the text stream specification of this
+   document for the following reasons:
+
+   o  The transformation between Unicode code table positions and the
+      corresponding UTF-8 code is algorithmic; it does not depend on
+      whether a code point has been assigned or not.
+
+   o  The normalization recommended here, NFC (see Section 3), performs
+      a very limited set of mappings, much more limited than those of
+      the more extensive NFKC used in, e.g., Nameprep [RFC3491].
+
+   The NFC tables may be updated over time as new characters are added,
+   but the Unicode Consortium has guaranteed the stability of all NFC
+   strings.  That is, if a string does not contain any unassigned
+   characters, and it is normalized according to NFC, it will always be
+   normalized according to all future versions of the Unicode Standard.
+   The stability of the Net-Unicode format is thus guaranteed when any
+   implementation that converts text into Net-Unicode format does not
+   permit unassigned characters.
+
+   Because Unicode code points that are reserved for private use do not
+   have standard definitions or normalization interpretations, they
+   SHOULD be avoided in strings intended for Internet interchange.
+
+   Were Unicode to be changed in a way that violated these assumptions,
+   i.e., that either invalidated the byte string order specified in RFC
+   3629 or that changed the stability of NFC as stated above, this
+   specification would not apply.  Put differently, this specification
+   applies only to versions of Unicode starting with version 5.0 and
+   extending to, but not including, any version for which changes are
+   made in either the UTF-8 definition or to NFC stability.  Such
+   changes would violate established Unicode policies and are hence
+   unlikely, but, should they occur, it would be necessary to evaluate
+   them for compatibility with this specification and other Internet
+   uses of NFC.
+
+   If the specification of a protocol references this one, strings that
+   are received by that protocol and that appear to be UTF-8 and are not
+   otherwise identified (e.g., by charset labeling) SHOULD be treated as
+   using UTF-8 in conformance with this specification.
+
+
+
+
+
+
+
+
+
+Klensin & Padlipsky         Standards Track                     [Page 6]
+
+RFC 5198                    Network Unicode                   March 2008
+
+
+5.  Applicability and Stability of this Specification
+
+5.1.  Use in IETF Applications Specifications
+
+   During the development of this specification, there was some
+   confusion about where it would be useful given that, e.g., the
+   individual MIME media types used in email and with HTTP have their
+   own rules about UTF-8 character types and normalization, and the
+   application transport protocols impose their own conventions about
+   line endings.  There are three answers.  The first is that, in
+   retrospect, it would have been better to have those protocols and
+   content types standardized in the way specified here, even though it
+   is certainly too late to change them at this time.  The second is
+   that we have several protocols that are dependent on either the
+   original Telnet design or other arrangements requiring a standard,
+   interoperable, string definition without specific content-labels of
+   one sort or another.  Whois [RFC3912] is an example member of this
+   group.  As consideration is given to upgrading them for non-ASCII
+   use, this specification offers a normative reference with the same
+   stability that NVT has provided for the ASCII forms.  This
+   specification is intended for use by other specifications that have
+   not yet defined how to use Unicode.  The third is that having a
+   preferred standard Internet definition for Unicode text streams --
+   rather than just one for transmission codings -- may help improve the
+   specification and interoperability of protocols to be developed in
+   the future.  This specification is not intended for use with
+   specifications that already allow the use of UTF-8 and precisely
+   define that use.
+
+5.2.  Unicode Versions and Applicability
+
+   The IETF faces a practical dilemma with regard to versions of
+   Unicode.  Each new version brings with it new characters and
+   sometimes new combining characters.  Version 5.0 introduces the new
+   concept of sequences of characters named as if they were individual
+   characters (see [NamedSequences]).  The normalization represented by
+   NFC is stable if all strings are transmitted and stored in normalized
+   form, if corrections are never made to character definitions or
+   normalization tables, and if unassigned code points are never used.
+   The last condition is important because an unassigned code point
+   always normalizes to itself.  However, if the same code point is
+   assigned to a character in a future version, it may participate in
+   some other normalization mapping (some specific difficulties in this
+   regard are discussed in [RFC4690]).  It is worth noting that
+   transmission in normalized form is not required either by the IETF's
+   UTF-8 Standard [RFC3629] or by standards dependent on the current
+   version of Stringprep [RFC3454].
+
+
+
+
+
+Klensin & Padlipsky         Standards Track                     [Page 7]
+
+RFC 5198                    Network Unicode                   March 2008
+
+
+   All would be well with this as described in Section 4 except for one
+   problem: Applications typically do not perform their own conversions
+   to Unicode and may not perform their own normalizations but instead
+   rely on operating system or language library functions -- functions
+   that may be upgraded or otherwise changed without changes to the
+   application code itself.  Consequently, there may be no plausible way
+   for an application to know which version of Unicode, or which version
+   of the normalization procedures, it is utilizing, nor is there any
+   way by which it can guarantee that the two will be consistent.
+
+   Because of per-version changes in definitions and tables, Stringprep
+   and documents depending on it are now tied to Unicode Version 3.2
+   [Unicode32], and full interoperability of Internet Standard UTF-8
+   [RFC3629], when used with normalization as specified here, is
+   dependent on normalization definitions and the definition of UTF-8
+   itself not changing after Unicode Version 5.0.  These assumptions
+   seem fairly safe, but they are still assumptions.  Rather than being
+   linked to the latest available version of Unicode, version 5.0
+   [Unicode], or to broader concepts of version independence based on
+   specific assumptions and conditions, this specification could
+   reasonably have been tied, like Stringprep and Nameprep, to Unicode
+   3.2 [Unicode32] or to some more recent intermediate version.
+   However, in addition to the obvious disadvantages of having different
+   IETF standards tied to different versions of Unicode, the
+   library-based application implementation behavior described above
+   makes these version linkages nearly meaningless in practice.
+
+   In theory, one can get around this problem in four ways:
+
+   1.  Freeze on a particular version of Unicode and try to insist that
+       applications enforce that version by, e.g., containing lists of
+       unassigned characters and prohibiting their use.  Of course, this
+       would prohibit evolution to include newly-added scripts and the
+       tables of unassigned code points would be cumbersome.
+
+   2.  Require that every Unicode "text" string or file start with a
+       version indication, somewhat akin to the "byte order mark"
+       indicator.  It is unlikely that this provision would be
+       practical.  More important, it would require that each
+       application implementation be prepared to either support multiple
+       normalization tables and versions or that it reject text from
+       Unicode versions with which it was not prepared to deal.
+
+   3.  Devise a different set of normalization rules that would, e.g.,
+       guarantee that no character assigned to a previously-unassigned
+       code point in Unicode was ever normalized to anything but itself
+       and use those rules instead of NFC.  It is not clear whether or
+       not such a set of rules is possible or whether some other
+
+
+
+Klensin & Padlipsky         Standards Track                     [Page 8]
+
+RFC 5198                    Network Unicode                   March 2008
+
+
+       completely stable set of rules could be devised, perhaps in
+       combination with restrictions on the ways in which characters
+       were added in future versions of Unicode.
+
+   4.  Devise a normalization process that is otherwise equivalent to
+       NFC but that rejects code points that are unassigned in the
+       current version of Unicode, rather than mapping those code points
+       to themselves.  This would still leave some risk of incompatible
+       corrections in Unicode and possibly a few edge cases, but it is
+       probably stable enough for Internet use in the overwhelming
+       number of cases.  This process has been discussed in the Unicode
+       Consortium under the name "Stable NFC".
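+
+   The fourth option can be approximated in a few lines.  The Python
+   below is a sketch in the spirit of that option, not the Unicode
+   Consortium's definition of "Stable NFC": the names
+   UnassignedCodePoint and reject_unassigned_then_nfc are hypothetical,
+   and "unassigned" is taken to mean general category Cn in whatever
+   Unicode version the local unicodedata module implements.
+
```python
import unicodedata


class UnassignedCodePoint(ValueError):
    """Raised when text contains a code point unassigned in the local tables."""


def reject_unassigned_then_nfc(text: str) -> str:
    """Refuse unassigned code points, then normalize the rest to NFC."""
    for ch in text:
        if unicodedata.category(ch) == "Cn":
            raise UnassignedCodePoint(
                f"U+{ord(ch):04X} is unassigned in Unicode "
                f"{unicodedata.unidata_version}"
            )
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(repr(reject_unassigned_then_nfc("a\u0300")))
    try:
        reject_unassigned_then_nfc("\uffff")  # permanently category Cn
    except UnassignedCodePoint as exc:
        print("rejected:", exc)
```
+
+   Rejecting rather than passing through unassigned code points means
+   that a string accepted today cannot acquire a different normalized
+   form when a later Unicode version assigns one of its code points.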
+
+   None of these approaches seems ideal: the ideal procedure would be as
+   stable and predictable as ASCII has been.  But that level is simply
+   not feasible as long as Unicode continues to evolve by the addition
+   of new code points and scripts.  The fourth option listed above
+   appears to be a reasonable compromise.
+
+6.  Security Considerations
+
+   This specification provides a standard form for the use of Unicode as
+   "network text".  Most of the same security issues that apply to
+   UTF-8, as discussed in [RFC3629], apply to it, although it should be
+   slightly less subject to some risks by virtue of requiring NFC
+   normalization and generally being somewhat more restrictive.
+   However, shifts in Unicode versions, as discussed in Section 5.2, may
+   introduce other security issues.
+
+   Programs that receive these streams should use extreme caution about
+   assuming that incoming data are normalized, since it might be
+   possible to use unnormalized forms, as well as invalid UTF-8, as part
+   of an attack.  In particular, firewalls and other systems that
+   interpret UTF-8 streams should be developed with the clear knowledge
+   that an attacker may deliberately send unnormalized text, for
+   instance, to avoid detection by naive text-matching systems.
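+
+   The evasion described above is easy to reproduce.  In the Python
+   sketch below, the function names and the sample strings are
+   hypothetical; it shows a filter that compares raw strings missing a
+   decomposed spelling of a blocked word, and the same filter catching
+   it once both sides are normalized to NFC first.
+
```python
import unicodedata


def naive_contains(haystack: str, needle: str) -> bool:
    """Compare code points as received, with no normalization."""
    return needle in haystack


def normalized_contains(haystack: str, needle: str) -> bool:
    """Normalize both sides to NFC before comparing."""
    return (unicodedata.normalize("NFC", needle)
            in unicodedata.normalize("NFC", haystack))


blocked = "caf\u00e9"          # precomposed e with acute (NFC form)
incoming = "cafe\u0301 menu"   # same text, sent with a combining acute

assert not naive_contains(incoming, blocked)   # the naive filter is evaded
assert normalized_contains(incoming, blocked)  # normalizing first catches it
```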
+
+   NVT contains a requirement, of necessity repeated here (see
+   Section 2), that the CR character be immediately followed by either
+   LF or ASCII NUL (an octet with all bits zero).  NUL may be
+   problematic for some programming languages that use it as a string
+   terminator, and hence a trap for the unwary, unless caution is used.
+   This may be an additional reason to avoid the use of CR entirely,
+   except in sequence with LF, as suggested above.
+
+   The discussion about Unicode versions above (see Section 4 and
+   Section 5.2) makes several assumptions about future versions of
+   Unicode, about NFC normalization being applied properly, and about
+
+
+
+Klensin & Padlipsky         Standards Track                     [Page 9]
+
+RFC 5198                    Network Unicode                   March 2008
+
+
+   UTF-8 being processed and transmitted exactly as specified in RFC
+   3629.  If any of those assumptions are not correct, then there are
+   cases in which strings that would be considered equivalent do not
+   compare equal.  Robust code should be prepared for those
+   possibilities.
+
+7.  Acknowledgments
+
+   Many thanks to Mark Davis, Martin Duerst, and Michel Suignard for
+   suggestions about Unicode normalization that led to the format
+   described here, and especially to Mark for providing the paragraphs
+   that describe the role of NFC.  Thanks also to Mark, Doug Ewell, and
+   Asmus Freytag for corrected text describing Unicode transmission
+   forms, and to Tim Bray, Carsten Bormann, Stephane Bortzmeyer, Martin
+   Duerst, Frank Ellermann, Clive D.W. Feather, Ted Hardie, Bjoern
+   Hoehrmann, Alfred Hoenes, Kent Karlsson, Bill McQuillan, George
+   Michaelson, Chris Newman, and Marcos Sanz for a number of helpful
+   comments and clarification requests.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Klensin & Padlipsky         Standards Track                    [Page 10]
+
+RFC 5198                    Network Unicode                   March 2008
+
+
+Appendix A.  History and Context
+
+   This appendix contains a review of prior work in the ARPANET and
+   Internet to establish a standard text type, work that establishes the
+   context and motivation for the approach taken in this document.  The
+   text is explanatory rather than normative: nothing in this section is
+   intended to change or update any current specification.  Those who
+   are uninterested in this review and analysis can safely skip this
+   section.
+
+   One of the earlier application design decisions made in the
+   development of ARPANET, a decision that was carried forward into the
+   Internet, was the decision to standardize on a single and very
+   specific coding for "text" to be passed across the network [RFC0020].
+   Hosts on the network were then responsible for translating or mapping
+   from whatever character coding conventions were used locally to that
+   common intermediate representation, with sending hosts mapping to it
+   and receiving ones mapping from it to their local forms as needed.
+   It is interesting to note that at the time the ARPANET was being
+   developed, participating host operating systems used at least three
+   different character coding standards: the antiquated BCD (Binary
+   Coded Decimal), the then-dominant major manufacturer-backed EBCDIC
+   (Extended BCD Interchange Code), and the then-still emerging ASCII
+   (American Standard Code for Information Interchange).  Since the
+   ARPANET was an "open" project and EBCDIC was intimately linked to a
+   particular hardware vendor, the original Network Working Group agreed
+   that its standard should be ASCII.  That ASCII form was precisely
+   "7-bit ASCII in an 8-bit field", which was in effect a compromise
+   between hosts that were natively 7-bit oriented (e.g., with five
+   seven-bit characters in a 36-bit word), those that were 8-bit
+   oriented (using eight-bit characters) and those that placed the
+   seven-bit ASCII characters in 9-bit fields with two leading zero bits
+   (four characters in a 36-bit word).
+
+   More standardization was suggested in the first preliminary
+   description of the Telnet protocol [RFC0097].  With the iterations of
+   that protocol [RFC0137] [RFC0139] and the drawing together of an
+   essentially formal definition somewhat later [RFC0318], a standard
+   abstraction, the Network Virtual Terminal (NVT) was established.  NVT
+   character-coding conventions (initially called "Telnet ASCII" and
+   later called "NVT ASCII", or, more casually, "network ASCII")
+   included the requirement that Carriage Return followed by Line Feed
+   (CRLF) be the common representation for ending lines of text (given
+   that some participating "Host" operating systems used the one
+   natively, some the other, at least one used both, and a few used
+   neither, preferring variable-length lines with counts or special
+   delimiters or markers instead) and specified conventions for some
+   other characters.  Also, since NVT ASCII was restricted to seven-bit
+
+
+
+Klensin & Padlipsky         Standards Track                    [Page 11]
+
+RFC 5198                    Network Unicode                   March 2008
+
+
+   characters, use of the high-order bit in octets was reserved for the
+   transmission of control signaling information.
+
+   At a very high level, the concept was that a system could use
+   whatever character coding and line representations were appropriate
+   locally, but text transmitted over the network as text must conform
+   to the single "network virtual terminal" convention.  Virtually all
+   early Internet protocols that presume transfer of "text" assume this
+   virtual terminal model, although different ones assume or limit it in
+   different ways.  Telnet, the command stream and ASCII Type in FTP
+   [RFC0542], the message stream in SMTP transfer [RFC2821], and the
+   strings passed to finger [RFC0742] and whois [RFC0954] are the
+   classic examples.  More recently, HTTP [RFC1945] [RFC2616] follows
+   the same general model but permits 8-bit data and leaves the line end
+   sequence unspecified (the latter has been the source of a significant
+   number of problems).
+
+Appendix B.  The ASCII NVT Definition
+
+   The main body of this specification is intended as an update to, and
+   internationalized version of, the Net-ASCII definition.  The
+   specification is self-contained in that parts of the Net-ASCII
+   definition that are no longer recommended are not included above.
+   Because Net-ASCII evolved somewhat over time and there has been
+   debate about which specification is the "official" Net-ASCII, it is
+   appropriate to review the key elements of that definition here.  This
+   review is informal with regard to the contents of Net-ASCII and
+   should not be considered as a normative update or summary of the
+   earlier specifications (Section 2 does specify some normative updates
+   to those specifications and some comments below are consistent with
+   it).
+
+   The first part of the section titled "THE NVT PRINTER AND KEYBOARD"
+   in RFC 854 [RFC0854] is generally, although not universally,
+   considered to be the normative definition of the (ASCII) Network
+   Virtual Terminal and hence of Net-ASCII.  It includes not only the
+   graphic ASCII characters but a number of control characters.  The
+   latter are given Internet-specific meanings that are often more
+   specific than the definitions in the ASCII specification.  In today's
+   usage, and for the present specification, the following
+   clarifications and updates to that list should be noted.  Each one is
+   accompanied by a brief explanation of the reason why the original
+   specification is no longer appropriate.
+
+   1.  The "defined but not required" codes -- BEL (U+0007), BS
+       (U+0008), HT (U+0009), VT (U+000B), and FF (U+000C) -- and the
+       undefined control codes ("C0") SHOULD NOT be used unless required
+       by exceptional circumstances.  Either their original "network
+
+
+
+Klensin & Padlipsky         Standards Track                    [Page 12]
+
+RFC 5198                    Network Unicode                   March 2008
+
+
+       printer" definitions are no longer in general use, common
+       practice has evolved away from the formats specified there, or
+       their use to simulate characters that are better handled by
+       Unicode is no longer appropriate.  While the appearance of some
+       of these characters on the list may seem surprising, BS now has
+       an ambiguous interpretation in practice (erasing in some systems
+       but not in others), the width associated with HT varies with the
+       environment, and VT and FF do not have a uniform effect with
+       regard to either vertical positioning or the associated
+       horizontal position result.  Of course, telnet escapes are not
+       considered part of the data stream and hence are unaffected by
+       this provision.
+
+   2.  In Net-ASCII, CR MUST NOT appear except when immediately followed
+       by either NUL or LF, with the latter (CR LF) designating the "new
+       line" function.  Today and as specified above, CR should
+       generally appear only when followed by LF.  Because page layout
+       is better done in other ways, because NUL has a special
+       interpretation in some programming languages, and to avoid other
+       types of confusion, CR NUL should preferably be avoided as
+       specified above.
+
+   3.  LF CR SHOULD NOT appear except as a side-effect of multiple CR LF
+       sequences (e.g., CR LF CR LF).
+
+   4.  The historical NVT documents do not call out either "bare LF" (LF
+       without CR) or HT for special treatment.  Both have generally
+       been understood to be problematic.  In the case of LF, there is a
+       difference in interpretation as to whether its semantics imply
+       "go to same position on the next line" or "go to the first
+       position on the next line" and interoperability considerations
+       suggest not depending on which interpretation the receiver
+       applies.  At the same time, misinterpretation of LF is less
+       harmful than misinterpretation of "bare" CR: in the CR case, text
+       may be erased or made completely unreadable; in the LF one, the
+       worst consequence is a very funny-looking display.  Obviously, HT
+       is problematic because there is no standard way to transmit
+       intended tab position or width information in running text.
+       Again, the harm is unlikely to be great if HT is simply
+       interpreted as one or more spaces, but, in general, it cannot be
+       relied upon to format information.
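+
+   The four points above can be sketched as a small checker.  This is
+   an informal illustration in Python, not part of the specification.
+   The function name and message strings are hypothetical, and treating
+   NUL, LF, and CR as the only control codes with a required NVT role
+   is an assumption drawn from RFC 854.

```python
# Codes that item 1 calls "defined but not required".
DISCOURAGED = {0x07: "BEL", 0x08: "BS", 0x09: "HT", 0x0B: "VT", 0x0C: "FF"}
# C0 controls assumed to keep a defined NVT role (per RFC 854): NUL, LF, CR.
NVT_DEFINED = {0x00, 0x0A, 0x0D}


def check_net_ascii_controls(text):
    """Return a list of (index, message) findings for `text` (hypothetical helper)."""
    findings = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        nxt = text[i + 1] if i + 1 < len(text) else ""
        prev = text[i - 1] if i > 0 else ""
        if cp in DISCOURAGED:
            # Item 1: BEL, BS, HT, VT, FF.
            findings.append((i, f"{DISCOURAGED[cp]} SHOULD NOT be used"))
        elif cp < 0x20 and cp not in NVT_DEFINED:
            # Item 1: the remaining ("undefined") C0 controls.
            findings.append((i, f"undefined C0 control U+{cp:04X}"))
        elif ch == "\r":
            # Item 2: CR is expected only as part of CR LF.
            if nxt == "\x00":
                findings.append((i, "CR NUL should be avoided"))
            elif nxt != "\n":
                findings.append((i, "bare CR"))
        elif ch == "\n" and prev != "\r":
            # Item 4: LF without a preceding CR.
            findings.append((i, "bare LF"))
    return findings
```

+   A stream that uses only CR LF line endings yields an empty list.  An
+   LF CR pair outside a run such as CR LF CR LF (item 3) is reported as
+   a bare LF followed by a bare CR.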
+
+   It is worth noting that the telnet IAC character (an octet consisting
+   of all ones, i.e., %xFF) itself is not a problem for UTF-8 since that
+   particular octet cannot appear in a valid UTF-8 string.  However,
+   while few of them have been used, telnet permits other command-
+   introducer characters whose bit sequences in an octet may be part of
+   valid UTF-8 characters.  While it causes no ambiguity in UTF-8,
+   Unicode assigns a graphic character ("Latin Small Letter Y with
+   Diaeresis") to U+00FF (octets C3 BF in UTF-8).  Some caution is
+   clearly in order in this area.
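+
+   The claim about the IAC octet can be checked mechanically.  The
+   following Python sketch is illustrative only (the function and
+   variable names are hypothetical): it encodes every Unicode scalar
+   value as UTF-8 and records which octet values occur.

```python
IAC = 0xFF  # telnet "Interpret As Command" octet


def utf8_octets_in_use():
    """Collect every octet value that occurs in the UTF-8 form of some scalar value."""
    seen = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not scalar values and have no UTF-8 form
        seen.update(chr(cp).encode("utf-8"))
    return seen


octets = utf8_octets_in_use()

# The code point U+00FF is a different thing from the octet 0xFF:
# in UTF-8 it is carried as a two-octet sequence.
y_diaeresis = "\u00ff".encode("utf-8")
```

+   Under these assumptions, IAC is absent from the collected set, while
+   octets in the range F0 to F4 do occur as the lead octets of
+   four-octet sequences, so high octet values other than FF cannot be
+   assumed to be free for other uses.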
+
+Appendix C.  The Line-Ending Problem
+
+   The definition of how a line ending should be denoted in plain text
+   strings on the wire for the Internet has been controversial from even
+   before the introduction of NVT.  Some have argued that recipients
+   should be required to interpret almost anything that a sender might
+   intend as a line ending as actually a line ending.  Others have
+   pointed out that this would lead to some ambiguities of
+   interpretation and presentation and would violate the principle that
+   we should minimize the number of forms that are permitted on the wire
+   in order to promote interoperability and eliminate the "every
+   recipient needs to understand every sender format" problem.  The
+   design of this specification, like that of NVT, takes the latter
+   approach.  Its designers believe that there is little point in a
+   standard if it is to specify "anyone can do whatever they like and
+   the receiver just needs to cope".
+
+   A further discussion of the nature and evolution of the line-ending
+   problem appears in Section 5.8 of the Unicode Standard [Unicode] and
+   is suggested for additional reading.  If we were starting with the
+   Internet today, it would probably be sensible to follow the
+   recommendation there and use LS (U+2028) exclusively, in preference
+   to CRLF.  However, the installed base of use of CRLF and the
+   importance of forward compatibility with NVT and protocols that
+   assume it make that impossible, so it is necessary to continue using
+   CRLF as the "New Line Function" ("NLF", see the terminology section
+   in that reference).
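+
+   As a minimal sender-side sketch in Python, assuming that the local
+   text marks line ends with LF alone (a common but not universal
+   convention), bare LF can be rewritten as CRLF before transmission.
+   The helper name is hypothetical, and other local conventions would
+   need their own handling.

```python
import re

# Matches an LF that is not already preceded by CR.
_BARE_LF = re.compile(r"(?<!\r)\n")


def to_wire_line_endings(text):
    """Rewrite bare LF as CR LF; existing CR LF pairs are left untouched."""
    return _BARE_LF.sub("\r\n", text)


# Example: the result is then encoded as UTF-8 for the wire.
wire = to_wire_line_endings("first\nsecond\r\nthird\n").encode("utf-8")
```

+   Doing the conversion at the sender matches the approach described
+   above: one form on the wire, rather than every recipient guessing at
+   every sender's format.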
+
+Appendix D.  A Note about Related Future Work
+
+   Consideration should be given to a Telnet (or SSH [RFC4251]) option
+   to specify this type of stream and an FTP extension [RFC0959] to
+   permit a new "Unicode text" data TYPE.
+
+References
+
+Normative References
+
+   [ISO10646]        International Organization for Standardization,
+                     "Information Technology - Universal Multiple-Octet
+                     Coded Character Set (UCS) - Part 1: Architecture
+                     and Basic Multilingual Plane", ISO/
+                     IEC 10646-1:2000, October 2000.
+
+   [NFC]             Davis, M. and M. Duerst, "Unicode Standard Annex
+                     #15: Unicode Normalization Forms", October 2006,
+                     <http://www.unicode.org/reports/tr15/>.
+
+   [RFC2119]         Bradner, S., "Key words for use in RFCs to Indicate
+                     Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+   [RFC3629]         Yergeau, F., "UTF-8, a transformation format of ISO
+                     10646", STD 63, RFC 3629, November 2003.
+
+   [RFC5234]         Crocker, D. and P. Overell, "Augmented BNF for
+                     Syntax Specifications: ABNF", STD 68, RFC 5234,
+                     January 2008.
+
+   [Unicode]         The Unicode Consortium, "The Unicode Standard,
+                     Version 5.0", 2007.
+
+                     Boston, MA, USA: Addison-Wesley.  ISBN
+                     0-321-48091-0
+
+   [Unicode32]       The Unicode Consortium, "The Unicode Standard,
+                     Version 3.0", 2000.
+
+                     (Reading, MA, Addison-Wesley, 2000.  ISBN 0-201-
+                     61633-5).  Version 3.2 consists of the definition
+                     in that book as amended by the Unicode Standard
+                     Annex #27: Unicode 3.1
+                     (http://www.unicode.org/reports/tr27/) and by the
+                     Unicode Standard Annex #28: Unicode 3.2
+                     (http://www.unicode.org/reports/tr28/).
+
+Informative References
+
+   [ASCII]           American National Standards Institute (formerly
+                     United States of America Standards Institute), "USA
+                     Code for Information Interchange", ANSI X3.4-1968,
+                     1968.
+
+                     ANSI X3.4-1968 has been replaced by newer versions
+                     with slight modifications, but the 1968 version
+                     remains definitive for the Internet.  ISO 646
+                     International Reference Version (IRV)
+                     [ISO.646.1991] is usually considered equivalent to
+                     ASCII.
+
+   [ISO.646.1991]    International Organization for Standardization,
+                     "Information technology - ISO 7-bit coded character
+                     set for information interchange", ISO Standard 646,
+                     1991.
+
+   [NamedSequences]  The Unicode Consortium, "NamedSequences-4.1.0.txt",
+                     2005, <http://www.unicode.org/Public/UNIDATA/
+                     NamedSequences.txt>.
+
+   [RFC0020]         Cerf, V., "ASCII format for network interchange",
+                     RFC 20, October 1969.
+
+   [RFC0097]         Melvin, J. and R. Watson, "First Cut at a Proposed
+                     Telnet Protocol", RFC 97, February 1971.
+
+   [RFC0137]         O'Sullivan, T., "Telnet Protocol - a proposed
+                     document", RFC 137, April 1971.
+
+   [RFC0139]         O'Sullivan, T., "Discussion of Telnet Protocol",
+                     RFC 139, May 1971.
+
+   [RFC0318]         Postel, J., "Telnet Protocols", RFC 318,
+                     April 1972.
+
+   [RFC0542]         Neigus, N., "File Transfer Protocol", RFC 542,
+                     August 1973.
+
+   [RFC0698]         Mock, T., "Telnet extended ASCII option", RFC 698,
+                     July 1975.
+
+   [RFC0742]         Harrenstien, K., "NAME/FINGER Protocol", RFC 742,
+                     December 1977.
+
+   [RFC0854]         Postel, J. and J. Reynolds, "Telnet Protocol
+                     Specification", STD 8, RFC 854, May 1983.
+
+   [RFC0954]         Harrenstien, K., Stahl, M., and E. Feinler,
+                     "NICNAME/WHOIS", RFC 954, October 1985.
+
+   [RFC0959]         Postel, J. and J. Reynolds, "File Transfer
+                     Protocol", STD 9, RFC 959, October 1985.
+
+   [RFC1945]         Berners-Lee, T., Fielding, R., and H. Nielsen,
+                     "Hypertext Transfer Protocol -- HTTP/1.0",
+                     RFC 1945, May 1996.
+
+   [RFC2277]         Alvestrand, H., "IETF Policy on Character Sets and
+                     Languages", BCP 18, RFC 2277, January 1998.
+
+   [RFC2616]         Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
+                     Masinter, L., Leach, P., and T. Berners-Lee,
+                     "Hypertext Transfer Protocol -- HTTP/1.1",
+                     RFC 2616, June 1999.
+
+   [RFC2781]         Hoffman, P. and F. Yergeau, "UTF-16, an encoding of
+                     ISO 10646", RFC 2781, February 2000.
+
+   [RFC2821]         Klensin, J., "Simple Mail Transfer Protocol",
+                     RFC 2821, April 2001.
+
+   [RFC3454]         Hoffman, P. and M. Blanchet, "Preparation of
+                     Internationalized Strings ("stringprep")",
+                     RFC 3454, December 2002.
+
+   [RFC3491]         Hoffman, P. and M. Blanchet, "Nameprep: A
+                     Stringprep Profile for Internationalized Domain
+                     Names (IDN)", RFC 3491, March 2003.
+
+   [RFC3912]         Daigle, L., "WHOIS Protocol Specification",
+                     RFC 3912, September 2004.
+
+   [RFC4251]         Ylonen, T. and C. Lonvick, "The Secure Shell (SSH)
+                     Protocol Architecture", RFC 4251, January 2006.
+
+   [RFC4690]         Klensin, J., Faltstrom, P., Karp, C., and IAB,
+                     "Review and Recommendations for Internationalized
+                     Domain Names (IDNs)", RFC 4690, September 2006.
+
+Authors' Addresses
+
+   John C Klensin
+   1770 Massachusetts Ave, #322
+   Cambridge, MA  02140
+   USA
+
+   Phone: +1 617 491 5735
+   EMail: john-ietf@jck.com
+
+
+   Michael A. Padlipsky
+   8011 Stewart Ave.
+   Los Angeles, CA  90045
+   USA
+
+   Phone: +1 310-670-4288
+   EMail: the.map@alum.mit.edu
+
+Full Copyright Statement
+
+   Copyright (C) The IETF Trust (2008).
+
+   This document is subject to the rights, licenses and restrictions
+   contained in BCP 78, and except as set forth therein, the authors
+   retain all their rights.
+
+   This document and the information contained herein are provided on an
+   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
+   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
+   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
+   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+   The IETF takes no position regarding the validity or scope of any
+   Intellectual Property Rights or other rights that might be claimed to
+   pertain to the implementation or use of the technology described in
+   this document or the extent to which any license under such rights
+   might or might not be available; nor does it represent that it has
+   made any independent effort to identify any such rights.  Information
+   on the procedures with respect to rights in RFC documents can be
+   found in BCP 78 and BCP 79.
+
+   Copies of IPR disclosures made to the IETF Secretariat and any
+   assurances of licenses to be made available, or the result of an
+   attempt made to obtain a general license or permission for the use of
+   such proprietary rights by implementers or users of this
+   specification can be obtained from the IETF on-line IPR repository at
+   http://www.ietf.org/ipr.
+
+   The IETF invites any interested party to bring to its attention any
+   copyrights, patents or patent applications, or other proprietary
+   rights that may cover technology that may be required to implement
+   this standard.  Please address the information to the IETF at
+   ietf-ipr@ietf.org.