From 4bfd864f10b68b71482b35c818559068ef8d5797 Mon Sep 17 00:00:00 2001
From: Thomas Voss <mail@thomasvoss.com>
Date: Wed, 27 Nov 2024 20:54:24 +0100
Subject: doc: Add RFC documents

---
 doc/rfc/rfc3629.txt | 787 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 787 insertions(+)
 create mode 100644 doc/rfc/rfc3629.txt

diff --git a/doc/rfc/rfc3629.txt b/doc/rfc/rfc3629.txt
new file mode 100644
index 0000000..e3070c7
--- /dev/null
+++ b/doc/rfc/rfc3629.txt
@@ -0,0 +1,787 @@
+
+
+
+
+
+
+Network Working Group                                         F. Yergeau
+Request for Comments: 3629                             Alis Technologies
+STD: 63                                                    November 2003
+Obsoletes: 2279
+Category: Standards Track
+
+
+              UTF-8, a transformation format of ISO 10646
+
+Status of this Memo
+
+   This document specifies an Internet standards track protocol for the
+   Internet community, and requests discussion and suggestions for
+   improvements.  Please refer to the current edition of the "Internet
+   Official Protocol Standards" (STD 1) for the standardization state
+   and status of this protocol.  Distribution of this memo is unlimited.
+
+Copyright Notice
+
+   Copyright (C) The Internet Society (2003).  All Rights Reserved.
+
+Abstract
+
+   ISO/IEC 10646-1 defines a large character set called the Universal
+   Character Set (UCS) which encompasses most of the world's writing
+   systems.  The originally proposed encodings of the UCS, however, were
+   not compatible with many current applications and protocols, and this
+   has led to the development of UTF-8, the object of this memo.  UTF-8
+   has the characteristic of preserving the full US-ASCII range,
+   providing compatibility with file systems, parsers and other software
+   that rely on US-ASCII values but are transparent to other values.
+   This memo obsoletes and replaces RFC 2279.
+
+Table of Contents
+
+   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  2
+   2.  Notational conventions . . . . . . . . . . . . . . . . . . . .  3
+   3.  UTF-8 definition . . . . . . . . . . . . . . . . . . . . . . .  4
+   4.  Syntax of UTF-8 Byte Sequences . . . . . . . . . . . . . . . .  5
+   5.  Versions of the standards  . . . . . . . . . . . . . . . . . .  6
+   6.  Byte order mark (BOM)  . . . . . . . . . . . . . . . . . . . .  6
+   7.  Examples . . . . . . . . . . . . . . . . . . . . . . . . . . .  8
+   8.  MIME registration  . . . . . . . . . . . . . . . . . . . . . .  9
+   9.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 10
+   10. Security Considerations  . . . . . . . . . . . . . . . . . . . 10
+   11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 11
+   12. Changes from RFC 2279  . . . . . . . . . . . . . . . . . . . . 11
+   13. Normative References . . . . . . . . . . . . . . . . . . . . . 12
+
+
+
+Yergeau                     Standards Track                     [Page 1]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+   14. Informative References . . . . . . . . . . . . . . . . . . . . 12
+   15. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
+   16. Intellectual Property Statement  . . . . . . . . . . . . . . . 13
+   17. Author's Address . . . . . . . . . . . . . . . . . . . . . . . 13
+   18. Full Copyright Statement . . . . . . . . . . . . . . . . . . . 14
+
+1. Introduction
+
+   ISO/IEC 10646 [ISO.10646] defines a large character set called the
+   Universal Character Set (UCS), which encompasses most of the world's
+   writing systems.  The same set of characters is defined by the
+   Unicode standard [UNICODE], which further defines additional
+   character properties and other application details of great interest
+   to implementers.  Up to the present time, changes in Unicode and
+   amendments and additions to ISO/IEC 10646 have tracked each other, so
+   that the character repertoires and code point assignments have
+   remained in sync.  The relevant standardization committees have
+   committed to maintain this very useful synchronism.
+
+   ISO/IEC 10646 and Unicode define several encoding forms of their
+   common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32.  In an
+   encoding form, each character is represented as one or more encoding
+   units.  All standard UCS encoding forms except UTF-8 have an encoding
+   unit larger than one octet, making them hard to use in many current
+   applications and protocols that assume 8 or even 7 bit characters.
+
+   UTF-8, the object of this memo, has a one-octet encoding unit.  It
+   uses all bits of an octet, but has the quality of preserving the full
+   US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one
+   octet having the normal US-ASCII value, and any octet with such a
+   value can only stand for a US-ASCII character, and nothing else.
+
+   UTF-8 encodes UCS characters as a varying number of octets, where the
+   number of octets, and the value of each, depend on the integer value
+   assigned to the character in ISO/IEC 10646 (the character number,
+   a.k.a. code position, code point or Unicode scalar value).  This
+   encoding form has the following characteristics (all values are in
+   hexadecimal):
+
+   o  Character numbers from U+0000 to U+007F (US-ASCII repertoire)
+      correspond to octets 00 to 7F (7 bit US-ASCII values).  A direct
+      consequence is that a plain ASCII string is also a valid UTF-8
+      string.
+
+
+
+
+
+
+
+
+Yergeau                     Standards Track                     [Page 2]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+   o  US-ASCII octet values do not appear otherwise in a UTF-8 encoded
+      character stream.  This provides compatibility with file systems
+      or other software (e.g., the printf() function in C libraries)
+      that parse based on US-ASCII values but are transparent to other
+      values.
+
+   o  Round-trip conversion is easy between UTF-8 and other encoding
+      forms.
+
+   o  The first octet of a multi-octet sequence indicates the number of
+      octets in the sequence.
+
+   o  The octet values C0, C1, F5 to FF never appear.
+
+   o  Character boundaries are easily found from anywhere in an octet
+      stream.
+
+   o  The byte-value lexicographic sorting order of UTF-8 strings is the
+      same as if ordered by character numbers.  Of course this is of
+      limited interest since a sort order based on character numbers is
+      almost never culturally valid.
+
+   o  The Boyer-Moore fast search algorithm can be used with UTF-8 data.
+
+   o  UTF-8 strings can be fairly reliably recognized as such by a
+      simple algorithm, i.e., the probability that a string of
+      characters in any other encoding appears as valid UTF-8 is low,
+      diminishing with increasing string length.
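As an illustration of the character-boundary property listed above, the sketch below steps backwards over continuation octets (those of the form 10xxxxxx) until it reaches the first octet of a character. It is written in Python, assumes the input is already valid UTF-8, and the name `char_start` is chosen here for illustration only.

```python
def char_start(data: bytes, i: int) -> int:
    """Return the index of the first octet of the character covering data[i].

    Assumes `data` is valid UTF-8; no validation is performed here.
    """
    # Continuation octets look like 10xxxxxx, so mask the two high bits.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i
```

Because a character occupies at most 4 octets, the loop takes at most three steps on valid input.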
+
+   UTF-8 was devised in September 1992 by Ken Thompson, guided by design
+   criteria specified by Rob Pike, with the objective of defining a UCS
+   transformation format usable in the Plan 9 operating system in a
+   non-disruptive manner.  Thompson's design was stewarded through
+   standardization by the X/Open Joint Internationalization Group XOJIG
+   (see [FSS_UTF]), bearing the names FSS-UTF (variant FSS/UTF), UTF-2
+   and finally UTF-8 along the way.
+
+2.  Notational conventions
+
+   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+   document are to be interpreted as described in [RFC2119].
+
+   UCS characters are designated by the U+HHHH notation, where HHHH is a
+   string of from 4 to 6 hexadecimal digits representing the character
+   number in ISO/IEC 10646.
+
+
+
+
+
+Yergeau                     Standards Track                     [Page 3]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+3.  UTF-8 definition
+
+   UTF-8 is defined by the Unicode Standard [UNICODE].  Descriptions and
+   formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646].
+
+   In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
+   accessible range) are encoded using sequences of 1 to 4 octets.  The
+   only octet of a "sequence" of one has the higher-order bit set to 0,
+   the remaining 7 bits being used to encode the character number.  In a
+   sequence of n octets, n>1, the initial octet has the n higher-order
+   bits set to 1, followed by a bit set to 0.  The remaining bit(s) of
+   that octet contain bits from the number of the character to be
+   encoded.  The following octet(s) all have the higher-order bit set to
+   1 and the following bit set to 0, leaving 6 bits in each to contain
+   bits from the character to be encoded.
+
+   The table below summarizes the format of these different octet types.
+   The letter x indicates bits available for encoding bits of the
+   character number.
+
+   Char. number range  |        UTF-8 octet sequence
+      (hexadecimal)    |              (binary)
+   --------------------+---------------------------------------------
+   0000 0000-0000 007F | 0xxxxxxx
+   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
+   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
+   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+
+   Encoding a character to UTF-8 proceeds as follows:
+
+   1.  Determine the number of octets required from the character number
+       and the first column of the table above.  It is important to note
+       that the rows of the table are mutually exclusive, i.e., there is
+       only one valid way to encode a given character.
+
+   2.  Prepare the high-order bits of the octets as per the second
+       column of the table.
+
+   3.  Fill in the bits marked x from the bits of the character number,
+       expressed in binary.  Start by putting the lowest-order bit of
+       the character number in the lowest-order position of the last
+       octet of the sequence, then put the next higher-order bit of the
+       character number in the next higher-order position of that octet,
+       etc.  When the x bits of the last octet are filled in, move on to
+       the next to last octet, then to the preceding one, etc. until all
+       x bits are filled in.
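The three encoding steps can be sketched in Python as follows. This is a minimal, non-normative illustration; the function name `utf8_encode` is an assumption made here. Each branch corresponds to one row of the table, and the sketch refuses character numbers outside U+0000..U+10FFFF as well as U+D800..U+DFFF, which UTF-8 does not permit.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one character number as a UTF-8 octet sequence."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("character number outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("U+D800..U+DFFF cannot be encoded in UTF-8")
    if cp <= 0x7F:                                # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                               # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                              # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),              # 11110xxx + three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

For example, `utf8_encode(0x2262)` yields the octets E2 89 A2.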
+
+
+
+
+
+Yergeau                     Standards Track                     [Page 4]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+   The definition of UTF-8 prohibits encoding character numbers between
+   U+D800 and U+DFFF, which are reserved for use with the UTF-16
+   encoding form (as surrogate pairs) and do not directly represent
+   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
+   to first decode the UTF-16 data to obtain character numbers, which
+   are then encoded in UTF-8 as described above.  This contrasts with
+   CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
+   use on the Internet.  CESU-8 operates similarly to UTF-8 but encodes
+   the UTF-16 code values (16-bit quantities) instead of the character
+   number (code point).  This leads to different results for character
+   numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT
+   valid UTF-8.
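The difference can be made concrete with U+233B4, whose UTF-16 form is the surrogate pair D84C DFB4. The Python sketch below first recovers the character number from the pair and then encodes it; the helper name `combine_surrogates` is an assumption for illustration. The `surrogatepass` error handler of Python's codec is used only to reproduce the CESU-8-style octets for comparison.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Recover a character number from a UTF-16 surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid UTF-16 surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

cp = combine_surrogates(0xD84C, 0xDFB4)      # character number U+233B4
utf8 = chr(cp).encode("utf-8")               # F0 A3 8E B4 (valid UTF-8)

# Encoding each 16-bit code value separately gives the CESU-8-style
# result, which is not valid UTF-8.
cesu8_like = "\ud84c\udfb4".encode("utf-8", "surrogatepass")  # ED A1 8C ED BE B4
```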
+
+   Decoding a UTF-8 character proceeds as follows:
+
+   1.  Initialize a binary number with all bits set to 0.  Up to 21 bits
+       may be needed.
+
+   2.  Determine which bits encode the character number from the number
+       of octets in the sequence and the second column of the table
+       above (the bits marked x).
+
+   3.  Distribute the bits from the sequence to the binary number, first
+       the lower-order bits from the last octet of the sequence and
+       proceeding to the left until no x bits are left.  The binary
+       number is now equal to the character number.
+
+   Implementations of the decoding algorithm above MUST protect against
+   decoding invalid sequences.  For instance, a naive implementation may
+   decode the overlong UTF-8 sequence C0 80 into the character U+0000,
+   or the surrogate pair ED A1 8C ED BE B4 into U+233B4.  Decoding
+   invalid sequences may have security consequences or cause other
+   problems.  See Security Considerations (Section 10) below.
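A decoder that follows the steps above while rejecting invalid sequences might look like the Python sketch below. It is one possible arrangement, not a reference implementation; the name `utf8_decode_one` and the use of `ValueError` are assumptions made here. It rejects the octets C0, C1 and F5 to FF, truncated or malformed tails, overlong forms, surrogates, and values above U+10FFFF.

```python
def utf8_decode_one(data: bytes, i: int = 0):
    """Decode one character at data[i]; return (character number, next index)."""
    b0 = data[i]
    if b0 <= 0x7F:
        return b0, i + 1
    # n = sequence length, cp = x bits of the initial octet,
    # lowest = smallest character number allowed for that length.
    if 0xC2 <= b0 <= 0xDF:
        n, cp, lowest = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        n, cp, lowest = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:
        n, cp, lowest = 4, b0 & 0x07, 0x10000
    else:
        raise ValueError(f"invalid initial octet {b0:02X}")
    tail = data[i + 1:i + n]
    if len(tail) != n - 1:
        raise ValueError("truncated sequence")
    for b in tail:
        if (b & 0xC0) != 0x80:
            raise ValueError(f"invalid continuation octet {b:02X}")
        cp = (cp << 6) | (b & 0x3F)      # append 6 more bits
    if cp < lowest:
        raise ValueError("overlong sequence")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate character number")
    if cp > 0x10FFFF:
        raise ValueError("character number above U+10FFFF")
    return cp, i + n
```

With this sketch, both C0 80 and ED A1 8C ED BE B4 raise an error, while F0 A3 8E B4 decodes to U+233B4.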
+
+4.  Syntax of UTF-8 Byte Sequences
+
+   For the convenience of implementors using ABNF, a definition of UTF-8
+   in ABNF syntax is given here.
+
+   A UTF-8 string is a sequence of octets representing a sequence of UCS
+   characters.  An octet sequence is valid UTF-8 only if it matches the
+   following syntax, which is derived from the rules for encoding UTF-8
+   and is expressed in the ABNF of [RFC2234].
+
+   UTF8-octets = *( UTF8-char )
+   UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
+   UTF8-1      = %x00-7F
+   UTF8-2      = %xC2-DF UTF8-tail
+
+
+
+Yergeau                     Standards Track                     [Page 5]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+   UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
+                 %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
+   UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
+                 %xF4 %x80-8F 2( UTF8-tail )
+   UTF8-tail   = %x80-BF
+
+   NOTE -- The authoritative definition of UTF-8 is in [UNICODE].  This
+   grammar is believed to describe the same thing Unicode describes, but
+   does not claim to be authoritative.  Implementors are urged to rely
+   on the authoritative source, rather than on this ABNF.
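The grammar above can be transcribed almost rule for rule into a regular expression over octets. The Python sketch below does so; the variable names and the function `is_valid_utf8` are assumptions made for illustration, and, like the ABNF itself, the transcription does not claim to be authoritative.

```python
import re

TAIL = rb"[\x80-\xBF]"                                    # UTF8-tail
UTF8_1 = rb"[\x00-\x7F]"
UTF8_2 = rb"[\xC2-\xDF]" + TAIL
UTF8_3 = (rb"\xE0[\xA0-\xBF]" + TAIL +
          rb"|[\xE1-\xEC]" + TAIL + TAIL +
          rb"|\xED[\x80-\x9F]" + TAIL +
          rb"|[\xEE-\xEF]" + TAIL + TAIL)
UTF8_4 = (rb"\xF0[\x90-\xBF]" + TAIL + TAIL +
          rb"|[\xF1-\xF3]" + TAIL + TAIL + TAIL +
          rb"|\xF4[\x80-\x8F]" + TAIL + TAIL)

# UTF8-octets = *( UTF8-char )
UTF8_OCTETS = re.compile(
    rb"(?:" + rb"|".join([UTF8_1, UTF8_2, UTF8_3, UTF8_4]) + rb")*")

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole octet string matches UTF8-octets."""
    return UTF8_OCTETS.fullmatch(data) is not None
```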
+
+5.  Versions of the standards
+
+   ISO/IEC 10646 is updated from time to time by publication of
+   amendments and additional parts; similarly, new versions of the
+   Unicode standard are published over time.  Each new version obsoletes
+   and replaces the previous one, but implementations, and more
+   significantly data, are not updated instantly.
+
+   In general, the changes amount to adding new characters, which does
+   not pose particular problems with old data.  In 1996, Amendment 5 to
+   the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded
+   the Korean Hangul block, thereby making any previous data containing
+   Hangul characters invalid under the new version.  Unicode 2.0 has the
+   same difference from Unicode 1.1.  The justification for allowing
+   such an incompatible change was that there were no major
+   implementations and no significant amounts of data containing Hangul.
+   The incident has been dubbed the "Korean mess", and the relevant
+   committees have pledged to never, ever again make such an
+   incompatible change (see Unicode Consortium Policies [1]).
+
+   New versions, and in particular any incompatible changes, have
+   consequences regarding MIME charset labels, to be discussed in MIME
+   registration (Section 8).
+
+6.  Byte order mark (BOM)
+
+   The UCS character U+FEFF "ZERO WIDTH NO-BREAK SPACE" is also known
+   informally as "BYTE ORDER MARK" (abbreviated "BOM").  This character
+   can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but
+   the BOM name hints at a second possible usage of the character:  to
+   prepend a U+FEFF character to a stream of UCS characters as a
+   "signature".  A receiver of such a serialized stream may then use the
+   initial character as a hint that the stream consists of UCS
+   characters and also to recognize which UCS encoding is involved and,
+   with encodings having a multi-octet encoding unit, as a way to
+
+
+
+
+
+Yergeau                     Standards Track                     [Page 6]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+   recognize the serialization order of the octets.  UTF-8 having a
+   single-octet encoding unit, this last function is useless and the BOM
+   will always appear as the octet sequence EF BB BF.
+
+   It is important to understand that the character U+FEFF appearing at
+   any position other than the beginning of a stream MUST be interpreted
+   with the semantics for the zero-width non-breaking space, and MUST
+   NOT be interpreted as a signature.  When interpreted as a signature,
+   the Unicode standard suggests that an initial U+FEFF character may be
+   stripped before processing the text.  Such stripping is necessary in
+   some cases (e.g., when concatenating two strings, because otherwise
+   the resulting string may contain an unintended "ZERO WIDTH NO-BREAK
+   SPACE" at the connection point), but might affect an external process
+   at a different layer (such as a digital signature or a count of the
+   characters) that is relying on the presence of all characters in the
+   stream.  It is therefore RECOMMENDED to avoid stripping an initial
+   U+FEFF interpreted as a signature without a good reason, to ignore it
+   instead of stripping it when appropriate (such as for display) and to
+   strip it only when really necessary.
+
+   U+FEFF in the first position of a stream MAY be interpreted as a
+   zero-width non-breaking space, and is not always a signature.  In an
+   attempt at diminishing this uncertainty, Unicode 3.2 adds a new
+   character, U+2060 "WORD JOINER", with exactly the same semantics and
+   usage as U+FEFF except for the signature function, and strongly
+   recommends its exclusive use for expressing word-joining semantics.
+   Eventually, following this recommendation will make it all but
+   certain that any initial U+FEFF is a signature, not an intended "ZERO
+   WIDTH NO-BREAK SPACE".
+
+   In the meantime, the uncertainty unfortunately remains and may affect
+   Internet protocols.  Protocol specifications MAY restrict usage of
+   U+FEFF as a signature in order to reduce or eliminate the potential
+   ill effects of this uncertainty.  In the interest of striking a
+   balance between the advantages (reduction of uncertainty) and
+   drawbacks (loss of the signature function) of such restrictions, it
+   is useful to distinguish a few cases:
+
+   o  A protocol SHOULD forbid use of U+FEFF as a signature for those
+      textual protocol elements that the protocol mandates to be always
+      UTF-8, the signature function being totally useless in those
+      cases.
+
+   o  A protocol SHOULD also forbid use of U+FEFF as a signature for
+      those textual protocol elements for which the protocol provides
+      character encoding identification mechanisms, when it is expected
+      that implementations of the protocol will be in a position to
+      always use the mechanisms properly.  This will be the case when
+
+
+
+Yergeau                     Standards Track                     [Page 7]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+      the protocol elements are maintained tightly under the control of
+      the implementation from the time of their creation to the time of
+      their (properly labeled) transmission.
+
+   o  A protocol SHOULD NOT forbid use of U+FEFF as a signature for
+      those textual protocol elements for which the protocol does not
+      provide character encoding identification mechanisms, when a ban
+      would be unenforceable, or when it is expected that
+      implementations of the protocol will not be in a position to
+      always use the mechanisms properly.  The latter two cases are
+      likely to occur with larger protocol elements such as MIME
+      entities, especially when implementations of the protocol will
+      obtain such entities from file systems, from protocols that do not
+      have encoding identification mechanisms for payloads (such as FTP)
+      or from other protocols that do not guarantee proper
+      identification of character encoding (such as HTTP).
+
+   When a protocol forbids use of U+FEFF as a signature for a certain
+   protocol element, then any initial U+FEFF in that protocol element
+   MUST be interpreted as a "ZERO WIDTH NO-BREAK SPACE".  When a
+   protocol does NOT forbid use of U+FEFF as a signature for a certain
+   protocol element, then implementations SHOULD be prepared to handle a
+   signature in that element and react appropriately: using the
+   signature to identify the character encoding as necessary and
+   stripping or ignoring the signature as appropriate.
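The two cases in the preceding paragraph can be sketched in Python as below. The function name `text_payload` and its `signature_allowed` flag are hypothetical, standing in for whatever a given protocol specifies. Whether a signature is stripped or merely ignored remains an application decision; this sketch strips it only to keep the example short.

```python
def text_payload(data: bytes, signature_allowed: bool) -> str:
    """Decode a UTF-8 protocol element, handling an initial U+FEFF.

    `signature_allowed` is a hypothetical flag saying whether the
    protocol permits U+FEFF as a signature for this element.
    """
    text = data.decode("utf-8")      # the plain "utf-8" codec keeps U+FEFF
    if signature_allowed and text.startswith("\ufeff"):
        return text[1:]              # initial U+FEFF treated as a signature
    # Signature forbidden: an initial U+FEFF is a ZERO WIDTH NO-BREAK SPACE
    # and stays in the text.  U+FEFF elsewhere is never a signature.
    return text
```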
+
+7.  Examples
+
+   The character sequence U+0041 U+2262 U+0391 U+002E "A<NOT IDENTICAL
+   TO><ALPHA>." is encoded in UTF-8 as follows:
+
+       --+--------+-----+--
+       41 E2 89 A2 CE 91 2E
+       --+--------+-----+--
+
+   The character sequence U+D55C U+AD6D U+C5B4 (Korean "hangugeo",
+   meaning "the Korean language") is encoded in UTF-8 as follows:
+
+       --------+--------+--------
+       ED 95 9C EA B5 AD EC 96 B4
+       --------+--------+--------
+
+   The character sequence U+65E5 U+672C U+8A9E (Japanese "nihongo",
+   meaning "the Japanese language") is encoded in UTF-8 as follows:
+
+       --------+--------+--------
+       E6 97 A5 E6 9C AC E8 AA 9E
+       --------+--------+--------
+
+
+
+Yergeau                     Standards Track                     [Page 8]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+   The character U+233B4 (a Chinese character meaning 'stump of tree'),
+   prepended with a UTF-8 BOM, is encoded in UTF-8 as follows:
+
+       --------+-----------
+       EF BB BF F0 A3 8E B4
+       --------+-----------
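The four examples above can be checked against an existing UTF-8 implementation. The short Python sketch below uses the language's built-in codec; the helper name `hex_octets` is an assumption for illustration.

```python
def hex_octets(text: str) -> str:
    """Return the UTF-8 encoding of `text` as space-separated hex octets."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

examples = {
    "A\u2262\u0391.":       "41 E2 89 A2 CE 91 2E",
    "\uD55C\uAD6D\uC5B4":   "ED 95 9C EA B5 AD EC 96 B4",
    "\u65E5\u672C\u8A9E":   "E6 97 A5 E6 9C AC E8 AA 9E",
    "\uFEFF\U000233B4":     "EF BB BF F0 A3 8E B4",   # BOM + U+233B4
}

for text, expected in examples.items():
    assert hex_octets(text) == expected
```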
+
+8.  MIME registration
+
+   This memo serves as the basis for registration of the MIME charset
+   parameter for UTF-8, according to [RFC2978].  The charset parameter
+   value is "UTF-8".  This string labels media types containing text
+   consisting of characters from the repertoire of ISO/IEC 10646
+   including all amendments at least up to amendment 5 of the 1993
+   edition (Korean block), encoded to a sequence of octets using the
+   encoding scheme outlined above.  UTF-8 is suitable for use in MIME
+   content types under the "text" top-level type.
+
+   It is noteworthy that the label "UTF-8" does not contain a version
+   identification, referring generically to ISO/IEC 10646.  This is
+   intentional, the rationale being as follows:
+
+   A MIME charset label is designed to give just the information needed
+   to interpret a sequence of bytes received on the wire into a sequence
+   of characters, nothing more (see [RFC2045], section 2.2).  As long as
+   a character set standard does not change incompatibly, version
+   numbers serve no purpose, because one gains nothing by learning from
+   the tag that newly assigned characters may be received that one
+   doesn't know about.  The tag itself doesn't teach anything about the
+   new characters, which are going to be received anyway.
+
+   Hence, as long as the standards evolve compatibly, the apparent
+   advantage of having labels that identify the versions is only that,
+   apparent.  But there is a disadvantage to such version-dependent
+   labels: when an older application receives data accompanied by a
+   newer, unknown label, it may fail to recognize the label and be
+   completely unable to deal with the data, whereas a generic, known
+   label would have triggered mostly correct processing of the data,
+   which may well not contain any new characters.
+
+   Now the "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible
+   change, in principle contradicting the appropriateness of a version
+   independent MIME charset label as described above.  But the
+   compatibility problem can only appear with data containing Korean
+   Hangul characters encoded according to Unicode 1.1 (or equivalently
+   ISO/IEC 10646 before amendment 5), and there is arguably no such data
+   to worry about, this being the very reason the incompatible change
+   was deemed acceptable.
+
+
+
+Yergeau                     Standards Track                     [Page 9]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+   In practice, then, a version-independent label is warranted, provided
+   the label is understood to refer to all versions after Amendment 5,
+   and provided no incompatible change actually occurs.  Should
+   incompatible changes occur in a later version of ISO/IEC 10646, the
+   MIME charset label defined here will stay aligned with the previous
+   version until and unless the IETF specifically decides otherwise.
+
+9.  IANA Considerations
+
+   The entry for UTF-8 in the IANA charset registry has been updated to
+   point to this memo.
+
+10.  Security Considerations
+
+   Implementers of UTF-8 need to consider the security aspects of how
+   they handle illegal UTF-8 sequences.  It is conceivable that in some
+   circumstances an attacker would be able to exploit an incautious
+   UTF-8 parser by sending it an octet sequence that is not permitted by
+   the UTF-8 syntax.
+
+   A particularly subtle form of this attack can be carried out against
+   a parser which performs security-critical validity checks against the
+   UTF-8 encoded form of its input, but interprets certain illegal octet
+   sequences as characters.  For example, a parser might prohibit the
+   NUL character when encoded as the single-octet sequence 00, but
+   erroneously allow the illegal two-octet sequence C0 80 and interpret
+   it as a NUL character.  Another example might be a parser which
+   prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
+   illegal octet sequence 2F C0 AE 2E 2F.  This last exploit has
+   actually been used in a widespread virus attacking Web servers in
+   2001; thus, the security threat is very real.
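One way to avoid this class of problem is to decode strictly first and run the security check on the decoded characters, so that an illegal sequence is rejected before it can be interpreted. The Python sketch below illustrates the idea; the function name `contains_dot_dot` is hypothetical, and the built-in codec is relied on to reject C0 AE.

```python
def contains_dot_dot(raw: bytes) -> bool:
    """Strictly decode `raw`, then look for a "/../" path element.

    Raises UnicodeDecodeError for octet sequences that are not valid
    UTF-8, such as the overlong form C0 AE.
    """
    text = raw.decode("utf-8", errors="strict")
    return "/../" in text
```

Here `contains_dot_dot(bytes.fromhex("2F C0 AE 2E 2F"))` raises an error instead of letting the disguised "/../" through.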
+
+   Another security issue occurs when encoding to UTF-8: the ISO/IEC
+   10646 description of UTF-8 allows encoding character numbers up to
+   U+7FFFFFFF, yielding sequences of up to 6 bytes.  There is therefore
+   a risk of buffer overflow if the range of character numbers is not
+   explicitly limited to U+10FFFF or if buffer sizing doesn't take into
+   account the possibility of 5- and 6-byte sequences.
+
+   Security may also be impacted by a characteristic of several
+   character encodings, including UTF-8: the "same thing" (as far as a
+   user can tell) can be represented by several distinct character
+   sequences.  For instance, an e with acute accent can be represented
+   by the precomposed U+00E9 E ACUTE character or by the canonically
+   equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE).  Even though
+   UTF-8 provides a single byte sequence for each character sequence,
+   the existence of multiple character sequences for "the same thing"
+   may have security consequences whenever string matching, indexing,
+
+
+
+Yergeau                     Standards Track                    [Page 10]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+   searching, sorting, regular expression matching and selection are
+   involved.  An example would be string matching of an identifier
+   appearing in a credential and in access control list entries.  This
+   issue is amenable to solutions based on Unicode Normalization Forms,
+   see [UAX15].
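A small Python sketch of the normalization approach follows. The function name `same_identifier` is an assumption, and the choice of NFC rather than another normalization form is made here only for illustration; the right form depends on the application.

```python
import unicodedata

def same_identifier(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00E9"     # E ACUTE
decomposed = "e\u0301"     # E + COMBINING ACUTE

# The UTF-8 octets differ (C3 A9 versus 65 CC 81), so a byte-wise
# comparison fails, yet the normalized strings compare equal.
assert precomposed.encode("utf-8") != decomposed.encode("utf-8")
assert same_identifier(precomposed, decomposed)
```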
+
+11.  Acknowledgements
+
+   The following have participated in the drafting and discussion of
+   this memo: James E. Agenbroad, Harald Alvestrand, Andries Brouwer,
+   Mark Davis, Martin J. Duerst, Patrick Faltstrom, Ned Freed, David
+   Goldsmith, Tony Hansen, Edwin F. Hart, Paul Hoffman, David Hopwood,
+   Simon Josefsson, Kent Karlsson, Dan Kohn, Markus Kuhn, Michael Kung,
+   Alain LaBonte, Ira McDonald, Alexey Melnikov, MURATA Makoto, John
+   Gardiner Myers, Chris Newman, Dan Oscarsson, Roozbeh Pournader,
+   Murray Sargent, Markus Scherer, Keld Simonsen, Arnold Winkler,
+   Kenneth Whistler and Misha Wolf.
+
+12.  Changes from RFC 2279
+
+   o  Restricted the range of characters to 0000-10FFFF (the UTF-16
+      accessible range).
+
+   o  Made Unicode the source of the normative definition of UTF-8,
+      keeping ISO/IEC 10646 as the reference for characters.
+
+   o  Straightened out terminology.  UTF-8 now described in terms of an
+      encoding form of the character number.  UCS-2 and UCS-4 almost
+      disappeared.
+
+   o  Turned the note warning against decoding of invalid sequences into
+      a normative MUST NOT.
+
+   o  Added a new section about the UTF-8 BOM, with advice for
+      protocols.
+
+   o  Removed suggested UNICODE-1-1-UTF-8 MIME charset registration.
+
+   o  Added an ABNF syntax for valid UTF-8 octet sequences.
+
+   o  Expanded Security Considerations section, in particular impact of
+      Unicode normalization.
+
+
+
+
+
+
+
+
+
+Yergeau                     Standards Track                    [Page 11]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+13.  Normative References
+
+   [RFC2119]   Bradner, S., "Key words for use in RFCs to Indicate
+               Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+   [ISO.10646] International Organization for Standardization,
+               "Information Technology - Universal Multiple-octet coded
+               Character Set (UCS)", ISO/IEC Standard 10646,  comprised
+               of ISO/IEC 10646-1:2000, "Information technology --
+               Universal Multiple-Octet Coded Character Set (UCS) --
+               Part 1: Architecture and Basic Multilingual Plane",
+               ISO/IEC 10646-2:2001, "Information technology --
+               Universal Multiple-Octet Coded Character Set (UCS) --
+               Part 2:  Supplementary Planes" and ISO/IEC 10646-
+               1:2000/Amd 1:2002, "Mathematical symbols and other
+               characters".
+
+   [UNICODE]   The Unicode Consortium, "The Unicode Standard -- Version
+               4.0",  defined by The Unicode Standard, Version 4.0
+               (Boston, MA, Addison-Wesley, 2003.  ISBN 0-321-18578-1),
+               April 2003, <http://www.unicode.org/unicode/standard/
+               versions/enumeratedversions.html#Unicode_4_0_0>.
+
+14.  Informative References
+
+   [CESU-8]    Phipps, T., "Unicode Technical Report #26: Compatibility
+               Encoding Scheme for UTF-16: 8-Bit (CESU-8)", UTR 26,
+               April 2002,
+               <http://www.unicode.org/unicode/reports/tr26/>.
+
+   [FSS_UTF]   X/Open Company Ltd., "X/Open Preliminary Specification --
+               File System Safe UCS Transformation Format (FSS-UTF)",
+               May 1993, <http://wwwold.dkuug.dk/jtc1/sc22/wg20/docs/
+               N193-FSS-UTF.pdf>.
+
+   [RFC2045]   Freed, N. and N. Borenstein, "Multipurpose Internet Mail
+               Extensions (MIME) Part One: Format of Internet Message
+               Bodies", RFC 2045, November 1996.
+
+   [RFC2234]   Crocker, D. and P. Overell, "Augmented BNF for Syntax
+               Specifications: ABNF", RFC 2234, November 1997.
+
+   [RFC2978]   Freed, N. and J. Postel, "IANA Charset Registration
+               Procedures", BCP 19, RFC 2978, October 2000.
+
+
+
+
+
+
+
+Yergeau                     Standards Track                    [Page 12]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+   [UAX15]     Davis, M. and M. Duerst, "Unicode Standard Annex #15:
+               Unicode Normalization Forms",  An integral part of The
+               Unicode Standard, Version 4.0.0, April 2003, <http://
+               www.unicode.org/unicode/reports/tr15>.
+
+   [US-ASCII]  American National Standards Institute, "Coded Character
+               Set - 7-bit American Standard Code for Information
+               Interchange", ANSI X3.4, 1986.
+
+15.  URIs
+
+   [1]  <http://www.unicode.org/unicode/standard/policies.html>
+
+16.  Intellectual Property Statement
+
+   The IETF takes no position regarding the validity or scope of any
+   intellectual property or other rights that might be claimed to
+   pertain to the implementation or use of the technology described in
+   this document or the extent to which any license under such rights
+   might or might not be available; neither does it represent that it
+   has made any effort to identify any such rights.  Information on the
+   IETF's procedures with respect to rights in standards-track and
+   standards-related documentation can be found in BCP-11.  Copies of
+   claims of rights made available for publication and any assurances of
+   licenses to be made available, or the result of an attempt made to
+   obtain a general license or permission for the use of such
+   proprietary rights by implementors or users of this specification can
+   be obtained from the IETF Secretariat.
+
+   The IETF invites any interested party to bring to its attention any
+   copyrights, patents or patent applications, or other proprietary
+   rights which may cover technology that may be required to practice
+   this standard.  Please address the information to the IETF Executive
+   Director.
+
+17.  Author's Address
+
+   Francois Yergeau
+   Alis Technologies
+   100, boul. Alexis-Nihon, bureau 600
+   Montreal, QC  H4M 2P2
+   Canada
+
+   Phone: +1 514 747 2547
+   Fax:   +1 514 747 2561
+   EMail: fyergeau@alis.com
+
+
+
+
+
+Yergeau                     Standards Track                    [Page 13]
+
+RFC 3629                         UTF-8                     November 2003
+
+
+18.  Full Copyright Statement
+
+   Copyright (C) The Internet Society (2003).  All Rights Reserved.
+
+   This document and translations of it may be copied and furnished to
+   others, and derivative works that comment on or otherwise explain it
+   or assist in its implementation may be prepared, copied, published
+   and distributed, in whole or in part, without restriction of any
+   kind, provided that the above copyright notice and this paragraph are
+   included on all such copies and derivative works.  However, this
+   document itself may not be modified in any way, such as by removing
+   the copyright notice or references to the Internet Society or other
+   Internet organizations, except as needed for the purpose of
+   developing Internet standards in which case the procedures for
+   copyrights defined in the Internet Standards process must be
+   followed, or as required to translate it into languages other than
+   English.
+
+   The limited permissions granted above are perpetual and will not be
+   revoked by the Internet Society or its successors or assignees.
+
+   This document and the information contained herein is provided on an
+   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
+   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
+   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
+   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
+   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Acknowledgement
+
+   Funding for the RFC Editor function is currently provided by the
+   Internet Society.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Yergeau                     Standards Track                    [Page 14]
+
-- 
cgit v1.2.3