1 files changed, 395 insertions, 0 deletions
diff --git a/doc/rfc/rfc5895.txt b/doc/rfc/rfc5895.txt
new file mode 100644
index 0000000..446d16a
--- /dev/null
+++ b/doc/rfc/rfc5895.txt
@@ -0,0 +1,395 @@
+
+
+
+
+
+
+Independent Submission                                        P. Resnick
+Request for Comments: 5895                         Qualcomm Incorporated
+Category: Informational                                       P. Hoffman
+ISSN: 2070-1721                                           VPN Consortium
+                                                          September 2010
+
+
+                         Mapping Characters for
+       Internationalized Domain Names in Applications (IDNA) 2008
+
+Abstract
+
+   In the original version of the Internationalized Domain Names in
+   Applications (IDNA) protocol, any Unicode code points taken from user
+   input were mapped into a set of Unicode code points that "made
+   sense", and then encoded and passed to the domain name system (DNS).
+   The IDNA2008 protocol (described in RFCs 5890, 5891, 5892, and 5893)
+   presumes that the input to the protocol comes from a set of
+   "permitted" code points, which it then encodes and passes to the DNS,
+   but does not specify what to do with the result of user input.  This
+   document describes the actions that can be taken by an implementation
+   between receiving user input and passing permitted code points to the
+   new IDNA protocol.
+
+Status of This Memo
+
+   This document is not an Internet Standards Track specification; it is
+   published for informational purposes.
+
+   This is a contribution to the RFC Series, independently of any other
+   RFC stream.  The RFC Editor has chosen to publish this document at
+   its discretion and makes no statement about its value for
+   implementation or deployment.  Documents approved for publication by
+   the RFC Editor are not a candidate for any level of Internet
+   Standard; see Section 2 of RFC 5741.
+
+   Information about the current status of this document, any errata,
+   and how to provide feedback on it may be obtained at
+   http://www.rfc-editor.org/info/rfc5895.
+
+
+
+
+
+
+
+
+
+
+
+
+Resnick & Hoffman             Informational                     [Page 1]
+
+RFC 5895                      IDNA Mapping                September 2010
+
+
+Copyright Notice
+
+   Copyright (c) 2010 IETF Trust and the persons identified as the
+   document authors.  All rights reserved.
+
+   This document is subject to BCP 78 and the IETF Trust's Legal
+   Provisions Relating to IETF Documents
+   (http://trustee.ietf.org/license-info) in effect on the date of
+   publication of this document.  Please review these documents
+   carefully, as they describe your rights and restrictions with respect
+   to this document.
+
+1.  Introduction
+
+   This document describes the operations that can be applied to user
+   input in order to get it into a form that is acceptable to the
+   Internationalized Domain Names in Applications (IDNA) protocol
+   [IDNA2008protocol].  It includes a general implementation procedure
+   for mapping.
+
+   It should be noted that this document does not specify the behavior
+   of a protocol that appears "on the wire".  It describes an operation
+   that is to be applied to user input in order to prepare that user
+   input for use in an "on the network" protocol.  As unusual as this
+   may be for a document concerning Internet protocols, it is necessary
+   to describe this operation for implementers who may have designed
+   around the original IDNA protocol (herein referred to as IDNA2003),
+   which conflates this user-input operation into the protocol.
+
+   It is very important to note that there are many potentially valid
+   mappings of characters from user input.  The mapping described in
+   this document is the basis for other mappings, and is not likely to
+   be useful without modification.  Any useful mapping will have
+   features designed to reduce the surprise for users and is likely to
+   be slightly (or sometimes radically) different depending on the
+   locale of the user, the type of input being used (such as typing,
+   copy-and-paste, voice, and so on), the type of application used, etc.
+   Although most common mappings will probably produce similar results
+   for the same input, there will be subtle differences between
+   applications.
+
+1.1.  The Dividing Line between User Interface and Protocol
+
+   The user interface to applications is much more complicated than most
+   network implementers think.  When we say "the user enters an
+   internationalized domain name in the application", we are talking
+   about a very complex process that encompasses everything from the
+   user formulating the name and deciding which symbols to use to
+
+
+
+Resnick & Hoffman             Informational                     [Page 2]
+
+RFC 5895                      IDNA Mapping                September 2010
+
+
+   express that name, to the user entering the symbols into the computer
+   using some input method (be it a keyboard, a stylus, or even a voice
+   recognition program), to the computer interpreting that input (be it
+   keyboard scan codes, a graphical representation, or digitized sounds)
+   into some representation of those symbols, through finally
+   normalizing those symbols into a particular character repertoire in
+   an encoding recognizable to IDNA processes and the domain name
+   system.
+
+   Considerations for a user interface for internationalized domain
+   names involve taking into account culture, context, and locale for
+   any given user.  A simple and well-known example is the lowercasing
+   of the letter LATIN CAPITAL LETTER I (U+0049) when it is used in
+   Turkish and other languages.  A capital "I" in Turkish is properly
+   lowercased to a LATIN SMALL LETTER DOTLESS I (U+0131), not to a LATIN
+   SMALL LETTER I (U+0069).  This lowercasing is clearly dependent on
+   the locale of the system and/or the locale of the user.  Using a
+   single context-free mapping without considering the user interface
+   properties has the potential of doing exactly the wrong thing for the
+   user.
+
+   The original version of IDNA conflated user interface processing and
+   protocol.  It took whatever characters the user produced in whatever
+   encoding the application used, assumed some conversion to Unicode
+   code points, and then without regard to context, locale, or anything
+   about the user's intentions, mapped them into a particular set of
+   other characters, and then re-encoded them in Punycode, in order to
+   have the entire operation be contained within the protocol.  Ignoring
+   context, locale, and user preference in the IDNA protocol made life
+   significantly less complicated for the application developer, but at
+   the expense of violating the principle of "least user surprise" for
+   consumers and producers of domain names.
+
+   In IDNA2008, the dividing line between "user interface" and
+   "protocol" is clear.  The IDNA2008 specification defines the protocol
+   part of IDNA: it explicitly does not deal with the user interface.
+   Mappings such as the one described in this document explicitly deal
+   with the user interface and not the protocol.  That is, a mapping is
+   only to be applied before a string of characters is treated as a
+   domain name (in the "user interface") and is never to be applied
+   during domain name processing (in the "protocol").
+
+1.2.  The Design of This Mapping
+
+   The user interface mapping in this document is a set of expansions to
+   IDNA2008 that are meant to be sensible and friendly and mostly
+   obvious to people throughout the world when using typical
+   applications with domain names that are entered by hand.  It is also
+
+
+
+Resnick & Hoffman             Informational                     [Page 3]
+
+RFC 5895                      IDNA Mapping                September 2010
+
+
+   designed to let applications be mostly backwards compatible with
+   IDNA2003.  By definition, it cannot meet all of those design goals
+   for all people, and in fact is known to fail on some of those goals
+   for quite large populations of people.
+
+   A good mapping in the real world might use the "sensible and friendly
+   and mostly obvious" design goal but come up with a different
+   algorithm.  Many algorithms will have results that are close to what
+   is described here, but will differ in assumptions about the users'
+   way of thinking or typing.  Having said that, it is likely that some
+   mappings will be significantly different.  For example, a mapping
+   might apply to a spoken user interface instead of a typed one.
+   Another example is that a mapping might be different for users who
+   are typing than for users who are copying-and-pasting from different
+   applications.  Yet another example is that a user interface that
+   allows typed input that is transliterated from Latin characters could
+   have very different mappings than one that applies to typing in other
+   character sets; this would be typical in a Pinyin input method for
+   Chinese characters.
+
+2.  The General Procedure
+
+   This section defines a general algorithm that applications ought to
+   implement in order to produce Unicode code points that will be valid
+   under the IDNA protocol.  An application might implement the full
+   mapping as described below, or it can choose a different mapping.
+   This mapping is very general and was designed to be acceptable to the
+   widest user community, but as stated above, it does not take into
+   account any particular context, culture, or locale.
+
+   The general algorithm that an application (or the input method
+   provided by an operating system) ought to use is relatively
+   straightforward:
+
+   1.  Uppercase characters are mapped to their lowercase equivalents by
+       using the algorithm for mapping case in Unicode characters.  This
+       step was chosen because it makes the output behave more like
+       ASCII host names.
+
+   2.  Fullwidth and halfwidth characters (those defined with
+       Decomposition Types <wide> and <narrow>) are mapped to their
+       decomposition mappings as shown in the Unicode character
+       database.  This step was chosen because many input mechanisms,
+       particularly in Asia, do not allow you to easily enter characters
+       in the form used by IDNA2008.  Even if they do allow the correct
+       character form, the user might not know which form they are
+       entering.
+
+
+
+
+Resnick & Hoffman             Informational                     [Page 4]
+
+RFC 5895                      IDNA Mapping                September 2010
+
+
+   3.  All characters are mapped using Unicode Normalization Form C
+       (NFC).  This step was chosen because it maps combinations of
+       combining characters into canonical composed form.  As with the
+       fullwidth/halfwidth mapping, users are not generally aware of the
+       particular form of characters that they are entering, and
+       IDNA2008 requires that only the canonical composed forms from NFC
+       be used.
+
+   4.  [IDNA2008protocol] is specified such that the protocol acts on
+       the individual labels of the domain name.  If an implementation
+       of this mapping is also performing the step of separation of the
+       parts of a domain name into labels by using the FULL STOP
+       character (U+002E), the IDEOGRAPHIC FULL STOP character (U+3002)
+       can be mapped to the FULL STOP before label separation occurs.
+       There are other characters that are used as "full stops" that one
+       could consider mapping as label separators, but their use as such
+       has not been investigated thoroughly.  This step was chosen
+       because some input mechanisms do not allow the user to easily
+       enter proper label separators.  Only the IDEOGRAPHIC FULL STOP
+       character (U+3002) is added in this mapping because the authors
+       have not fully investigated the applicability of other characters
+       and the environments where they should and should not be
+       considered domain name label separators.
+
+   Note that the steps above are ordered.
+
+   Definitions for the rules in this algorithm can be found in
+   [Unicode52].  Specifically:
+
+   o  Unicode Normalization Form C is defined in Unicode Standard Annex
+      #15 [Unicode-UAX15].
+
+   o  In order to map uppercase characters to their lowercase
+      equivalents (defined in Section 3.13 of [Unicode52]), first map
+      characters to the "Lowercase_Mapping" property (the "<lower>"
+      entry in the second column) in
+      <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>, if any.
+      Then, map characters to the "Simple_Lowercase_Mapping" property
+      (the fourteenth column) in
+      <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>, if any.
+
+   o  In order to map fullwidth and halfwidth characters to their
+      decomposition mappings, map any character whose
+      "Decomposition_Type" (contained in the first part of the sixth
+      column) in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>
+      is either "<wide>" or "<narrow>" to the "Decomposition_Mapping" of
+      that character (contained in the second part of the sixth column)
+      in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>.
+
+
+
+Resnick & Hoffman             Informational                     [Page 5]
+
+RFC 5895                      IDNA Mapping                September 2010
+
+
+   o  The Unicode Character Database [TR44] has useful descriptions of
+      the contents of these files.
+
+   If the mappings in this document are applied to versions of Unicode
+   later than Unicode 5.2, the later versions of the Unicode Standard
+   should be consulted.
+
+   These form a minimal set of mappings that an application should
+   strongly consider doing.  Of course, there are many others that might
+   be done.
+
+3.  Implementing This Mapping
+
+   If you are implementing a mapping for an application or operating
+   system by using exactly the four steps in Section 2, the authors of
+   this document have a request: please don't.  We mean it.  Section 2
+   does not describe a universal mapping algorithm because, as we said,
+   there is no universally-applicable mapping algorithm.
+
+   If you read the material in Section 2 without reading Section 1, go
+   back and carefully read all of Section 1; in many ways, Section 1 is
+   more important than Section 2.  Further, you can probably think of
+   user interface considerations that we did not list in Section 1.  If
+   you did read Section 1 but somehow decided that the algorithm in
+   Section 2 is completely correct for the intended users of your
+   application or operating system, you are probably not thinking hard
+   enough about your intended users.
+
+4.  Security Considerations
+
+   This document suggests creating mappings that might cause confusion
+   for some users while alleviating confusion for other users.  Such
+   confusion is not covered in any depth in this document (nor in the
+   other IDNA-related documents).
+
+5.  Acknowledgements
+
+   This document is the product of many contributions from numerous
+   people in the IETF.
+
+
+
+
+
+
+
+
+
+
+
+
+Resnick & Hoffman             Informational                     [Page 6]
+
+RFC 5895                      IDNA Mapping                September 2010
+
+
+6.  Normative References
+
+   [IDNA2008protocol]  Klensin, J., "Internationalized Domain Names in
+                       Applications (IDNA): Protocol", RFC 5891,
+                       August 2010.
+
+   [TR44]              The Unicode Consortium, "Unicode Standard Annex
+                       #44: Unicode Character Database", September 2009,
+                       <http://www.unicode.org/reports/tr44/
+                       tr44-4.html>.
+
+   [Unicode-UAX15]     The Unicode Consortium, "Unicode Standard Annex
+                       #15: Unicode Normalization Forms, Revision 31",
+                       September 2009, <http://www.unicode.org/reports/
+                       tr15/tr15-31.html>.
+
+   [Unicode52]         The Unicode Consortium.  The Unicode Standard,
+                       Version 5.2.0, defined by: "The Unicode Standard,
+                       Version 5.2.0", (Mountain View, CA: The Unicode
+                       Consortium, 2009. ISBN 978-1-936213-00-9).
+                       <http://www.unicode.org/versions/Unicode5.2.0/>.
+
+Authors' Addresses
+
+   Peter W. Resnick
+   Qualcomm Incorporated
+   5775 Morehouse Drive
+   San Diego, CA  92121-1714
+   US
+
+   Phone: +1 858 651 4478
+   EMail: presnick@qualcomm.com
+   URI:   http://www.qualcomm.com/~presnick/
+
+
+   Paul Hoffman
+   VPN Consortium
+   127 Segre Place
+   Santa Cruz, CA  95060
+   US
+
+   Phone: 1-831-426-9827
+   EMail: paul.hoffman@vpnc.org
+
+
+
+
+
+
+
+
+Resnick & Hoffman             Informational                     [Page 7]
+