Diffstat (limited to 'doc/rfc/rfc5895.txt')
| -rw-r--r-- | doc/rfc/rfc5895.txt | 395 | 
1 files changed, 395 insertions, 0 deletions
diff --git a/doc/rfc/rfc5895.txt b/doc/rfc/rfc5895.txt
new file mode 100644
index 0000000..446d16a
--- /dev/null
+++ b/doc/rfc/rfc5895.txt
@@ -0,0 +1,395 @@
+
+
+
+
+
+
+Independent Submission                                        P. Resnick
+Request for Comments: 5895                         Qualcomm Incorporated
+Category: Informational                                       P. Hoffman
+ISSN: 2070-1721                                           VPN Consortium
+                                                          September 2010
+
+
+                         Mapping Characters for
+       Internationalized Domain Names in Applications (IDNA) 2008
+
+Abstract
+
+   In the original version of the Internationalized Domain Names in
+   Applications (IDNA) protocol, any Unicode code points taken from user
+   input were mapped into a set of Unicode code points that "made
+   sense", and then encoded and passed to the domain name system (DNS).
+   The IDNA2008 protocol (described in RFCs 5890, 5891, 5892, and 5893)
+   presumes that the input to the protocol comes from a set of
+   "permitted" code points, which it then encodes and passes to the DNS,
+   but does not specify what to do with the result of user input.  This
+   document describes the actions that can be taken by an implementation
+   between receiving user input and passing permitted code points to the
+   new IDNA protocol.
+
+Status of This Memo
+
+   This document is not an Internet Standards Track specification; it is
+   published for informational purposes.
+
+   This is a contribution to the RFC Series, independently of any other
+   RFC stream.  The RFC Editor has chosen to publish this document at
+   its discretion and makes no statement about its value for
+   implementation or deployment.  Documents approved for publication by
+   the RFC Editor are not a candidate for any level of Internet
+   Standard; see Section 2 of RFC 5741.
+
+   Information about the current status of this document, any errata,
+   and how to provide feedback on it may be obtained at
+   http://www.rfc-editor.org/info/rfc5895.
+
+
+
+
+
+
+
+
+
+
+
+
+Resnick & Hoffman             Informational                     [Page 1]
+
+RFC 5895                      IDNA Mapping                September 2010
+
+
+Copyright Notice
+
+   Copyright (c) 2010 IETF Trust and the persons identified as the
+   document authors.  All rights reserved.
+
+   This document is subject to BCP 78 and the IETF Trust's Legal
+   Provisions Relating to IETF Documents
+   (http://trustee.ietf.org/license-info) in effect on the date of
+   publication of this document.  Please review these documents
+   carefully, as they describe your rights and restrictions with respect
+   to this document.
+
+1.  Introduction
+
+   This document describes the operations that can be applied to user
+   input in order to get it into a form that is acceptable by the
+   Internationalized Domain Names in Applications (IDNA) protocol
+   [IDNA2008protocol].  It includes a general implementation procedure
+   for mapping.
+
+   It should be noted that this document does not specify the behavior
+   of a protocol that appears "on the wire".  It describes an operation
+   that is to be applied to user input in order to prepare that user
+   input for use in an "on the network" protocol.  As unusual as this
+   may be for a document concerning Internet protocols, it is necessary
+   to describe this operation for implementors who may have designed
+   around the original IDNA protocol (herein referred to as IDNA2003),
+   which conflates this user-input operation into the protocol.
+
+   It is very important to note that there are many potential valid
+   mappings of characters from user input.  The mapping described in
+   this document is the basis for other mappings, and is not likely to
+   be useful without modification.  Any useful mapping will have
+   features designed to reduce the surprise for users and is likely to
+   be slightly (or sometimes radically) different depending on the
+   locale of the user, the type of input being used (such as typing,
+   copy-and-paste, voice, and so on), the type of application used, etc.
+   Although most common mappings will probably produce similar results
+   for the same input, there will be subtle differences between
+   applications.
+
+1.1.  The Dividing Line between User Interface and Protocol
+
+   The user interface to applications is much more complicated than most
+   network implementers think.  When we say "the user enters an
+   internationalized domain name in the application", we are talking
+   about a very complex process that encompasses everything from the
+   user formulating the name and deciding which symbols to use to
+
+
+
+Resnick & Hoffman             Informational                     [Page 2]
+
+RFC 5895                      IDNA Mapping                September 2010
+
+
+   express that name, to the user entering the symbols into the computer
+   using some input method (be it a keyboard, a stylus, or even a voice
+   recognition program), to the computer interpreting that input (be it
+   keyboard scan codes, a graphical representation, or digitized sounds)
+   into some representation of those symbols, through finally
+   normalizing those symbols into a particular character repertoire in
+   an encoding recognizable to IDNA processes and the domain name
+   system.
+
+   Considerations for a user interface for internationalized domain
+   names involves taking into account culture, context, and locale for
+   any given user.  A simple and well-known example is the lowercasing
+   of the letter LATIN CAPITAL LETTER I (U+0049) when it is used in the
+   Turkish and other languages.  A capital "I" in Turkish is properly
+   lowercased to a LATIN SMALL LETTER DOTLESS I (U+0131), not to a LATIN
+   SMALL LETTER I (U+0069).  This lowercasing is clearly dependent on
+   the locale of the system and/or the locale of the user.  Using a
+   single context-free mapping without considering the user interface
+   properties has the potential of doing exactly the wrong thing for the
+   user.
+
+   The original version of IDNA conflated user interface processing and
+   protocol.  It took whatever characters the user produced in whatever
+   encoding the application used, assumed some conversion to Unicode
+   code points, and then without regard to context, locale, or anything
+   about the user's intentions, mapped them into a particular set of
+   other characters, and then re-encoded them in Punycode, in order to
+   have the entire operation be contained within the protocol.  Ignoring
+   context, locale, and user preference in the IDNA protocol made life
+   significantly less complicated for the application developer, but at
+   the expense of violating the principle of "least user surprise" for
+   consumers and producers of domain names.
+
+   In IDNA2008, the dividing line between "user interface" and
+   "protocol" is clear.  The IDNA2008 specification defines the protocol
+   part of IDNA: it explicitly does not deal with the user interface.
+   Mappings such as the one described in this document explicitly deal
+   with the user interface and not the protocol.  That is, a mapping is
+   only to be applied before a string of characters is treated as a
+   domain name (in the "user interface") and is never to be applied
+   during domain name processing (in the "protocol").
+
+1.2.  The Design of This Mapping
+
+   The user interface mapping in this document is a set of expansions to
+   IDNA2008 that are meant to be sensible and friendly and mostly
+   obvious to people throughout the world when using typical
+   applications with domain names that are entered by hand.  It is also
+
+
+
+Resnick & Hoffman             Informational                     [Page 3]
+
+RFC 5895                      IDNA Mapping                September 2010
+
+
+   designed to let applications be mostly backwards compatible with
+   IDNA2003.  By definition, it cannot meet all of those design goals
+   for all people, and in fact is known to fail on some of those goals
+   for quite large populations of people.
+
+   A good mapping in the real world might use the "sensible and friendly
+   and mostly obvious" design goal but come up with a different
+   algorithm.  Many algorithms will have results that are close to what
+   is described here, but will differ in assumptions about the users'
+   way of thinking or typing.  Having said that, it is likely that some
+   mappings will be significantly different.  For example, a mapping
+   might apply to a spoken user interface instead of a typed one.
+   Another example is that a mapping might be different for users that
+   are typing than for users that are copying-and-pasting from different
+   applications.  Yet another example is that a user interface that
+   allows typed input that is transliterated from Latin characters could
+   have very different mappings than one that applies to typing in other
+   character sets; this would be typical in a Pinyin input method for
+   Chinese characters.
+
+2.  The General Procedure
+
+   This section defines a general algorithm that applications ought to
+   implement in order to produce Unicode code points that will be valid
+   under the IDNA protocol.  An application might implement the full
+   mapping as described below, or it can choose a different mapping.
+   This mapping is very general and was designed to be acceptable to the
+   widest user community, but as stated above, it does not take into
+   account any particular context, culture, or locale.
+
+   The general algorithm that an application (or the input method
+   provided by an operating system) ought to use is relatively
+   straightforward:
+
+   1.  Uppercase characters are mapped to their lowercase equivalents by
+       using the algorithm for mapping case in Unicode characters.  This
+       step was chosen because the output will behave more like ASCII
+       host names behave.
+
+   2.  Fullwidth and halfwidth characters (those defined with
+       Decomposition Types <wide> and <narrow>) are mapped to their
+       decomposition mappings as shown in the Unicode character
+       database.  This step was chosen because many input mechanisms,
+       particularly in Asia, do not allow you to easily enter characters
+       in the form used by IDNA2008.  Even if they do allow the correct
+       character form, the user might not know which form they are
+       entering.
+
+
+
+
+Resnick & Hoffman             Informational                     [Page 4]
+
+RFC 5895                      IDNA Mapping                September 2010
+
+
+   3.  All characters are mapped using Unicode Normalization Form C
+       (NFC).  This step was chosen because it maps combinations of
+       combining characters into canonical composed form.  As with the
+       fullwidth/halfwidth mapping, users are not generally aware of the
+       particular form of characters that they are entering, and
+       IDNA2008 requires that only the canonical composed forms from NFC
+       be used.
+
+   4.  [IDNA2008protocol] is specified such that the protocol acts on
+       the individual labels of the domain name.  If an implementation
+       of this mapping is also performing the step of separation of the
+       parts of a domain name into labels by using the FULL STOP
+       character (U+002E), the IDEOGRAPHIC FULL STOP character (U+3002)
+       can be mapped to the FULL STOP before label separation occurs.
+       There are other characters that are used as "full stops" that one
+       could consider mapping as label separators, but their use as such
+       has not been investigated thoroughly.  This step was chosen
+       because some input mechanisms do not allow the user to easily
+       enter proper label separators.  Only the IDEOGRAPHIC FULL STOP
+       character (U+3002) is added in this mapping because the authors
+       have not fully investigated the applicability of other characters
+       and the environments where they should and should not be
+       considered domain name label separators.
+
+   Note that the steps above are ordered.
+
+   Definitions for the rules in this algorithm can be found in
+   [Unicode52].  Specifically:
+
+   o  Unicode Normalization Form C can be found in Annex #15 of
+      [Unicode-UAX15].
+
+   o  In order to map uppercase characters to their lowercase
+      equivalents (defined in Section 3.13 of [Unicode52]), first map
+      characters to the "Lowercase_Mapping" property (the "<lower>"
+      entry in the second column) in
+      <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>, if any.
+      Then, map characters to the "Simple_Lowercase_Mapping" property
+      (the fourteenth column) in
+      <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>, if any.
+
+   o  In order to map fullwidth and halfwidth characters to their
+      decomposition mappings, map any character whose
+      "Decomposition_Type" (contained in the first part of the sixth
+      column) in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>
+      is either "<wide>" or "<narrow>" to the "Decomposition_Mapping" of
+      that character (contained in the second part of the sixth column)
+      in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>.
+
+
+
+Resnick & Hoffman             Informational                     [Page 5]
+
+RFC 5895                      IDNA Mapping                September 2010
+
+
+   o  The Unicode Character Database [TR44] has useful descriptions of
+      the contents of these files.
+
+   If the mappings in this document are applied to versions of Unicode
+   later than Unicode 5.2, the later versions of the Unicode Standard
+   should be consulted.
+
+   These form a minimal set of mappings that an application should
+   strongly consider doing.  Of course, there are many others that might
+   be done.
+
+3.  Implementing This Mapping
+
+   If you are implementing a mapping for an application or operating
+   system by using exactly the four steps in Section 2, the authors of
+   this document have a request: please don't.  We mean it.  Section 2
+   does not describe a universal mapping algorithm because, as we said,
+   there is no universally-applicable mapping algorithm.
+
+   If you read the material in Section 2 without reading Section 1, go
+   back and carefully read all of Section 1; in many ways, Section 1 is
+   more important than Section 2.  Further, you can probably think of
+   user interface considerations that we did not list in Section 1.  If
+   you did read Section 1 but somehow decided that the algorithm in
+   Section 2 is completely correct for the intended users of your
+   application or operating system, you are probably not thinking hard
+   enough about your intended users.
+
+4.  Security Considerations
+
+   This document suggests creating mappings that might cause confusion
+   for some users while alleviating confusion in other users.  Such
+   confusion is not covered in any depth in this document (nor in the
+   other IDNA-related documents).
+
+5.  Acknowledgements
+
+   This document is the product of many contributions from numerous
+   people in the IETF.
+
+
+
+
+
+
+
+
+
+
+
+
+Resnick & Hoffman             Informational                     [Page 6]
+
+RFC 5895                      IDNA Mapping                September 2010
+
+
+6.  Normative References
+
+   [IDNA2008protocol]  Klensin, J., "Internationalized Domain Names in
+                       Applications (IDNA): Protocol", RFC 5891,
+                       August 2010.
+
+   [TR44]              The Unicode Consortium, "Unicode Technical Report
+                       #44: Unicode Character Database", September 2009,
+                       <http://www.unicode.org/reports/tr44/
+                       tr44-4.html>.
+
+   [Unicode-UAX15]     The Unicode Consortium, "Unicode Standard Annex
+                       #15: Unicode Normalization Forms, Revision 31",
+                       September 2009, <http://www.unicode.org/reports/
+                       tr15/tr15-31.html>.
+
+   [Unicode52]         The Unicode Consortium.  The Unicode Standard,
+                       Version 5.2.0, defined by: "The Unicode Standard,
+                       Version 5.2.0", (Mountain View, CA: The Unicode
+                       Consortium, 2009. ISBN 978-1-936213-00-9).
+                       <http://www.unicode.org/versions/Unicode5.2.0/>.
+
+Authors' Addresses
+
+   Peter W. Resnick
+   Qualcomm Incorporated
+   5775 Morehouse Drive
+   San Diego, CA  92121-1714
+   US
+
+   Phone: +1 858 651 4478
+   EMail: presnick@qualcomm.com
+   URI:   http://www.qualcomm.com/~presnick/
+
+
+   Paul Hoffman
+   VPN Consortium
+   127 Segre Place
+   Santa Cruz, CA  95060
+   US
+
+   Phone: 1-831-426-9827
+   EMail: paul.hoffman@vpnc.org
+
+
+
+
+
+
+
+
+Resnick & Hoffman             Informational                     [Page 7]
+
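
Reviewer note: the four ordered steps in Section 2 of the RFC text added by
this diff can be sketched in Python roughly as follows.  This is only an
illustrative sketch and is not part of the RFC or of this commit.  The
function names (map_width, rfc5895_map) are hypothetical.  It assumes that
Python's built-in str.lower() is an acceptable approximation of the
SpecialCasing.txt plus Simple_Lowercase_Mapping procedure the RFC describes,
and it uses whatever Unicode version ships with the running interpreter's
unicodedata module rather than Unicode 5.2.

```python
import unicodedata

FULL_STOP = "\u002e"
IDEOGRAPHIC_FULL_STOP = "\u3002"


def map_width(text: str) -> str:
    """Step 2: replace characters whose Decomposition_Type is <wide> or
    <narrow> with their Decomposition_Mapping."""
    out = []
    for ch in text:
        # unicodedata.decomposition() returns e.g. "<wide> 0041", or ""
        # when the character has no decomposition.
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith(("<wide>", "<narrow>")):
            code_points = decomp.split()[1:]
            out.append("".join(chr(int(cp, 16)) for cp in code_points))
        else:
            out.append(ch)
    return "".join(out)


def rfc5895_map(text: str) -> str:
    """Apply the four steps of Section 2 in the order the RFC lists them."""
    # Step 1: lowercase (assumption: str.lower() approximates the
    # SpecialCasing / UnicodeData procedure; it is locale-independent).
    text = text.lower()
    # Step 2: fullwidth and halfwidth forms to their decomposition mappings.
    text = map_width(text)
    # Step 3: Unicode Normalization Form C.
    text = unicodedata.normalize("NFC", text)
    # Step 4: IDEOGRAPHIC FULL STOP becomes FULL STOP, so that a later
    # label-splitting step sees ordinary label separators.
    return text.replace(IDEOGRAPHIC_FULL_STOP, FULL_STOP)


if __name__ == "__main__":
    # Fullwidth Latin letters with an ideographic full stop.
    print(rfc5895_map("\uff25\uff38\uff21\uff2d\uff30\uff2c\uff25\u3002\uff23\uff2f\uff2d"))
    # prints "example.com"
```

The ordering matters in this sketch: HALFWIDTH KATAKANA LETTER KA followed
by the halfwidth voiced sound mark only composes into a single character
because the width mapping runs before NFC.  Note also that "I".lower()
yields "i" regardless of locale, which is exactly the context-free behavior
that Section 1.1 warns can be wrong for Turkish users, so a real
implementation would likely need locale-aware handling on top of this.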