summaryrefslogtreecommitdiff
path: root/doc/rfc/rfc5895.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/rfc/rfc5895.txt')
-rw-r--r--doc/rfc/rfc5895.txt395
1 files changed, 395 insertions, 0 deletions
diff --git a/doc/rfc/rfc5895.txt b/doc/rfc/rfc5895.txt
new file mode 100644
index 0000000..446d16a
--- /dev/null
+++ b/doc/rfc/rfc5895.txt
@@ -0,0 +1,395 @@
+
+
+
+
+
+
+Independent Submission P. Resnick
+Request for Comments: 5895 Qualcomm Incorporated
+Category: Informational P. Hoffman
+ISSN: 2070-1721 VPN Consortium
+ September 2010
+
+
+ Mapping Characters for
+ Internationalized Domain Names in Applications (IDNA) 2008
+
+Abstract
+
+ In the original version of the Internationalized Domain Names in
+ Applications (IDNA) protocol, any Unicode code points taken from user
+ input were mapped into a set of Unicode code points that "made
+ sense", and then encoded and passed to the domain name system (DNS).
+ The IDNA2008 protocol (described in RFCs 5890, 5891, 5892, and 5893)
+ presumes that the input to the protocol comes from a set of
+ "permitted" code points, which it then encodes and passes to the DNS,
+ but does not specify what to do with the result of user input. This
+ document describes the actions that can be taken by an implementation
+ between receiving user input and passing permitted code points to the
+ new IDNA protocol.
+
+Status of This Memo
+
+ This document is not an Internet Standards Track specification; it is
+ published for informational purposes.
+
+ This is a contribution to the RFC Series, independently of any other
+ RFC stream. The RFC Editor has chosen to publish this document at
+ its discretion and makes no statement about its value for
+ implementation or deployment. Documents approved for publication by
+ the RFC Editor are not a candidate for any level of Internet
+ Standard; see Section 2 of RFC 5741.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc5895.
+
+
+
+
+
+
+
+
+
+
+
+
+Resnick & Hoffman Informational [Page 1]
+
+RFC 5895 IDNA Mapping September 2010
+
+
+Copyright Notice
+
+ Copyright (c) 2010 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document.
+
+1. Introduction
+
+ This document describes the operations that can be applied to user
+ input in order to get it into a form that is acceptable by the
+ Internationalized Domain Names in Applications (IDNA) protocol
+ [IDNA2008protocol]. It includes a general implementation procedure
+ for mapping.
+
+ It should be noted that this document does not specify the behavior
+ of a protocol that appears "on the wire". It describes an operation
+ that is to be applied to user input in order to prepare that user
+ input for use in an "on the network" protocol. As unusual as this
+ may be for a document concerning Internet protocols, it is necessary
+ to describe this operation for implementors who may have designed
+ around the original IDNA protocol (herein referred to as IDNA2003),
+ which conflates this user-input operation into the protocol.
+
+ It is very important to note that there are many potential valid
+ mappings of characters from user input. The mapping described in
+ this document is the basis for other mappings, and is not likely to
+ be useful without modification. Any useful mapping will have
+ features designed to reduce the surprise for users and is likely to
+ be slightly (or sometimes radically) different depending on the
+ locale of the user, the type of input being used (such as typing,
+ copy-and-paste, voice, and so on), the type of application used, etc.
+ Although most common mappings will probably produce similar results
+ for the same input, there will be subtle differences between
+ applications.
+
+1.1. The Dividing Line between User Interface and Protocol
+
+ The user interface to applications is much more complicated than most
+ network implementers think. When we say "the user enters an
+ internationalized domain name in the application", we are talking
+ about a very complex process that encompasses everything from the
+ user formulating the name and deciding which symbols to use to
+
+
+
+Resnick & Hoffman Informational [Page 2]
+
+RFC 5895 IDNA Mapping September 2010
+
+
+ express that name, to the user entering the symbols into the computer
+ using some input method (be it a keyboard, a stylus, or even a voice
+ recognition program), to the computer interpreting that input (be it
+ keyboard scan codes, a graphical representation, or digitized sounds)
+ into some representation of those symbols, through finally
+ normalizing those symbols into a particular character repertoire in
+ an encoding recognizable to IDNA processes and the domain name
+ system.
+
+ Considerations for a user interface for internationalized domain
+ names involves taking into account culture, context, and locale for
+ any given user. A simple and well-known example is the lowercasing
+ of the letter LATIN CAPITAL LETTER I (U+0049) when it is used in the
+ Turkish and other languages. A capital "I" in Turkish is properly
+ lowercased to a LATIN SMALL LETTER DOTLESS I (U+0131), not to a LATIN
+ SMALL LETTER I (U+0069). This lowercasing is clearly dependent on
+ the locale of the system and/or the locale of the user. Using a
+ single context-free mapping without considering the user interface
+ properties has the potential of doing exactly the wrong thing for the
+ user.
+
+ The original version of IDNA conflated user interface processing and
+ protocol. It took whatever characters the user produced in whatever
+ encoding the application used, assumed some conversion to Unicode
+ code points, and then without regard to context, locale, or anything
+ about the user's intentions, mapped them into a particular set of
+ other characters, and then re-encoded them in Punycode, in order to
+ have the entire operation be contained within the protocol. Ignoring
+ context, locale, and user preference in the IDNA protocol made life
+ significantly less complicated for the application developer, but at
+ the expense of violating the principle of "least user surprise" for
+ consumers and producers of domain names.
+
+ In IDNA2008, the dividing line between "user interface" and
+ "protocol" is clear. The IDNA2008 specification defines the protocol
+ part of IDNA: it explicitly does not deal with the user interface.
+ Mappings such as the one described in this document explicitly deal
+ with the user interface and not the protocol. That is, a mapping is
+ only to be applied before a string of characters is treated as a
+ domain name (in the "user interface") and is never to be applied
+ during domain name processing (in the "protocol").
+
+1.2. The Design of This Mapping
+
+ The user interface mapping in this document is a set of expansions to
+ IDNA2008 that are meant to be sensible and friendly and mostly
+ obvious to people throughout the world when using typical
+ applications with domain names that are entered by hand. It is also
+
+
+
+Resnick & Hoffman Informational [Page 3]
+
+RFC 5895 IDNA Mapping September 2010
+
+
+ designed to let applications be mostly backwards compatible with
+ IDNA2003. By definition, it cannot meet all of those design goals
+ for all people, and in fact is known to fail on some of those goals
+ for quite large populations of people.
+
+ A good mapping in the real world might use the "sensible and friendly
+ and mostly obvious" design goal but come up with a different
+ algorithm. Many algorithms will have results that are close to what
+ is described here, but will differ in assumptions about the users'
+ way of thinking or typing. Having said that, it is likely that some
+ mappings will be significantly different. For example, a mapping
+ might apply to a spoken user interface instead of a typed one.
+ Another example is that a mapping might be different for users that
+ are typing than for users that are copying-and-pasting from different
+ applications. Yet another example is that a user interface that
+ allows typed input that is transliterated from Latin characters could
+ have very different mappings than one that applies to typing in other
+ character sets; this would be typical in a Pinyin input method for
+ Chinese characters.
+
+2. The General Procedure
+
+ This section defines a general algorithm that applications ought to
+ implement in order to produce Unicode code points that will be valid
+ under the IDNA protocol. An application might implement the full
+ mapping as described below, or it can choose a different mapping.
+ This mapping is very general and was designed to be acceptable to the
+ widest user community, but as stated above, it does not take into
+ account any particular context, culture, or locale.
+
+ The general algorithm that an application (or the input method
+ provided by an operating system) ought to use is relatively
+ straightforward:
+
+ 1. Uppercase characters are mapped to their lowercase equivalents by
+ using the algorithm for mapping case in Unicode characters. This
+ step was chosen because the output will behave more like ASCII
+ host names behave.
+
+ 2. Fullwidth and halfwidth characters (those defined with
+ Decomposition Types <wide> and <narrow>) are mapped to their
+ decomposition mappings as shown in the Unicode character
+ database. This step was chosen because many input mechanisms,
+ particularly in Asia, do not allow you to easily enter characters
+ in the form used by IDNA2008. Even if they do allow the correct
+ character form, the user might not know which form they are
+ entering.
+
+
+
+
+Resnick & Hoffman Informational [Page 4]
+
+RFC 5895 IDNA Mapping September 2010
+
+
+ 3. All characters are mapped using Unicode Normalization Form C
+ (NFC). This step was chosen because it maps combinations of
+ combining characters into canonical composed form. As with the
+ fullwidth/halfwidth mapping, users are not generally aware of the
+ particular form of characters that they are entering, and
+ IDNA2008 requires that only the canonical composed forms from NFC
+ be used.
+
+ 4. [IDNA2008protocol] is specified such that the protocol acts on
+ the individual labels of the domain name. If an implementation
+ of this mapping is also performing the step of separation of the
+ parts of a domain name into labels by using the FULL STOP
+ character (U+002E), the IDEOGRAPHIC FULL STOP character (U+3002)
+ can be mapped to the FULL STOP before label separation occurs.
+ There are other characters that are used as "full stops" that one
+ could consider mapping as label separators, but their use as such
+ has not been investigated thoroughly. This step was chosen
+ because some input mechanisms do not allow the user to easily
+ enter proper label separators. Only the IDEOGRAPHIC FULL STOP
+ character (U+3002) is added in this mapping because the authors
+ have not fully investigated the applicability of other characters
+ and the environments where they should and should not be
+ considered domain name label separators.
+
+ Note that the steps above are ordered.
+
+ Definitions for the rules in this algorithm can be found in
+ [Unicode52]. Specifically:
+
+ o Unicode Normalization Form C can be found in Annex #15 of
+ [Unicode-UAX15].
+
+ o In order to map uppercase characters to their lowercase
+ equivalents (defined in Section 3.13 of [Unicode52]), first map
+ characters to the "Lowercase_Mapping" property (the "<lower>"
+ entry in the second column) in
+ <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>, if any.
+ Then, map characters to the "Simple_Lowercase_Mapping" property
+ (the fourteenth column) in
+ <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>, if any.
+
+ o In order to map fullwidth and halfwidth characters to their
+ decomposition mappings, map any character whose
+ "Decomposition_Type" (contained in the first part of the sixth
+ column) in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>
+ is either "<wide>" or "<narrow>" to the "Decomposition_Mapping" of
+ that character (contained in the second part of the sixth column)
+ in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>.
+
+
+
+Resnick & Hoffman Informational [Page 5]
+
+RFC 5895 IDNA Mapping September 2010
+
+
+ o The Unicode Character Database [TR44] has useful descriptions of
+ the contents of these files.
+
+ If the mappings in this document are applied to versions of Unicode
+ later than Unicode 5.2, the later versions of the Unicode Standard
+ should be consulted.
+
+ These form a minimal set of mappings that an application should
+ strongly consider doing. Of course, there are many others that might
+ be done.
+
+3. Implementing This Mapping
+
+ If you are implementing a mapping for an application or operating
+ system by using exactly the four steps in Section 2, the authors of
+ this document have a request: please don't. We mean it. Section 2
+ does not describe a universal mapping algorithm because, as we said,
+ there is no universally-applicable mapping algorithm.
+
+ If you read the material in Section 2 without reading Section 1, go
+ back and carefully read all of Section 1; in many ways, Section 1 is
+ more important than Section 2. Further, you can probably think of
+ user interface considerations that we did not list in Section 1. If
+ you did read Section 1 but somehow decided that the algorithm in
+ Section 2 is completely correct for the intended users of your
+ application or operating system, you are probably not thinking hard
+ enough about your intended users.
+
+4. Security Considerations
+
+ This document suggests creating mappings that might cause confusion
+ for some users while alleviating confusion in other users. Such
+ confusion is not covered in any depth in this document (nor in the
+ other IDNA-related documents).
+
+5. Acknowledgements
+
+ This document is the product of many contributions from numerous
+ people in the IETF.
+
+
+
+
+
+
+
+
+
+
+
+
+Resnick & Hoffman Informational [Page 6]
+
+RFC 5895 IDNA Mapping September 2010
+
+
+6. Normative References
+
+ [IDNA2008protocol] Klensin, J., "Internationalized Domain Names in
+ Applications (IDNA): Protocol", RFC 5891,
+ August 2010.
+
+ [TR44] The Unicode Consortium, "Unicode Technical Report
+ #44: Unicode Character Database", September 2009,
+ <http://www.unicode.org/reports/tr44/
+ tr44-4.html>.
+
+ [Unicode-UAX15] The Unicode Consortium, "Unicode Standard Annex
+ #15: Unicode Normalization Forms, Revision 31",
+ September 2009, <http://www.unicode.org/reports/
+ tr15/tr15-31.html>.
+
+ [Unicode52] The Unicode Consortium. The Unicode Standard,
+ Version 5.2.0, defined by: "The Unicode Standard,
+ Version 5.2.0", (Mountain View, CA: The Unicode
+ Consortium, 2009. ISBN 978-1-936213-00-9).
+ <http://www.unicode.org/versions/Unicode5.2.0/>.
+
+Authors' Addresses
+
+ Peter W. Resnick
+ Qualcomm Incorporated
+ 5775 Morehouse Drive
+ San Diego, CA 92121-1714
+ US
+
+ Phone: +1 858 651 4478
+ EMail: presnick@qualcomm.com
+ URI: http://www.qualcomm.com/~presnick/
+
+
+ Paul Hoffman
+ VPN Consortium
+ 127 Segre Place
+ Santa Cruz, CA 95060
+ US
+
+ Phone: 1-831-426-9827
+ EMail: paul.hoffman@vpnc.org
+
+
+
+
+
+
+
+
+Resnick & Hoffman Informational [Page 7]
+