Diffstat (limited to 'doc/rfc/rfc5894.txt')
-rw-r--r-- doc/rfc/rfc5894.txt 2411
1 files changed, 2411 insertions, 0 deletions
diff --git a/doc/rfc/rfc5894.txt b/doc/rfc/rfc5894.txt
new file mode 100644
index 0000000..1d664e1
--- /dev/null
+++ b/doc/rfc/rfc5894.txt
@@ -0,0 +1,2411 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF) J. Klensin
+Request for Comments: 5894 August 2010
+Category: Informational
+ISSN: 2070-1721
+
+
+ Internationalized Domain Names for Applications (IDNA):
+ Background, Explanation, and Rationale
+
+Abstract
+
+ Several years have passed since the original protocol for
+ Internationalized Domain Names (IDNs) was completed and deployed.
+ During that time, a number of issues have arisen, including the need
+ to update the system to deal with newer versions of Unicode. Some of
+ these issues require tuning of the existing protocols and the tables
+ on which they depend. This document provides an overview of a
+ revised system and provides explanatory material for its components.
+
+Status of This Memo
+
+ This document is not an Internet Standards Track specification; it is
+ published for informational purposes.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Not all documents
+ approved by the IESG are a candidate for any level of Internet
+ Standard; see Section 2 of RFC 5741.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc5894.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Klensin Informational [Page 1]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+Copyright Notice
+
+ Copyright (c) 2010 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+ This document may contain material from IETF Documents or IETF
+ Contributions published or made publicly available before November
+ 10, 2008. The person(s) controlling the copyright in some of this
+ material may not have granted the IETF Trust the right to allow
+ modifications of such material outside the IETF Standards Process.
+ Without obtaining an adequate license from the person(s) controlling
+ the copyright in such materials, this document may not be modified
+ outside the IETF Standards Process, and derivative works of it may
+ not be created outside the IETF Standards Process, except to format
+ it for publication as an RFC or to translate it into languages other
+ than English.
+
+Table of Contents
+
+ 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
+ 1.1. Context and Overview . . . . . . . . . . . . . . . . . . . 4
+ 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5
+ 1.2.1. DNS "Name" Terminology . . . . . . . . . . . . . . . . 5
+ 1.2.2. New Terminology and Restrictions . . . . . . . . . . . 6
+ 1.3. Objectives . . . . . . . . . . . . . . . . . . . . . . . . 6
+ 1.4. Applicability and Function of IDNA . . . . . . . . . . . . 7
+ 1.5. Comprehensibility of IDNA Mechanisms and Processing . . . 8
+ 2. Processing in IDNA2008 . . . . . . . . . . . . . . . . . . . . 9
+ 3. Permitted Characters: An Inclusion List . . . . . . . . . . . 9
+ 3.1. A Tiered Model of Permitted Characters and Labels . . . . 10
+ 3.1.1. PROTOCOL-VALID . . . . . . . . . . . . . . . . . . . . 10
+ 3.1.2. CONTEXTUAL RULE REQUIRED . . . . . . . . . . . . . . . 11
+ 3.1.2.1. Contextual Restrictions . . . . . . . . . . . . . 11
+ 3.1.2.2. Rules and Their Application . . . . . . . . . . . 12
+ 3.1.3. DISALLOWED . . . . . . . . . . . . . . . . . . . . . . 12
+ 3.1.4. UNASSIGNED . . . . . . . . . . . . . . . . . . . . . . 13
+ 3.2. Registration Policy . . . . . . . . . . . . . . . . . . . 14
+
+
+
+
+Klensin Informational [Page 2]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ 3.3. Layered Restrictions: Tables, Context, Registration, and
+ Applications . . . . . . . . . . . . . . . . . . . . . . . 15
+ 4. Application-Related Issues . . . . . . . . . . . . . . . . . . 15
+ 4.1. Display and Network Order . . . . . . . . . . . . . . . . 15
+ 4.2. Entry and Display in Applications . . . . . . . . . . . . 16
+ 4.3. Linguistic Expectations: Ligatures, Digraphs, and
+ Alternate Character Forms . . . . . . . . . . . . . . . . 19
+ 4.4. Case Mapping and Related Issues . . . . . . . . . . . . . 20
+ 4.5. Right-to-Left Text . . . . . . . . . . . . . . . . . . . . 21
+ 5. IDNs and the Robustness Principle . . . . . . . . . . . . . . 22
+ 6. Front-end and User Interface Processing for Lookup . . . . . . 22
+ 7. Migration from IDNA2003 and Unicode Version Synchronization . 25
+ 7.1. Design Criteria . . . . . . . . . . . . . . . . . . . . . 25
+ 7.1.1. Summary and Discussion of IDNA Validity Criteria . . . 25
+ 7.1.2. Labels in Registration . . . . . . . . . . . . . . . . 26
+ 7.1.3. Labels in Lookup . . . . . . . . . . . . . . . . . . . 27
+ 7.2. Changes in Character Interpretations . . . . . . . . . . . 28
+ 7.2.1. Character Changes: Eszett and Final Sigma . . . . . . 28
+ 7.2.2. Character Changes: Zero Width Joiner and Zero
+ Width Non-Joiner . . . . . . . . . . . . . . . . . . . 29
+ 7.2.3. Character Changes and the Need for Transition . . . . 29
+ 7.2.4. Transition Strategies . . . . . . . . . . . . . . . . 30
+ 7.3. Elimination of Character Mapping . . . . . . . . . . . . . 31
+ 7.4. The Question of Prefix Changes . . . . . . . . . . . . . . 31
+ 7.4.1. Conditions Requiring a Prefix Change . . . . . . . . . 31
+ 7.4.2. Conditions Not Requiring a Prefix Change . . . . . . . 32
+ 7.4.3. Implications of Prefix Changes . . . . . . . . . . . . 32
+ 7.5. Stringprep Changes and Compatibility . . . . . . . . . . . 33
+ 7.6. The Symbol Question . . . . . . . . . . . . . . . . . . . 33
+ 7.7. Migration between Unicode Versions: Unassigned Code
+ Points . . . . . . . . . . . . . . . . . . . . . . . . . . 35
+ 7.8. Other Compatibility Issues . . . . . . . . . . . . . . . . 36
+ 8. Name Server Considerations . . . . . . . . . . . . . . . . . . 37
+ 8.1. Processing Non-ASCII Strings . . . . . . . . . . . . . . . 37
+ 8.2. Root and Other DNS Server Considerations . . . . . . . . . 37
+ 9. Internationalization Considerations . . . . . . . . . . . . . 38
+ 10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 38
+ 10.1. IDNA Character Registry . . . . . . . . . . . . . . . . . 38
+ 10.2. IDNA Context Registry . . . . . . . . . . . . . . . . . . 39
+ 10.3. IANA Repository of IDN Practices of TLDs . . . . . . . . . 39
+ 11. Security Considerations . . . . . . . . . . . . . . . . . . . 39
+ 11.1. General Security Issues with IDNA . . . . . . . . . . . . 39
+ 12. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 39
+ 13. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 40
+ 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40
+ 14.1. Normative References . . . . . . . . . . . . . . . . . . . 40
+ 14.2. Informative References . . . . . . . . . . . . . . . . . . 41
+
+
+
+
+Klensin Informational [Page 3]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+1. Introduction
+
+1.1. Context and Overview
+
+ Internationalized Domain Names in Applications (IDNA) is a collection
+ of standards that allow client applications to convert some mnemonic
+ strings expressed in Unicode to an ASCII-compatible encoding form
+ ("ACE") that is a valid DNS label containing only LDH syntax (see the
+ Definitions document [RFC5890]). The specific form of ACE label used
+ by IDNA is called an "A-label". A client can look up an exact
+ A-label in the existing DNS, so A-labels do not require any
+ extensions to DNS, upgrades of DNS servers, or updates to low-level
+ client libraries. An A-label is recognizable from the prefix "xn--"
+ before the characters produced by the Punycode algorithm [RFC3492];
+ thus, a user application can identify an A-label and convert it into
+ Unicode (or some local coded character set) for display.
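+
+ As an illustration of the prefix-plus-Punycode form described above,
+ the following Python sketch converts a label between Unicode and its
+ ASCII-compatible form using the "punycode" codec of the Python
+ standard library. The function names are hypothetical. The sketch
+ performs only the encoding step and applies none of the IDNA2008
+ validity checks, so its output is a true A-label only when the input
+ is already a valid U-label.
+
```python
ACE_PREFIX = "xn--"


def to_ace(label: str) -> str:
    """Encode a Unicode label as "xn--" + Punycode (no validity checks)."""
    if label.isascii():
        return label  # already ASCII; nothing to encode
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace(label: str) -> str:
    """Decode an "xn--" label back to Unicode, e.g. for display."""
    if label[:4].lower() != ACE_PREFIX:  # the prefix is case-insensitive
        return label  # not an ACE label; leave untouched
    return label[4:].encode("ascii").decode("punycode")
```
+
+ For instance, to_ace("bücher") yields "xn--bcher-kva", and from_ace
+ reverses the conversion.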
+
+ On the registry side, IDNA allows a registry to offer
+ Internationalized Domain Names (IDNs) for registration as A-labels.
+ A registry may offer any subset of valid IDNs, and may apply any
+ restrictions or bundling (grouping of similar labels together in one
+ registration) appropriate for the context of that registry.
+ Registration of labels is sometimes discussed separately from lookup,
+ and it is subject to a few specific requirements that do not apply to
+ lookup.
+
+ DNS clients and registries are subject to some differences in
+ requirements for handling IDNs. In particular, registries are urged
+ to register only exact, valid A-labels, while clients might do some
+ mapping to get from otherwise-invalid user input to a valid A-label.
+
+ The first version of IDNA was published in 2003 and is referred to
+ here as IDNA2003 to contrast it with the current version, which is
+ known as IDNA2008 (after the year in which IETF work started on it).
+ IDNA2003 consists of four documents: the IDNA base specification
+ [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep
+ [RFC3454]. The current set of documents, IDNA2008, is not dependent
+ on any of the IDNA2003 specifications other than the one for Punycode
+ encoding. References to "IDNA2008", "these specifications", or
+ "these documents" are to the entire IDNA2008 set listed in a separate
+ Definitions document [RFC5890]. The characters that are valid in
+ A-labels are identified from rules listed in the Tables document
+ [RFC5892], but validity can be derived from the Unicode properties of
+ those characters with a very few exceptions.
+
+ Traditionally, DNS labels are matched case-insensitively (as
+ described in the DNS specifications [RFC1034][RFC1035]). That
+ convention was preserved in IDNA2003 by a case-folding operation that
+
+
+
+Klensin Informational [Page 4]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ generally maps capital letters into lowercase ones. However, if case
+ rules are enforced from one language, another language sometimes
+ loses the ability to treat two characters separately. Case-
+ insensitivity is treated slightly differently in IDNA2008.
+
+ IDNA2003 used Unicode version 3.2 only. In order to keep up with new
+ characters added in new versions of Unicode, IDNA2008 decouples its
+ rules from any particular version of Unicode. Instead, the
+ attributes of new characters in Unicode, supplemented by a small
+ number of exception cases, determine how and whether the characters
+ can be used in IDNA labels.
+
+ This document provides informational context for IDNA2008, including
+ terminology, background, and policy discussions. It contains no
+ normative material; specifications for conformance to the IDNA2008
+ protocols appear entirely in the other documents in the series.
+
+1.2. Terminology
+
+ Terminology for IDNA2008 appears in the Definitions document
+ [RFC5890]. That document also contains a road map to the IDNA2008
+ document collection. No attempt should be made to understand this
+ document without the definitions and concepts that appear there.
+
+1.2.1. DNS "Name" Terminology
+
+ In the context of IDNs, the DNS term "name" has introduced some
+ confusion as people speak of DNS labels in terms of the words or
+ phrases of various natural languages. Historically, many of the
+ "names" in the DNS have been mnemonics to identify some particular
+ concept, object, or organization. They are typically rooted in some
+ language because most people think in language-based ways. But,
+ because they are mnemonics, they need not obey the orthographic
+ conventions of any language: it is not a requirement that it be
+ possible for them to be "words".
+
+ This distinction is important because the reasonable goal of an IDN
+ effort is not to be able to write the great Klingon (or language of
+ one's choice) novel in DNS labels but to be able to form a usefully
+ broad range of mnemonics in ways that are as natural as possible in a
+ very broad range of scripts.
+
+
+
+
+
+
+
+
+
+
+Klensin Informational [Page 5]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+1.2.2. New Terminology and Restrictions
+
+ IDNA2008 introduces new terminology. Precise definitions are
+ provided in the Definitions document for the terms U-label, A-label,
+ LDH label (to which all valid pre-IDNA hostnames conformed), Reserved
+ LDH label (R-LDH label), XN-label, Fake A-label, and Non-Reserved LDH
+ label (NR-LDH label).
+
+ In addition, the term "putative label" has been adopted to refer to a
+ label that may appear to meet certain definitional constraints but
+ has not yet been sufficiently tested for validity.
+
+ These definitions are also illustrated in Figure 1 of the Definitions
+ document. R-LDH labels contain "--" in the third and fourth
+ character positions from the beginning of the label. In IDNA-aware
+ applications, only a subset of these reserved labels is permitted to
+ be used, namely the A-label subset. A-labels are a subset of the
+ R-LDH labels that begin with the case-insensitive string "xn--".
+ Labels that bear this prefix but that are not otherwise valid fall
+ into the "Fake A-label" category. The Non-Reserved labels (NR-LDH
+ labels) are implicitly valid since they do not bear any resemblance
+ to the labels specified by IDNA.
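+
+ The syntactic part of the classification above can be sketched in
+ Python as follows. The function name and the returned strings are
+ hypothetical, and the 63-octet limit is the usual DNS label limit
+ rather than something stated in this section. The sketch cannot tell
+ an A-label from a Fake A-label: that distinction requires the full
+ validity tests of the Protocol document.
+
```python
import re

# LDH syntax: ASCII letters, digits, and hyphen; no leading or trailing
# hyphen; 1 to 63 characters.
_LDH = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def classify_ldh(label: str) -> str:
    """Classify a label by syntax only (no Punycode or IDNA validation)."""
    if not _LDH.fullmatch(label):
        return "not an LDH label"
    if label[2:4] != "--":
        return "NR-LDH label"
    if label[:2].lower() == "xn":
        # Either an A-label or a Fake A-label; needs further validation.
        return "XN-label"
    return "other R-LDH label"
```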
+
+ The creation of the Reserved-LDH category is required for three
+ reasons:
+
+ o to prevent confusion with pre-IDNA coding forms;
+
+ o to permit future extensions that would require changing the
+ prefix, no matter how unlikely those might be (see Section 7.4);
+ and
+
+ o to reduce the opportunities for attacks via the Punycode encoding
+ algorithm itself.
+
+ As with other documents in the IDNA2008 set, this document uses the
+ term "registry" to describe any zone in the DNS. That term, and the
+ terms "zone" or "zone administration", are interchangeable.
+
+1.3. Objectives
+
+ These are the main objectives in revising IDNA.
+
+ o Use a more recent version of Unicode and allow IDNA to be
+ independent of Unicode versions, so that IDNA2008 need not be
+ updated for implementations to adopt code points from new Unicode
+ versions.
+
+
+
+
+Klensin Informational [Page 6]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ o Fix a very small number of code point categorizations that have
+ turned out to cause problems in the communities that use those
+ code points.
+
+ o Reduce the dependency on mapping, in favor of valid A-labels.
+ This will result in pre-mapped forms that are not valid IDNA
+ labels appearing less often in various contexts.
+
+ o Fix some details in the bidirectional code point handling
+ algorithms.
+
+1.4. Applicability and Function of IDNA
+
+ The IDNA specification solves the problem of extending the repertoire
+ of characters that can be used in domain names to include a large
+ subset of the Unicode repertoire.
+
+ IDNA does not extend DNS. Instead, the applications (and, by
+ implication, the users) continue to see an exact-match lookup
+ service. Either there is a single name that matches exactly (subject
+ to the base DNS requirement of case-insensitive ASCII matching) or
+ there is no match. This model has served the existing applications
+ well, but it requires, with or without internationalized domain
+ names, that users know the exact spelling of the domain names that
+ are to be typed into applications such as web browsers and mail user
+ agents. The introduction of the larger repertoire of characters
+ potentially makes the set of misspellings larger, especially given
+ that in some cases the same appearance, for example on a business
+ card, might visually match several Unicode code points or several
+ sequences of code points.
+
+ The IDNA standard does not require any applications to conform to it,
+ nor does it retroactively change those applications. An application
+ can elect to use IDNA in order to support IDNs while maintaining
+ interoperability with existing infrastructure. For applications that
+ want to use non-ASCII characters in public DNS domain names, IDNA is
+ the only option that is defined at the time this specification is
+ published. Adding IDNA support to an existing application entails
+ changes to the application only, and leaves room for flexibility in
+ front-end processing and more specifically in the user interface (see
+ Section 6).
+
+ A great deal of the discussion of IDN solutions has focused on
+ transition issues and how IDNs will work in a world where not all of
+ the components have been updated. Proposals that were not chosen by
+ the original IDN Working Group would have depended on updating user
+ applications, DNS resolvers, and DNS servers in order for a user to
+ apply an internationalized domain name in any form or coding
+
+
+
+Klensin Informational [Page 7]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ acceptable under that method. While processing must be performed
+ prior to or after access to the DNS, IDNA requires no changes to the
+ DNS protocol, any DNS servers, or the resolvers on users' computers.
+
+ IDNA allows the graceful introduction of IDNs not only by avoiding
+ upgrades to existing infrastructure (such as DNS servers and mail
+ transport agents), but also by allowing some limited use of IDNs in
+ applications by using the ASCII-encoded representation of the labels
+ containing non-ASCII characters. While such names are user-
+ unfriendly to read and type, and hence not optimal for user input,
+ they can be used as a last resort to allow rudimentary IDN usage.
+ For example, they might be the best choice for display if it were
+ known that relevant fonts were not available on the user's computer.
+ In order to allow user-friendly input and output of the IDNs and
+ acceptance of some characters as equivalent to those to be processed
+ according to the protocol, the applications need to be modified to
+ conform to this specification.
+
+ This version of IDNA uses the Unicode character repertoire for
+ continuity with the original version of IDNA.
+
+1.5. Comprehensibility of IDNA Mechanisms and Processing
+
+ One goal of IDNA2008, which is aided by the main goal of reducing the
+ dependency on mapping, is to improve the general understanding of how
+ IDNA works and what characters are permitted and what happens to
+ them. Comprehensibility and predictability to users and registrants
+ are important design goals for this effort. End-user applications
+ have an important role to play in increasing this comprehensibility.
+
+ Any system that tries to handle international characters encounters
+ some common problems. For example, a User Interface (UI) cannot
+ display a character if no font containing that character is
+ available. In some cases, internationalization enables effective
+ localization while maintaining some global uniformity but losing some
+ universality.
+
+ It is difficult to even make suggestions as to how end-user
+ applications should cope when characters and fonts are not available.
+ Because display functions are rarely controlled by the types of
+ applications that would call upon IDNA, such suggestions will rarely
+ be very effective.
+
+ Conversion between local character sets and normalized Unicode, if
+ needed, is part of this set of user interface issues. Those
+ conversions introduce complexity in a system that does not use
+ Unicode as its primary (or only) internal character coding system.
+ If a label is converted to a local character set that does not have
+
+
+
+Klensin Informational [Page 8]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ all the needed characters, or that uses different character-coding
+ principles, the user interface program may have to add special logic
+ to avoid or reduce loss of information.
+
+ The major difficulty may lie in accurately identifying the incoming
+ character set and applying the correct conversion routine. Even more
+ difficult, the local character coding system could be based on
+ conceptually different assumptions than those used by Unicode (e.g.,
+ choice of font encodings used for publications in some Indic
+ scripts). Those differences may not easily yield unambiguous
+ conversions or interpretations even if each coding system is
+ internally consistent and adequate to represent the local language
+ and script.
+
+ IDNA2008 shifts responsibility for character mapping and other
+ adjustments from the protocol (where it was located in IDNA2003) to
+ pre-processing before invoking IDNA itself. The intent is that this
+ change will lead to greater usage of fully-valid A-labels or U-labels
+ in display, transit, and storage, which should aid comprehensibility
+ and predictability. A careful look at pre-processing raises issues
+ about what that pre-processing should do and at what point
+ pre-processing becomes harmful; how universally consistent
+ pre-processing algorithms can be; and how to be compatible with
+ labels prepared in an IDNA2003 context. Those issues are discussed
+ in Section 6 and in the Mapping document [IDNA2008-Mapping].
+
+2. Processing in IDNA2008
+
+ IDNA2008 separates Domain Name Registration and Lookup in the
+ protocol specification (RFC 5891, Sections 4 and 5 [RFC5891]).
+ Although most steps in the two processes are similar, the separation
+ reflects current practice in which per-registry (DNS zone)
+ restrictions and special processing are applied at registration time
+ but not during lookup. Another significant benefit is that
+ separation facilitates incremental addition of permitted character
+ groups to avoid freezing on one particular version of Unicode.
+
+ The actual registration and lookup protocols for IDNA2008 are
+ specified in the Protocol document.
+
+3. Permitted Characters: An Inclusion List
+
+ IDNA2008 adopts the inclusion model. A code point is assumed to be
+ invalid for IDN use unless it is included as part of a Unicode
+ property-based rule or, in rare cases, included individually by an
+ exception. When an implementation moves to a new version of Unicode,
+ the rules may indicate new valid code points.
+
+
+
+
+Klensin Informational [Page 9]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ This section provides an overview of the model used to establish the
+ algorithm and character lists of the Tables document [RFC5892] and
+ describes the names and applicability of the categories used there.
+ Note that the inclusion of a character in the PROTOCOL-VALID category
+ group (Section 3.1.1) does not imply that it can be used
+ indiscriminately; some characters are associated with contextual
+ rules that must be applied as well.
+
+ The information given in this section is provided to make the rules,
+ tables, and protocol easier to understand. The normative generating
+ rules that correspond to this informal discussion appear in the
+ Tables document, and the rules that actually determine what labels
+ can be registered or looked up are in the Protocol document.
+
+3.1. A Tiered Model of Permitted Characters and Labels
+
+ Moving to an inclusion model involves a new specification for the
+ list of characters that are permitted in IDNs. In IDNA2003,
+ character validity is independent of context and fixed forever (or
+ until the standard is replaced). However, globally context-
+ independent rules have proved to be impractical because some
+ characters, especially those that are called "Join_Controls" in
+ Unicode, are needed to make reasonable use of some scripts but have
+ no visible effect in others. IDNA2003 prohibited those types of
+ characters entirely by discarding them. We now have a consensus that
+ under some conditions, these "joiner" characters are legitimately
+ needed to allow useful mnemonics for some languages and scripts. In
+ general, context-dependent rules help deal with characters (generally
+ characters that would otherwise be prohibited entirely) that are used
+ differently or perceived differently across different scripts, and
+ allow the standard to be applied more appropriately in cases where a
+ string is not universally handled the same way.
+
+ IDNA2008 divides all possible Unicode code points into four
+ categories: PROTOCOL-VALID, CONTEXTUAL RULE REQUIRED, DISALLOWED, and
+ UNASSIGNED.
+
+3.1.1. PROTOCOL-VALID
+
+ Characters identified as PROTOCOL-VALID (often abbreviated PVALID)
+ are permitted in IDNs. Their use may be restricted by rules about
+ the context in which they appear or by other rules that apply to the
+ entire label in which they are to be embedded. For example, any
+ label that contains a character in this category that has a
+ "right-to-left" property must be used in context with the Bidi rules
+ [RFC5893]. The term PROTOCOL-VALID is used to stress the fact that
+ the presence of a character in this category does not imply that a
+ given registry need accept registrations containing any of the
+
+
+
+Klensin Informational [Page 10]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ characters in the category. Registries are still expected to apply
+ judgment about labels they will accept and to maintain rules
+ consistent with those judgments (see the Protocol document [RFC5891]
+ and Section 3.3).
+
+ Characters that are placed in the PROTOCOL-VALID category are
+ expected to never be removed from it or reclassified. While
+ theoretically characters could be removed from Unicode, such removal
+ would be inconsistent with the Unicode stability principles (see
+ UTR 39: Unicode Security Mechanisms [Unicode52], Appendix F) and
+ hence should never occur.
+
+3.1.2. CONTEXTUAL RULE REQUIRED
+
+ Some characters may be unsuitable for general use in IDNs but
+ necessary for the plausible support of some scripts. The two most
+ commonly cited examples are the ZERO WIDTH JOINER and ZERO WIDTH
+ NON-JOINER characters (ZWJ, U+200D and ZWNJ, U+200C), but other
+ characters may require special treatment because they would otherwise
+ be DISALLOWED (typically because Unicode considers them punctuation
+ or special symbols) but need to be permitted in limited contexts.
+ Other characters are given this special treatment because they pose
+ exceptional danger of being used to produce misleading labels or to
+ cause unacceptable ambiguity in label matching and interpretation.
+
+3.1.2.1. Contextual Restrictions
+
+ Characters with contextual restrictions are identified as CONTEXTUAL
+ RULE REQUIRED and are associated with a rule. The rule defines
+ whether the character is valid in a particular string, and also
+ whether the rule itself is to be applied on lookup as well as
+ registration.
+
+ A distinction is made between characters that indicate or prohibit
+ joining and ones similar to them (known as CONTEXT-JOINER or
+ CONTEXTJ) and other characters requiring contextual treatment
+ (CONTEXT-OTHER or CONTEXTO). Only the former require full testing at
+ lookup time.
+
+ It is important to note that these contextual rules cannot prevent
+ all uses of the relevant characters that might be confusing or
+ problematic. What they are expected to do is to confine
+ applicability of the characters to scripts (and narrower contexts)
+ where zone administrators are knowledgeable enough about the use of
+ those characters to be prepared to deal with them appropriately.
+
+
+
+
+
+
+Klensin Informational [Page 11]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ For example, a registry dealing with an Indic script that requires
+ ZWJ and/or ZWNJ as part of the writing system is expected to
+ understand where the characters have visible effect and where they do
+ not and to make registration rules accordingly. By contrast, a
+ registry dealing primarily with Latin or Cyrillic script might not be
+ actively aware that the characters exist, much less about the
+ consequences of embedding them in labels drawn from those scripts and
+ therefore should avoid accepting registrations containing those
+ characters, at least in labels using characters from the Latin or
+ Cyrillic scripts.
+
+3.1.2.2. Rules and Their Application
+
+ Rules have descriptions such as "Must follow a character from Script
+ XYZ", "Must occur only if the entire label is in Script ABC", or
+ "Must occur only if the previous and subsequent characters have the
+ DFG property". The actual rules may be DEFINED or NULL. If present,
+ they may have values of "True" (character may be used in any position
+ in any label), "False" (character may not be used in any label), or
+ may be a set of procedural rules that specify the context in which
+ the character is permitted.
+
+ Because it is easier to identify these characters than to know that
+ they are actually needed in IDNs or how to establish exactly the
+ right rules for each one, a rule may have a null value in a given
+ version of the tables. Characters associated with null rules are not
+ permitted to appear in putative labels for either registration or
+ lookup. Of course, a later version of the tables might contain a
+ non-null rule.
+
+ The actual rules and their descriptions are in Sections 2 and 3 of
+ the Tables document [RFC5892]. That document also specifies the
+ creation of a registry for future rules.
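+
+ To make the idea of a procedural contextual rule concrete, the
+ following Python sketch tests one rule of the "previous character must
+ have a given property" kind for ZERO WIDTH JOINER (U+200D). The
+ condition used here (the preceding character has canonical combining
+ class Virama, value 9) is an assumption based on a reading of the
+ Tables document and should be checked against it; the function name is
+ hypothetical. It is an illustration, not a substitute for the
+ normative rules.
+
```python
import unicodedata

ZWJ = "\u200d"
VIRAMA_CCC = 9  # canonical combining class value named "Virama"


def zwj_rule_ok(label: str) -> bool:
    """Return True if every ZWJ in the label directly follows a virama."""
    for i, ch in enumerate(label):
        if ch != ZWJ:
            continue
        # A ZWJ at the start of the label has no preceding character.
        if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
            return False
    return True
```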
+
+3.1.3. DISALLOWED
+
+ Some characters are inappropriate for use in IDNs and are thus
+ excluded for both registration and lookup (i.e., IDNA-conforming
+ applications performing name lookup should verify that these
+ characters are absent; if they are present, the label strings should
+ be rejected rather than converted to A-labels and looked up). Some of
+ these characters are problematic for use in IDNs (such as the
+ FRACTION SLASH character, U+2044), while some of them (such as the
+ various HEART symbols, e.g., U+2665, U+2661, and U+2765, see
+ Section 7.6) simply fall outside the conventions for typical
+ identifiers (basically letters and numbers).
+
+
+
+
+
+Klensin Informational [Page 12]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ Of course, this category would include code points that had been
+ removed entirely from Unicode should such removals ever occur.
+
+ Characters that are placed in the DISALLOWED category are expected to
+ never be removed from it or reclassified. If a character is
+ classified as DISALLOWED in error and the error is sufficiently
+ problematic, the only recourse would be either to introduce a new
+ code point into Unicode and classify it as PROTOCOL-VALID or for the
+ IETF to accept the considerable costs of an incompatible change and
+ replace the relevant RFC with one containing appropriate exceptions.
+
+ There is provision for exception cases but, in general, characters
+ are placed into DISALLOWED if they fall into one or more of the
+ following groups:
+
+ o The character is a compatibility equivalent for another character.
+ In slightly more precise Unicode terms, application of
+ Normalization Form KC (NFKC) to the character yields some other
+ character.
+
+ o The character is an uppercase form or some other form that is
+ mapped to another character by Unicode case folding.
+
+ o The character is a symbol or punctuation form or, more generally,
+ something that is not a letter, digit, or a mark that is used to
+ form a letter or digit.
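+
+ The three groups above can be approximated with the Unicode data that
+ ships with Python, as in the sketch below. The function name and the
+ reason strings are hypothetical. This is only a rough heuristic: the
+ normative rules, including the exception cases mentioned above, are in
+ the Tables document and are not modelled here, so the sketch will
+ flag some characters that those rules permit.
+
```python
import unicodedata
from typing import List


def disallowed_reasons(ch: str) -> List[str]:
    """List which of the three informal groups a single character is in."""
    reasons = []
    # Group 1: NFKC normalization yields some other character.
    if unicodedata.normalize("NFKC", ch) != ch:
        reasons.append("compatibility equivalent")
    # Group 2: Unicode case folding maps it to something else.
    if ch.casefold() != ch:
        reasons.append("changed by case folding")
    # Group 3: not a letter (L*), mark (M*), or decimal digit (Nd).
    cat = unicodedata.category(ch)
    if not (cat[0] in "LM" or cat == "Nd"):
        reasons.append("not a letter, digit, or mark")
    return reasons
```
+
+ An empty result means only that none of the three heuristics fired; it
+ does not establish that the character is PROTOCOL-VALID.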
+
+3.1.4. UNASSIGNED
+
+ For convenience in processing and table-building, code points that do
+ not have assigned values in a given version of Unicode are treated as
+ belonging to a special UNASSIGNED category. Such code points are
+ prohibited in labels to be registered or looked up. The category
+ differs from DISALLOWED in that code points are moved out of it by
+ the simple expedient of being assigned in a later version of Unicode
+ (at which point, they are classified into one of the other categories
+ as appropriate).
+
+ The rationale for restricting the processing of UNASSIGNED characters
+ is simply that the properties of such code points cannot be
+ completely known until actual characters are assigned to them. For
+ example, assume that an UNASSIGNED code point were included in a
+ label to be looked up. Assume that the code point was later assigned
+ to a character that required some set of contextual rules. With that
+ combination, un-updated instances of IDNA-aware software might permit
+ lookup of labels containing the previously unassigned characters
+ while updated versions of the software might restrict use of the same
+ label in lookup, depending on the contextual rules. It should be
+ clear that under no circumstance should an UNASSIGNED character be
+ permitted in a label to be registered as part of a domain name.
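+
+ A rough way to see which code points are unassigned in a given
+ environment is to look for general category "Cn" in the Unicode data
+ that environment provides. The helper names below are illustrative,
+ and the check is an approximation: "Cn" also covers noncharacters,
+ which the Tables document [RFC5892] treats separately.
+

```python
import unicodedata


def is_unassigned(ch):
    # "Cn" means no character is assigned to this code point in the
    # Unicode version bundled with this Python build
    # (see unicodedata.unidata_version).
    return unicodedata.category(ch) == "Cn"


def first_unassigned(label):
    """Return the index of the first unassigned code point, or None."""
    for index, ch in enumerate(label):
        if is_unassigned(ch):
            return index
    return None
```

+ Because the answer depends on the Unicode version in use, two
+ installations can disagree about the same label, which is the
+ situation the paragraph above warns about.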
+
+3.2. Registration Policy
+
+ While these recommendations cannot and should not define registry
+ policies, registries should develop and apply additional restrictions
+ as needed to reduce confusion and other problems. For example, it is
+ generally believed that labels containing characters from more than
+ one script are a bad practice although there may be some important
+ exceptions to that principle. Some registries may choose to restrict
+ registrations to characters drawn from a very small number of
+ scripts. For many scripts, the use of variant techniques such as
+ those described in the JET specification for the CJK script
+ [RFC3743] and its generalization [RFC4290], and illustrated for
+ Chinese by the tables provided by the Chinese Domain Name Consortium
+ [RFC4713] may be helpful in reducing problems that might be perceived
+ by users.
+
+ In general, users will benefit if registries only permit characters
+ from scripts that are well-understood by the registry or its
+ advisers. If a registry decides to reduce opportunities for
+ confusion by constructing policies that disallow characters used in
+ historic writing systems or characters whose use is restricted to
+ specialized, highly technical contexts, some relevant information may
+ be found in Section 2.4 (Specific Character Adjustments) of Unicode
+ Identifier and Pattern Syntax [Unicode-UAX31], especially Table 4
+ (Candidate Characters for Exclusion from Identifiers), and Section
+ 3.1 (General Security Profile for Identifiers) in Unicode Security
+ Mechanisms [Unicode-UTS39].
+
+ The requirement (in Section 4.1 of the Protocol document [RFC5891])
+ that registration procedures use only U-labels and/or A-labels is
+ intended to ensure that registrants are fully aware of exactly what
+ is being registered as well as encouraging use of those canonical
+ forms. That provision should not be interpreted as requiring that
+ registrants need to provide characters in a particular code sequence.
+ Registrant input conventions and management are part of registrant-
+ registrar interactions and relationships between registries and
+ registrars and are outside the scope of these standards.
+
+ It is worth stressing that these principles of policy development and
+ application apply at all levels of the DNS, not only, e.g., top level
+ domain (TLD) or second level domain (SLD) registrations. Even a
+ trivial, "anything is permitted that is valid under the protocol"
+ policy is useful in that it helps users and application developers
+ know what to expect.
+
+3.3. Layered Restrictions: Tables, Context, Registration, and
+ Applications
+
+ The character rules in IDNA2008 are based on the realization that
+ there is no single magic bullet for any of the security,
+ confusability, or other issues associated with IDNs. Instead, the
+ specifications define a variety of approaches. The character tables
+ are the first mechanism, protocol rules about how those characters
+ are applied or restricted in context are the second, and those two in
+ combination constitute the limits of what can be done in the
+ protocol. As discussed in the previous section (Section 3.2),
+ registries are expected to restrict what they permit to be
+ registered, devising and using rules that are designed to optimize
+ the balance between confusion and risk on the one hand and maximum
+ expressiveness in mnemonics on the other.
+
+ In addition, there is an important role for user interface programs
+ in warning against label forms that appear problematic given their
+ knowledge of local contexts and conventions. Of course, no approach
+ based on naming or identifiers alone can protect against all threats.
+
+4. Application-Related Issues
+
+4.1. Display and Network Order
+
+ Domain names are always transmitted in network order (the order in
+ which the code points are sent in protocols), but they may have a
+ different display order (the order in which the code points are
+ displayed on a screen or paper). When a domain name contains
+ characters that are normally written right to left, display order may
+ be affected although network order is not. It gets even more
+ complicated if left-to-right and right-to-left labels are adjacent to
+ each other within a domain name. The decision about the display
+ order is ultimately under the control of user agents -- including Web
+ browsers, mail clients, hosted Web applications and many more --
+ which may be highly localized. Should a domain name abc.def, in
+ which both labels are represented in scripts that are written right
+ to left, be displayed as fed.cba or cba.fed? Applications that are
+ in deployment today are already diverse, and one can find examples of
+ either choice.
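+
+ The two candidate display orders can be produced mechanically. The
+ sketch below uses the same ASCII stand-ins as the text ("abc.def"
+ standing for two right-to-left labels); the function names are
+ illustrative, and real rendering is governed by the display software
+ rather than by simple string reversal.
+

```python
def reverse_whole_name(name):
    # Treat the entire name, dots included, as one right-to-left run.
    return name[::-1]


def reverse_each_label(name):
    # Keep the labels in network order but draw each one right to left.
    return ".".join(label[::-1] for label in name.split("."))


if __name__ == "__main__":
    print(reverse_whole_name("abc.def"))   # fed.cba
    print(reverse_each_label("abc.def"))   # cba.fed
```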
+
+ The picture changes once again when an IDN appears in an
+ Internationalized Resource Identifier (IRI) [RFC3987]. An IRI or
+ internationalized email address contains elements other than the
+ domain name. For example, IRIs contain protocol identifiers and
+ field delimiter syntax such as "http://" or "mailto:" while email
+ addresses contain the "@" to separate local parts from domain names.
+
+ An IRI in network order begins with "http://" followed by domain
+ labels in network order, thus "http://abc.def".
+
+ User interface programs are not required to display and allow input
+ of IRIs directly but often do so. Implementers have to choose
+ whether the overall direction of these strings will always be left to
+ right (or right to left) for an IRI or email address. The natural
+ order for a user typing a domain name on a right-to-left system is
+ fed.cba. Should the right-to-left (RTL) user interface reverse the
+ entire domain name each time a domain name is typed? Does this
+ change if the user types "http://" right before typing a domain name,
+ thus implying that the user is beginning at the beginning of the
+ network-order IRI? Experience in the 1980s and 1990s with mixing
+ systems in which domain name labels were read in network order (left
+ to right) and those in which those labels were read right to left
+ would predict a great deal of confusion.
+
+ If each implementation of each application makes its own decisions on
+ these issues, users will develop heuristics that will sometimes fail
+ when switching applications. However, while some display order
+ conventions, voluntarily adopted, would be desirable to reduce
+ confusion, such suggestions are beyond the scope of these
+ specifications.
+
+4.2. Entry and Display in Applications
+
+ Applications can accept and display domain names using any character
+ set or character coding system. The IDNA protocol does not
+ necessarily affect the interface between users and applications. An
+ IDNA-aware application can accept and display internationalized
+ domain names in two formats: as the internationalized character
+ set(s) supported by the application (i.e., an appropriate local
+ representation of a U-label) and as an A-label. Applications may
+ allow the display of A-labels, but are encouraged not to do so except
+ as an interface for special purposes, possibly for debugging, or to
+ cope with display limitations. In general, they should allow, but
+ not encourage, user input of A-labels. A-labels are opaque and ugly,
+ and malicious variations on them are not easily detected by users.
+ Where possible, they should thus only be exposed when they are
+ absolutely needed. Because IDN labels can be rendered either as
+ A-labels or U-labels, the application may reasonably have an option
+ for the user to select the preferred method of display. Rendering
+ the U-label should normally be the default.
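+
+ The relationship between the two forms can be illustrated with
+ Python's built-in "punycode" codec. This is a minimal sketch with
+ illustrative function names: it performs only the encoding step and
+ none of the validity checks that the Protocol document [RFC5891]
+ requires before a string may be called a U-label or an A-label.
+

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label):
    """Punycode-encode one label and add the ACE prefix (no validation)."""
    if u_label.isascii():
        return u_label  # an all-ASCII label needs no encoding
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label):
    """Decode one ACE-prefixed label back to Unicode (no validation)."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


if __name__ == "__main__":
    print(to_a_label("b\u00fccher"))      # xn--bcher-kva
    print(to_u_label("xn--bcher-kva"))
```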
+
+ Domain names are often stored and transported in many places. For
+ example, they are part of documents such as mail messages and web
+ pages. They are transported in many parts of many protocols, such as
+ both the control commands of SMTP and associated message body parts,
+ and in the headers and the body content in HTTP. It is important to
+ remember that domain names appear both in domain name slots and in
+ the content that is passed over protocols, and it would be helpful if
+ protocols explicitly define what their domain name slots are.
+
+ In protocols and document formats that define how to handle
+ specification or negotiation of charsets, labels can be encoded in
+ any charset allowed by the protocol or document format. If a
+ protocol or document format only allows one charset, the labels must
+ be given in that charset. Of course, not all charsets can properly
+ represent all labels. If a U-label cannot be displayed in its
+ entirety, the only choice (without loss of information) may be to
+ display the A-label.
+
+ Where a protocol or document format allows IDNs, labels should be in
+ whatever character encoding and escape mechanism the protocol or
+ document format uses in the local environment. This provision is
+ intended to prevent situations in which, e.g., UTF-8 domain names
+ appear embedded in text that is otherwise in some other character
+ coding.
+
+ All protocols that use domain name slots (see Section 2.3.2.6 in the
+ Definitions document [RFC5890]) already have the capacity for
+ handling domain names in the ASCII charset. Thus, A-labels can
+ inherently be handled by those protocols.
+
+ IDNA2008 does not specify required mappings between one character or
+ code point and others. An extended discussion of mapping issues
+ appears in Section 6 and specific recommendations appear in the
+ Mapping document [IDNA2008-Mapping]. In general, IDNA2008 prohibits
+ characters that would be mapped to others by normalization or other
+ rules. As examples, while mathematical characters based on Latin
+ ones are accepted as input to IDNA2003, they are prohibited in
+ IDNA2008. Similarly, uppercase characters, double-width characters,
+ and other variations are prohibited as IDNA input although mapping
+ them as needed in user interfaces is strongly encouraged.
+
+ Since the rules in the Tables document [RFC5892] have the effect that
+ only strings that are not transformed by NFKC are valid, if an
+ application chooses to perform NFKC normalization before lookup, that
+ operation is safe since this will never make the application unable
+ to look up any valid string. However, as discussed above, the
+ application cannot guarantee that any other application will perform
+ that mapping, so it should be used only with caution and for informed
+ users.
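+
+ As a sketch of why that step is safe, the following applies NFKC
+ with Python's unicodedata module. The function name is illustrative.
+ A string that is already stable under NFKC comes back unchanged,
+ while a full-width form is mapped to its ordinary counterpart. Note
+ that NFKC does not fold case, so uppercase input is left as it was.
+

```python
import unicodedata


def nfkc_before_lookup(label):
    # Optional local mapping applied before the label is handed to IDNA.
    return unicodedata.normalize("NFKC", label)


if __name__ == "__main__":
    print(nfkc_before_lookup("b\u00fccher") == "b\u00fccher")  # already stable
    print(nfkc_before_lookup("\uff41\uff42\uff43"))            # full-width "abc"
    print(nfkc_before_lookup("ABC"))                           # case is untouched
```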
+
+ In many cases, these prohibitions should have no effect on what the
+ user can type as input to the lookup process. It is perfectly
+ reasonable for systems that support user interfaces to perform some
+ character mapping that is appropriate to the local environment. This
+ would normally be done prior to actual invocation of IDNA. At least
+ conceptually, the mapping would be part of the Unicode conversions
+ discussed above and in the Protocol document [RFC5891]. However,
+ those changes will be local ones only -- local to environments in
+ which users will clearly understand that the character forms are
+ equivalent. For use in interchanges among systems, it appears to be
+ much more important that U-labels and A-labels can be mapped back and
+ forth without loss of information.
+
+ One specific, and very important, instance of this strategy arises
+ with case folding. In the ASCII-only DNS, names are looked up and
+ matched in a case-independent way, but no actual case folding occurs.
+ Names can be placed in the DNS in either uppercase or lowercase form
+ (or any mixture of them) and that form is preserved, returned in
+ queries, and so on. IDNA2003 approximated that behavior for
+ non-ASCII strings by performing case folding at registration time
+ (resulting in only lowercase IDNs in the DNS) and when names were
+ looked up.
+
+ As suggested earlier in this section, it appears to be desirable to
+ do as little character mapping as possible as long as Unicode works
+ correctly (e.g., Normalization Form C (NFC) mapping to resolve
+ different codings for the same character is still necessary although
+ the specifications require that it be performed prior to invoking the
+ protocol) in order to make the mapping between A-labels and U-labels
+ idempotent. Case mapping is not an exception to this principle. If
+ only lowercase characters can be registered in the DNS (i.e., be
+ present in a U-label), then IDNA2008 should prohibit uppercase
+ characters as input even though user interfaces to applications
+ should probably map those characters. Some other considerations
+ reinforce this conclusion. For example, in ASCII case mapping for
+ individual characters, uppercase(character) is always equal to
+ uppercase(lowercase(character)). That may not be true with IDNs. In
+ some scripts that use case distinctions, there are a few characters
+ that do not have counterparts in one case or the other. The
+ relationship between uppercase and lowercase may even be language-
+ dependent, with different languages (or even the same language in
+ different areas) expecting different mappings. User interface
+ programs can meet the expectations of users who are accustomed to the
+ case-insensitive DNS environment by performing case folding prior to
+ IDNA processing, but the IDNA procedures themselves should neither
+ require such mappings nor expect them when they are not natural to the
+ localized environment.
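+
+ The uppercase/lowercase identity mentioned above can be checked
+ directly. The sketch below uses Python's default (language-neutral)
+ case mappings and an illustrative helper name. It confirms the
+ identity for ASCII and shows one character, U+1E9E LATIN CAPITAL
+ LETTER SHARP S, for which it fails.
+

```python
def upper_roundtrip_holds(ch):
    # True when uppercase(lowercase(ch)) equals uppercase(ch).
    return ch.lower().upper() == ch.upper()


# The identity holds for every ASCII code point.
ascii_ok = all(upper_roundtrip_holds(chr(cp)) for cp in range(128))

# U+1E9E lowercases to U+00DF, whose uppercase mapping is "SS",
# so the round trip does not return to the starting character.
capital_sharp_s_ok = upper_roundtrip_holds("\u1e9e")

if __name__ == "__main__":
    print(ascii_ok, capital_sharp_s_ok)
```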
+
+4.3. Linguistic Expectations: Ligatures, Digraphs, and Alternate
+ Character Forms
+
+ Users have expectations about character matching or equivalence that
+ are based on their own languages and the orthography of those
+ languages. These expectations may not always be met in a global
+ system, especially if multiple languages are written using the same
+ script but using different conventions. Some examples:
+
+ o A Norwegian user might expect a label with the ae-ligature to be
+ treated as the same label as one using the Swedish spelling with
+ a-diaeresis even though applying that mapping to English would be
+ astonishing to users.
+
+ o A German user might expect a label with an o-umlaut and a label
+ that had "oe" substituted, but was otherwise the same, to be
+ treated as equivalent even though that substitution would be a
+ clear error in Swedish.
+
+ o A Chinese user might expect automatic matching of Simplified and
+ Traditional Chinese characters, but applying that matching for
+ Korean or Japanese text would create considerable confusion.
+
+ o An English user might expect "theater" and "theatre" to match.
+
+ A number of languages use alphabetic scripts in which single phonemes
+ are written using two characters, termed a "digraph", for example,
+ the "ph" in "pharmacy" and "telephone". (Such characters can also
+ appear consecutively without forming a digraph, as in "tophat".)
+ Certain digraphs may be indicated typographically by setting the two
+ characters closer together than they would be if used consecutively
+ to represent different phonemes. Some digraphs are fully joined as
+ ligatures. For example, the word "encyclopaedia" is sometimes set
+ with a U+00E6 LATIN SMALL LIGATURE AE. When ligature and digraph
+ forms have the same interpretation across all languages that use a
+ given script, application of Unicode normalization generally resolves
+ the differences and causes them to match. When they have different
+ interpretations, matching must utilize other methods, presumably
+ chosen at the registry level, or users must be educated to understand
+ that matching will not occur.
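+
+ The difference between the two situations can be seen with
+ compatibility normalization. In this sketch (the helper name and the
+ choice of NFKC are assumptions made for illustration), the "fi"
+ ligature U+FB01 normalizes to the two-letter sequence, while U+00E6
+ is left alone and therefore does not match "ae".
+

```python
import unicodedata


def match_after_nfkc(left, right):
    # Compare two strings after applying NFKC to both.
    nfkc = lambda text: unicodedata.normalize("NFKC", text)
    return nfkc(left) == nfkc(right)


if __name__ == "__main__":
    print(match_after_nfkc("\ufb01le", "file"))                     # ligature resolved
    print(match_after_nfkc("encyclop\u00e6dia", "encyclopaedia"))   # no match
```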
+
+ The nature of the problem can be illustrated by many words in the
+ Norwegian language, where the "ae" ligature is the 27th letter of a
+ 29-letter extended Latin alphabet. It is equivalent to the 28th
+ letter of the Swedish alphabet (also containing 29 letters),
+ U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, for which an "ae" cannot
+ be substituted according to current orthographic standards. That
+ character (U+00E4) is also part of the German alphabet where, unlike
+ in the Nordic languages, the two-character sequence "ae" is usually
+ treated as a fully acceptable alternate orthography for the "umlauted
+ a" character. The inverse is however not true, and those two
+ characters cannot necessarily be combined into an "umlauted a". This
+ also applies to another German character, the "umlauted o"
+ (U+00F6 LATIN SMALL LETTER O WITH DIAERESIS) which, for example,
+ cannot be used for writing the name of the author "Goethe". It is
+ also a letter in the Swedish alphabet where, like the "a with
+ diaeresis", it cannot be correctly represented as "oe" and in the
+ Norwegian alphabet, where it is represented, not as "o with
+ diaeresis", but as "slashed o", U+00F8.
+
+ Some of the ligatures that have explicit code points in Unicode were
+ given special handling in IDNA2003 and now pose additional problems
+ in transition. See Section 7.2.
+
+ Additional cases with alphabets written right to left are described
+ in Section 4.5.
+
+ Matching and comparison algorithm selection often requires
+ information about the language being used, context, or both --
+ information that is not available to IDNA or the DNS. Consequently,
+ IDNA2008 makes no attempt to treat combined characters in any special
+ way. A registry that is aware of the language context in which
+ labels are to be registered, and where that language sometimes (or
+ always) treats the two-character sequences as equivalent to the
+ combined form, should give serious consideration to applying a
+ "variant" model [RFC3743][RFC4290] or to prohibiting registration of
+ one of the forms entirely, to reduce the opportunities for user
+ confusion and fraud that would result from the related strings being
+ registered to different parties.
+
+4.4. Case Mapping and Related Issues
+
+ In the DNS, ASCII letters are stored with their case preserved.
+ Matching during the query process is case-independent, but none of
+ the information that might be represented by choices of case has been
+ lost. That model has been accidentally helpful because, as people
+ have created DNS labels by concatenating words (or parts of words) to
+ form labels, case has often been used to distinguish among components
+ and make the labels more memorable.
+
+ Since DNS servers do not get involved in parsing IDNs, they cannot do
+ case-independent matching. Thus, keeping the cases separate in
+ lookup or registration, and doing matching at the server, is not
+ feasible with IDNA or any similar approach. Matching of characters
+ that are considered to differ only by case must be done, if desired,
+ by programs invoking IDNA lookup even though it wasn't done by ASCII-
+ only DNS clients. That situation was recognized in IDNA2003 and
+ nothing in IDNA2008 fundamentally changes it or could do so. In
+ IDNA2003, all characters are case folded and mapped by clients in a
+ standardized step.
+
+ Even in scripts that generally support case distinctions, some
+ characters do not have uppercase forms. For example, the Unicode
+ case-folding operation maps Greek Final Form Sigma (U+03C2) to the
+ medial form (U+03C3) and maps Eszett (German Sharp S, U+00DF) to
+ "ss". Neither of these mappings is reversible because the uppercase
+ of U+03C3 is the uppercase Sigma (U+03A3) and "ss" is an ASCII
+ string. IDNA2008 permits, at the risk of some incompatibility,
+ slightly more flexibility in this area by avoiding case folding and
+ treating these characters as themselves. Approaches to handling one-
+ way mappings are discussed in Section 7.2.
+
+ Because IDNA2003 maps Final Sigma and Eszett to other characters, and
+ the reverse mapping is never possible, neither Final Sigma nor Eszett
+ can be represented in the ACE form of an IDNA2003 IDN nor in the native
+ character (U-label) form derived from it. With IDNA2008, both
+ characters can be used in an IDN and so the A-label used for lookup
+ for any U-label containing those characters is now different. See
+ Section 7.1 for a discussion of what kinds of changes might require
+ the IDNA prefix to change; after extended discussions, the IDNABIS
+ Working Group came to consensus that the change for these characters
+ did not justify a prefix change.
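+
+ The one-way nature of these mappings, and the resulting difference
+ in what gets looked up, can be sketched as follows. The names are
+ illustrative, str.casefold() stands in for the IDNA2003 mapping
+ step, and to_a_label performs only the Punycode encoding with no
+ validation.
+

```python
FINAL_SIGMA = "\u03c2"
MEDIAL_SIGMA = "\u03c3"
ESZETT = "\u00df"


def to_a_label(u_label):
    """Punycode-encode one label and add the ACE prefix (no validation)."""
    if u_label.isascii():
        return u_label
    return "xn--" + u_label.encode("punycode").decode("ascii")


# Case folding replaces both characters, and uppercasing the results
# does not lead back to the originals.
folded_sigma = FINAL_SIGMA.casefold()      # medial sigma, U+03C3
folded_eszett = ESZETT.casefold()          # the ASCII string "ss"
upper_of_medial = MEDIAL_SIGMA.upper()     # capital sigma, U+03A3

# Folding first yields a plain ASCII label; keeping Eszett as itself
# yields a different, ACE-prefixed label.
folded_first = to_a_label("stra\u00dfe".casefold())
kept_as_is = to_a_label("stra\u00dfe")
```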
+
+4.5. Right-to-Left Text
+
+ In order to be sure that the directionality of right-to-left text is
+ unambiguous, IDNA2003 required that any label in which right-to-left
+ characters appear both starts and ends with them and that it does not
+ include any characters with strong left-to-right properties (that
+ excludes other alphabetic characters but permits European digits).
+ Any other string that contains a right-to-left character and does not
+ meet those requirements is rejected. This is one of the few places
+ where the IDNA algorithms (both in IDNA2003 and in IDNA2008) examine
+ an entire label, not just individual characters. The algorithmic
+ model used in IDNA2003 rejects the label when the final character in
+ a right-to-left string requires a combining mark in order to be
+ correctly represented.
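+
+ A minimal sketch of the IDNA2003 requirement as described above,
+ using the bidirectional classes from Python's unicodedata module.
+ The function name is illustrative, and the classes come from the
+ Unicode version bundled with Python rather than the tables IDNA2003
+ was defined against. The second example shows a label being rejected
+ because it ends in a combining mark (class NSM).
+

```python
import unicodedata

RIGHT_TO_LEFT = {"R", "AL"}


def idna2003_bidi_ok(label):
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not any(cls in RIGHT_TO_LEFT for cls in classes):
        return True   # no right-to-left characters, so the rule does not apply
    if "L" in classes:
        return False  # strong left-to-right characters may not be mixed in
    # The label must both start and end with a right-to-left character.
    return classes[0] in RIGHT_TO_LEFT and classes[-1] in RIGHT_TO_LEFT


if __name__ == "__main__":
    print(idna2003_bidi_ok("\u05d0\u05d1"))    # two Hebrew letters
    print(idna2003_bidi_ok("\u05d0\u05b7"))    # Hebrew letter plus point
    print(idna2003_bidi_ok("\u05d01\u05d1"))   # European digit in the middle
```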
+
+ That prohibition is not acceptable for writing systems for languages
+ written with consonantal alphabets to which diacritical vocalic
+ systems are applied, and for languages with orthographies derived
+ from them where the combining marks may have different functionality.
+ In both cases, the combining marks can be essential components of the
+ orthography. Examples of this are Yiddish, written with an extended
+ Hebrew script, and Dhivehi (the official language of Maldives), which
+ is written in the Thaana script (which is, in turn, derived from the
+ Arabic script). IDNA2008 removes the restriction on final combining
+ characters with a new set of rules for right-to-left scripts and
+ their characters. Those new rules are specified in the Bidi document
+ [RFC5893].
+
+5. IDNs and the Robustness Principle
+
+ The "Robustness Principle" is often stated as "Be conservative about
+ what you send and liberal in what you accept" (see, e.g., Section
+ 1.2.2 of the applications-layer Host Requirements specification
+ [RFC1123]). This principle applies to IDNA. In applying the
+ principle to registries as the source ("sender") of all registered
+ and useful IDNs, registries are responsible for being conservative
+ about what they register and put out in the Internet. For IDNs to
+ work well, zone administrators (registries) must have and require
+ sensible policies about what is registered -- conservative policies
+ -- and implement and enforce them.
+
+ Conversely, lookup applications are expected to reject labels that
+ clearly violate global (protocol) rules (no one has ever seriously
+ claimed that being liberal in what is accepted requires being
+ stupid). However, once one gets past such global rules and deals
+ with anything sensitive to script or locale, it is necessary to
+ assume that garbage has not been placed into the DNS, i.e., one must
+ be liberal about what one is willing to look up in the DNS rather
+ than guessing about whether it should have been permitted to be
+ registered.
+
+ If a string cannot be successfully found in the DNS after the lookup
+ processing described here, it makes no difference whether it simply
+ wasn't registered or was prohibited by some rule at the registry.
+ Application implementers should be aware that where DNS wildcards are
+ used, the ability to successfully resolve a name does not guarantee
+ that it was actually registered.
+
+6. Front-end and User Interface Processing for Lookup
+
+ Domain names may be identified and processed in many contexts. They
+ may be typed in by users themselves or embedded in an identifier such
+ as an email address, URI, or IRI. They may occur in running text or
+ be processed by one system after being provided in another. Systems
+ may try to normalize URLs to determine (or guess) whether a reference
+ is valid or if two references point to the same object without
+ actually looking the objects up (comparison without lookup is
+ necessary for URI types that are not intended to be resolved). Some
+ of these goals may be more easily and reliably satisfied than others.
+
+ While there are strong arguments for any domain name that is placed
+ "on the wire" -- transmitted between systems -- to be in the zero-
+ ambiguity forms of A-labels, it is inevitable that programs that
+ process domain names will encounter U-labels or variant forms.
+
+ An application that implements the IDNA protocol [RFC5891] will
+ always take any user input and convert it to a set of Unicode code
+ points. That user input may be acquired by any of several different
+ input methods, all with differing conversion processes to be taken
+ into consideration (e.g., typed on a keyboard, written by hand onto
+ some sort of digitizer, spoken into a microphone and interpreted by a
+ speech-to-text engine, etc.). The process of taking any particular
+ user input and mapping it into a Unicode code point may be a simple
+ one: if a user strikes the "A" key on a US English keyboard, without
+ any modifiers such as the "Shift" key held down, in order to draw a
+ Latin small letter A ("a"), many (perhaps most) modern operating
+ system input methods will deliver to the calling application the code
+ point U+0061, encoded in a single octet.
+
+ Sometimes the process is somewhat more complicated: a user might
+ strike a particular set of keys to represent a combining macron
+ followed by striking the "A" key in order to draw a Latin small
+ letter A with a macron above it. Depending on the operating system,
+ the input method chosen by the user, and even the parameters with
+ which the application communicates with the input method, the result
+ might be the code point U+0101 (encoded as two octets in UTF-8 or
+ UTF-16, four octets in UTF-32, etc.), the code point U+0061 followed
+ by the code point U+0304 (again, encoded in three or more octets,
+ depending upon the encoding used) or even the code point U+FF41
+ followed by the code point U+0304 (and encoded in some form). These
+ examples leave aside the issue of operating systems and input methods
+ that do not use Unicode code points for their character set.
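+
+ The three possible results described above, their encoded sizes, and
+ their behavior under normalization can be checked with a short
+ sketch. The variable and function names are illustrative; the
+ "-le" encodings are used so that no byte order mark is counted.
+

```python
import unicodedata

PRECOMPOSED = "\u0101"        # a with macron as a single code point
COMBINING = "a\u0304"         # U+0061 followed by combining macron
FULLWIDTH = "\uff41\u0304"    # full-width a followed by combining macron


def octets(text, encoding):
    return len(text.encode(encoding))


sizes = {
    name: [octets(text, enc) for enc in ("utf-8", "utf-16-le", "utf-32-le")]
    for name, text in (("precomposed", PRECOMPOSED),
                       ("combining", COMBINING),
                       ("fullwidth", FULLWIDTH))
}

# NFC unifies the first two forms; the full-width form needs NFKC.
nfc_combining = unicodedata.normalize("NFC", COMBINING)
nfc_fullwidth = unicodedata.normalize("NFC", FULLWIDTH)
nfkc_fullwidth = unicodedata.normalize("NFKC", FULLWIDTH)
```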
+
+ In every case, applications (with the help of the operating systems
+ on which they run and the input methods used) need to perform a
+ mapping from user input into Unicode code points.
+
+ IDNA2003 used a model whereby input was taken from the user, mapped
+ (via whatever input method mechanisms were used) to a set of Unicode
+ code points, and then further mapped to a set of Unicode code points
+ using the Nameprep profile [RFC3491]. In this procedure, there are
+ two separate mapping steps: first, a mapping done by the input method
+ (which might be controlled by the operating system, the application,
+ or some combination) and then a second mapping performed by the
+ Nameprep portion of the IDNA protocol. The mapping done in Nameprep
+ includes a particular mapping table to re-map some characters to
+ other characters, a particular normalization, and a set of prohibited
+ characters.
+
+ Note that the two-step mapping process means that the
+ mapping chosen by the operating system or application in the first
+ step might differ significantly from the mapping supplied by the
+ Nameprep profile in the second step. This has advantages and
+ disadvantages. Of course, the second mapping regularizes what gets
+ looked up in the DNS, making for better interoperability between
+ implementations that use the Nameprep mapping. However, the
+ application or operating system may choose mappings in their input
+ methods, which when passed through the second (Nameprep) mapping
+ result in characters that are "surprising" to the end user.
+
+ The other important feature of IDNA2003 is that, with very few
+ exceptions, it assumes that any set of Unicode code points provided
+ to the Nameprep mapping can be mapped into a string of Unicode code
+ points that are "sensible", even if that means mapping some code
+ points to nothing (that is, removing the code points from the
+ string). This allowed maximum flexibility in input strings.
+
+ The present version of IDNA (IDNA2008) differs significantly in
+ approach from the original version. First and foremost, it does not
+ provide explicit mapping instructions. Instead, it assumes that the
+ application (perhaps via an operating system input method) will do
+ whatever mapping it requires to convert input into Unicode code
+ points. This has the advantage of giving flexibility to the
+ application to choose a mapping that is suitable for its user given
+ specific user requirements, and avoids the two-step mapping of the
+ original protocol. Instead of a mapping, IDNA2008 provides a set of
+ categories that can be used to specify the valid code points allowed
+ in a domain name.
+
+ In principle, an application ought to take user input of a domain
+ name and convert it to the set of Unicode code points that represent
+ the domain name the user intends. As a practical matter, of course,
+ determining user intent is a tricky business, so an application needs
+ to choose a reasonable mapping from user input. That may differ
+ based on the particular circumstances of a user, depending on locale,
+ language, type of input method, etc. It is up to the application to
+ make a reasonable choice.
+
+7. Migration from IDNA2003 and Unicode Version Synchronization
+
+7.1. Design Criteria
+
+ As mentioned above and in the IAB review and recommendations for IDNs
+ [RFC4690], two key goals of the IDNA2008 design are:
+
+ o to enable applications to be agnostic about which version of
+ Unicode, from 3.2 onward, is supported by the environment in which
+ they are run.
+
+ o to permit incrementally adding new characters, character groups,
+ scripts, and other character collections as they are incorporated
+ into Unicode, doing so without disruption and, in the long term,
+ without "heavy" processes (an IETF consensus process is required
+ by the IDNA2008 specifications and is expected to be required and
+ used until significant experience accumulates with IDNA operations
+ and new versions of Unicode).
+
+7.1.1. Summary and Discussion of IDNA Validity Criteria
+
+ The general criteria for a label to be considered valid under IDNA
+ are (the actual rules are rigorously defined in the Protocol
+ [RFC5891] and Tables [RFC5892] documents):
+
+ o The characters are "letters", marks needed to form letters,
+ numerals, or other code points used to write words in some
+ language. Symbols, drawing characters, and various notational
+ characters are intended to be permanently excluded. There is no
+ evidence that they are important enough to Internet operations or
+ internationalization to justify expansion of domain names beyond
+ the general principle of "letters, digits, and hyphen".
+ (Additional discussion and rationale for the symbol decision
+ appears in Section 7.6.)
+
+ o Other than in very exceptional cases, e.g., where they are needed
+ to write substantially any word of a given language, punctuation
+ characters are excluded. The fact that a word exists is not proof
+ that it should be usable in a DNS label, and DNS labels are not
+ expected to be usable for multiple-word phrases (although they are
+ certainly not prohibited if the conventions and orthography of a
+ particular language cause that to be possible).
+
+ o Characters that are unassigned (have no character assignment at
+ all) in the version of Unicode being used by the registry or
+ application are not permitted, even on lookup. The issues
+ involved in this decision are discussed in Section 7.7.
+
+
+
+
+Klensin Informational [Page 25]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ o Any character that is mapped to another character by a current
+ version of NFKC is prohibited as input to IDNA (for either
+ registration or lookup). With a few exceptions, this principle
+ excludes any character mapped to another by Nameprep [RFC3491].
+
+ The principles above drive the design of rules that are specified
+ exactly in the Tables document. Those rules identify the characters
+ that are valid under IDNA. The rules themselves are normative, and
+ the tables are derived from them, rather than vice versa.
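+
+ The NFKC criterion can be approximated with Python's standard
+ unicodedata module. This is a sketch of the principle only: the
+ normative rules in the Tables document [RFC5892] also involve case
+ folding and explicit exceptions, and the helper name
+ changed_by_nfkc is an assumption of this example.
+
```python
import unicodedata


def changed_by_nfkc(ch: str) -> bool:
    """True if NFKC maps this single character to something else."""
    return unicodedata.normalize("NFKC", ch) != ch


# U+FB01 (ligature fi) and U+FF21 (fullwidth A) are compatibility characters.
print(changed_by_nfkc("\ufb01"), changed_by_nfkc("\uff21"))
# Plain "a" and Eszett (U+00DF) are left alone by NFKC.
print(changed_by_nfkc("a"), changed_by_nfkc("\u00df"))
```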
+
+7.1.2. Labels in Registration
+
+ Any label registered in a DNS zone must be validated -- i.e., the
+ criteria for that label must be met -- in order for applications to
+ work as intended. This principle is not new. For example, since the
+ DNS was first deployed, zone administrators have been expected to
+ verify that names meet "hostname" requirements [RFC0952] where those
+ requirements are imposed by the expected applications. Other
+ application contexts, such as the later addition of special service
+ location formats [RFC2782], imposed new requirements on zone
+ administrators. For zones that will contain IDNs, support for
+ Unicode version-independence requires restrictions on all strings
+ placed in the zone. In particular, for such zones (the exact rules
+ appear in Section 4 of the Protocol document [RFC5891]):
+
+ o Any label that appears to be an A-label, i.e., any label that
+ starts with "xn--", must be valid under IDNA, i.e., it must be a
+ valid A-label, as discussed in Section 2 above.
+
+ o The Unicode tables (i.e., tables of code points, character
+ classes, and properties) and IDNA tables (i.e., tables of
+ contextual rules such as those that appear in the Tables
+ document) must be consistent on the systems performing registration
+ or validating labels to be registered. Note that this does not
+ require that tables reflect the latest version of Unicode, only
+ that all tables used on a given system are consistent with each
+ other.
+
+ Under this model, registry tables will need to be updated (both the
+ Unicode-associated tables and the tables of permitted IDN characters)
+ to enable a new script or other set of new characters. The registry
+ will not be affected by newer versions of Unicode, or newly
+ authorized characters, until and unless it wishes to support them.
+ The zone administrator is responsible for verifying validity for IDNA
+ as well as its local policies -- a more extensive set of checks than
+ are required for looking up the labels. Systems looking up or
+
+
+
+
+
+Klensin Informational [Page 26]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ resolving DNS labels, especially IDN DNS labels, must be able to
+ assume that applicable registration rules were followed for names
+ entered into the DNS.
+
+7.1.3. Labels in Lookup
+
+ Any application processing a label through IDNA so it can be looked
+ up in a DNS zone is required to (the exact rules appear in Section 5
+ of the Protocol document [RFC5891]):
+
+ o Maintain IDNA and Unicode tables that are consistent with regard
+ to versions, i.e., unless the application actually executes the
+ classification rules in the Tables document [RFC5892], its IDNA
+ tables must be derived from the version of Unicode that is
+ supported more generally on the system. As with registration, the
+ tables need not reflect the latest version of Unicode, but they
+ must be consistent.
+
+ o Validate the characters in labels to be looked up only to the
+ extent of determining that the U-label does not contain
+ "DISALLOWED" code points or code points that are unassigned in its
+ version of Unicode.
+
+ o Validate the label itself for conformance with a small number of
+ whole-label rules. In particular, it must verify that:
+
+ * there are no leading combining marks,
+
+ * the Bidi conditions are met if right-to-left characters appear,
+
+ * any required contextual rules are available, and
+
+ * any contextual rules that are associated with joiner characters
+ (and CONTEXTJ characters more generally) are tested.
+
+ o Do not reject labels based on other contextual rules about
+ characters, including mixed-script label prohibitions. Such rules
+ may be used to influence presentation decisions in the user
+ interface, but not to avoid looking up domain names.
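+
+ Two of the lookup-side checks listed above can be sketched in
+ Python as follows. This is a partial illustration under stated
+ assumptions: the names LookupCheck and precheck_label are
+ hypothetical, "unassigned" is judged by the Unicode tables bundled
+ with the Python interpreter (category Cn, which also covers
+ noncharacters), and the DISALLOWED, Bidi, and CONTEXTJ tests are
+ omitted entirely.
+
```python
import unicodedata
from enum import Enum


class LookupCheck(Enum):
    OK = "ok"
    UNASSIGNED = "label contains an unallocated code point"
    LEADING_MARK = "label begins with a combining mark"


def precheck_label(u_label: str) -> LookupCheck:
    """Partial screening of a label before lookup (hypothetical helper)."""
    # General_Category "Cn" means no assignment in the local Unicode tables.
    if any(unicodedata.category(ch) == "Cn" for ch in u_label):
        return LookupCheck.UNASSIGNED
    # Categories Mn, Mc, and Me are the combining marks.
    if u_label and unicodedata.category(u_label[0]).startswith("M"):
        return LookupCheck.LEADING_MARK
    return LookupCheck.OK
```
+
+ Returning a distinct result for the unassigned case lets the caller
+ report it separately from other lookup failures.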
+
+ To further clarify the rules about handling characters that require
+ contextual rules, note that one can have a context-required character
+ (i.e., one that requires a rule), but no rule. In that case, the
+ character is treated the same way DISALLOWED characters are treated,
+ until and unless a rule is supplied. That state is more or less
+ equivalent to "the idea of permitting this character is accepted in
+ principle, but it won't be permitted in practice until consensus is
+ reached on a safe way to use it".
+
+
+
+Klensin Informational [Page 27]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ The ability to add a rule more or less exempts these characters from
+ the prohibition against reclassifying characters from DISALLOWED to
+ PVALID.
+
+ And, obviously, "no rule" is different from "have a rule, but the
+ test either succeeds or fails".
+
+ Lookup applications that follow these rules, rather than having their
+ own criteria for rejecting lookup attempts, are not sensitive to
+ version incompatibilities with the particular zone registry
+ associated with the domain name except for labels containing
+ characters recently added to Unicode.
+
+ An application or client that processes names according to this
+ protocol and then resolves them in the DNS will be able to locate any
+ name that is registered, as long as those registrations are valid
+ under IDNA and its version of the IDNA tables is sufficiently up to
+ date to interpret all of the characters in the label. Messages to
+ users should distinguish between "label contains an unallocated code
+ point" and other types of lookup failures. A failure on the basis of
+ an old version of Unicode may lead the user to a desire to upgrade to
+ a newer version, but will have no other ill effects (this is
+ consistent with behavior in the transition to the DNS when some hosts
+ could not yet handle some forms of names or record types).
+
+7.2. Changes in Character Interpretations
+
+ As a consequence of the elimination of mapping, the current version
+ of IDNA changes the interpretation of a few characters relative to
+ its predecessors. This subsection outlines the issues and discusses
+ possible transition strategies.
+
+7.2.1. Character Changes: Eszett and Final Sigma
+
+ In those scripts that make case distinctions, there are a few
+ characters for which an obvious and unique uppercase character has
+ not historically been available to match a lowercase one, or vice
+ versa. For those characters, the mappings used in constructing the
+ Stringprep tables for IDNA2003, performed using the Unicode
+ toCaseFold operation (see Section 5.18 of the Unicode Standard
+ [Unicode52]), generate different characters or sets of characters.
+ Those operations are not reversible and lose even more information
+ than traditional uppercase or lowercase transformations, but are more
+ useful than those transformations for comparison purposes. Two
+ notable characters of this type are the German character Eszett
+ (Sharp S, U+00DF) and the Greek Final Form Sigma (U+03C2). The
+ former is case folded to the ASCII string "ss", the latter to a
+ medial (lowercase) Sigma (U+03C3).
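+
+ The effect of case folding on these two characters can be observed
+ with Python's str.casefold(), which applies Unicode full case
+ folding. The variable names are illustrative only.
+
```python
eszett = "\u00df"       # LATIN SMALL LETTER SHARP S
final_sigma = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA

folded_eszett = eszett.casefold()
folded_sigma = final_sigma.casefold()

print(folded_eszett)             # ss
print(folded_sigma == "\u03c3")  # True: folded to medial sigma
print(eszett.lower() == eszett)  # True: plain lowercasing keeps Eszett
```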
+
+
+
+Klensin Informational [Page 28]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+7.2.2. Character Changes: Zero Width Joiner and Zero Width Non-Joiner
+
+ IDNA2003 mapped both ZERO WIDTH JOINER (ZWJ, U+200D) and ZERO WIDTH
+ NON-JOINER (ZWNJ, U+200C) to nothing, effectively dropping these
+ characters from any label in which they appeared and treating strings
+ containing them as identical to strings that did not. As discussed
+ in Section 3.1.2 above, those characters are essential for writing
+ many reasonable mnemonics for certain scripts. However, treating
+ them as valid in IDNA2008, even with contextual restrictions, raises
+ approximately the same problem as exists with Eszett and Final Sigma:
+ strings that were valid under IDNA2003 have different interpretations
+ as labels, and different A-labels, than the same strings under this
+ newer version.
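+
+ The IDNA2003 behavior can be reproduced with Python's standard
+ stringprep module, which exposes Stringprep table B.1 ("commonly
+ mapped to nothing"). The sketch below covers only that one mapping
+ step, not the whole of Nameprep, and the helper name
+ drop_mapped_to_nothing is hypothetical.
+
```python
import stringprep

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def drop_mapped_to_nothing(label: str) -> str:
    """Remove every character that appears in Stringprep table B.1."""
    return "".join(ch for ch in label if not stringprep.in_table_b1(ch))


with_joiner = "ab" + ZWNJ + "cd"
# After the mapping, the joiner is gone and the two spellings coincide.
print(drop_mapped_to_nothing(with_joiner) == "abcd")
```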
+
+7.2.3. Character Changes and the Need for Transition
+
+ The decision to eliminate mandatory and standardized mappings,
+ including case folding, from the IDNA2008 protocol in order to make
+ A-labels and U-labels idempotent made these characters problematic.
+ If they were to be disallowed, important words and mnemonics could
+ not be written in orthographically reasonable ways. If they were to
+ be permitted as distinct characters, there would be no information
+ loss and registries would have more flexibility, but IDNA2003 and
+ IDNA2008 lookups might result in different A-labels.
+
+ With the understanding that there would be incompatibility either way
+ but a judgment that the incompatibility was not significant enough to
+ justify a prefix change, the Working Group concluded that Eszett and
+ Final Form Sigma should be treated as distinct and Protocol-Valid
+ characters.
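+
+ The resulting divergence can be observed with Python's built-in
+ "idna" codec, which implements the original IDNA (RFC 3490)
+ including the Nameprep mapping. The IDNA2008 side is only
+ approximated here, by applying Punycode and the ACE prefix directly
+ with none of the validity checks of the Protocol document
+ [RFC5891]; that shortcut is an assumption of this sketch.
+
```python
# A Greek word ending in Final Sigma (U+03C2), written with escapes.
label = "\u03b2\u03cc\u03bb\u03bf\u03c2"

# IDNA2003: Nameprep folds Final Sigma to medial sigma before encoding.
a_label_2003 = label.encode("idna")

# IDNA2008 encoding step only: the label is encoded as written.
a_label_2008 = b"xn--" + label.encode("punycode")

print(a_label_2003)
print(a_label_2008)
print(a_label_2003 != a_label_2008)
```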
+
+ Since these characters are interpreted in different ways under the
+ older and newer versions of IDNA, transition strategies and policies
+ will be necessary. Some actions can reasonably be taken by
+ applications' client programs (those that perform lookup operations
+ or cause them to be performed), but because of the diversity of
+ situations and uses of the DNS, much of the responsibility will need
+ to fall on registries.
+
+ Registries, especially those maintaining zones for third parties,
+ must decide how to introduce a new service in a way that does not
+ create confusion or significantly weaken or invalidate existing
+ identifiers. This is not a new problem; registries were faced with
+ similar issues when IDNs were introduced (potentially, and especially
+ for Latin-based scripts, in conflict with existing labels that had
+ been rendered in ASCII characters by applying more or less
+ standardized conventions) and when other new forms of strings have
+ been permitted as labels.
+
+
+
+Klensin Informational [Page 29]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+7.2.4. Transition Strategies
+
+ There are several approaches to the introduction of new characters or
+ changes in interpretation of existing characters from their mapped
+ forms in the earlier version of IDNA. The transition issue is
+ complicated because the forms of these labels after the
+ ToUnicode(ToASCII()) translation in IDNA2003 not only remain valid
+ but do not provide strong indications of what the registrant
+ intended: a string containing "ss" could have simply been intended to
+ be that string or could have been intended to contain an Eszett; a
+ string containing lowercase Sigma could have been intended to contain
+ Final Sigma (one might make heuristic guesses based on position in a
+ string, but the long tradition of forming labels by concatenating
+ words makes such heuristics unreliable), and strings that do not
+ contain ZWJ or ZWNJ might have been intended to contain them.
+ Without any preference or claim to completeness, some of these, all
+ of which have been used by registries in the past for similar
+ transitions, are:
+
+ 1. Do not permit use of the newly available character at the
+ registry level. This might cause lookup failures if a domain
+ name were to be written with the expectation of the IDNA2003
+ mapping behavior, but would eliminate any possibility of false
+ matches.
+
+ 2. Hold a "sunrise"-like arrangement in which holders of labels
+ that contain "ss" (the Eszett case) or lowercase Sigma (the Final
+ Sigma case), or that might have contained ZWJ or ZWNJ in context,
+ are given priority (and perhaps other benefits) for registering
+ the corresponding string containing Eszett, Final Sigma, or the
+ appropriate zero-width character, respectively.
+
+ 3. Adopt some sort of "variant" approach in which registrants obtain
+ labels with both character forms.
+
+ 4. Adopt a different form of "variant" approach in which
+ registration of additional strings that would produce the same
+ A-label if interpreted according to IDNA2003 is either not
+ permitted at all or permitted only by the registrant who already
+ has one of the names.
+
+ 5. Ignore the issue and assume that the marketplace or other
+ mechanisms will sort things out.
+
+ In any event, a registry (at any level of the DNS tree) that chooses
+ to permit labels to be registered that contain these characters, or
+ considers doing so, will have to address the relationship with
+ existing, possibly conflicting, labels in some way, just as
+
+
+
+Klensin Informational [Page 30]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ registries that already had a considerable number of labels did when
+ IDNs were first introduced.
+
+7.3. Elimination of Character Mapping
+
+ As discussed at length in Section 6, IDNA2003, via Nameprep (see
+ Section 7.5), mapped many characters into related ones. Those
+ mappings no longer exist as requirements in IDNA2008. These
+ specifications strongly prefer that only A-labels or U-labels be used
+ in protocol contexts and as much as practical more generally.
+ IDNA2008 does anticipate situations in which some mapping at the time
+ of user input into lookup applications is appropriate and desirable.
+ The issues are discussed in Section 6 and specific recommendations
+ are made in the Mapping document [IDNA2008-Mapping].
+
+7.4. The Question of Prefix Changes
+
+ The conditions that would have required a change in the IDNA ACE
+ prefix ("xn--", used in IDNA2003) were of great concern to the
+ community. A prefix change would have clearly been necessary if the
+ algorithms were modified in a manner that would have created serious
+ ambiguities during subsequent transition in registrations. This
+ section summarizes the working group's conclusions about the
+ conditions under which a change in the prefix would have been
+ necessary and the implications of such a change.
+
+7.4.1. Conditions Requiring a Prefix Change
+
+ An IDN prefix change would have been needed if a given string would
+ be looked up or otherwise interpreted differently depending on the
+ version of the protocol or tables being used. This IDNA upgrade
+ would have required a prefix change if, and only if, one of the
+ following four conditions were met:
+
+ 1. The conversion of an A-label to Unicode (i.e., a U-label) would
+ have yielded one string under IDNA2003 and a different string
+ under IDNA2008.
+
+ 2. In a significant number of cases, an input string that was valid
+ under IDNA2003 and also valid under IDNA2008 would have yielded
+ two different A-labels with the different versions. This
+ condition is believed to be essentially equivalent to the one
+ above except for a very small number of edge cases that were not
+ found to justify a prefix change (see Section 7.2).
+
+ Note that if the input string was valid under one version and not
+ valid under the other, this condition would not apply. See the
+ first item in Section 7.4.2, below.
+
+
+
+Klensin Informational [Page 31]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ 3. A fundamental change was made to the semantics of the string that
+ would be inserted in the DNS, e.g., if a decision were made to
+ try to include language or script information in the encoding in
+ addition to the string itself.
+
+ 4. A sufficiently large number of characters were added to Unicode
+ so that the Punycode mechanism for block offsets could no longer
+ reference the higher-numbered planes and blocks. This condition
+ is unlikely even in the long term and is certain not to arise in
+ the next several years.
+
+7.4.2. Conditions Not Requiring a Prefix Change
+
+ As a result of the principles described above, none of the following
+ changes required a new prefix:
+
+ 1. Prohibition of some characters as input to IDNA. Such a
+ prohibition might make names that were previously registered
+ inaccessible, but did not change those names.
+
+ 2. Adjustments in IDNA tables or actions, including normalization
+ definitions, that affected characters that were already invalid
+ under IDNA2003.
+
+ 3. Changes in the style of the IDNA definition that did not alter
+ the actions performed by IDNA.
+
+7.4.3. Implications of Prefix Changes
+
+ While it might have been possible to make a prefix change, the costs
+ of such a change are considerable. Registries could not have
+ converted all IDNA2003 ("xn--") registrations to a new form at the
+ same time and synchronized that change with applications supporting
+ lookup. Unless all existing registrations were simply to be declared
+ invalid (and perhaps even then), systems that needed to support both
+ labels with old prefixes and labels with new ones would be required
+ to first process a putative label under the IDNA2008 rules and try to
+ look it up and then, if it were not found, would be required to
+ process the label under IDNA2003 rules and look it up again. That
+ process would probably have significantly slowed down all processing
+ that involved IDNs in the DNS, especially since a fully-qualified
+ name might contain a mixture of labels that were registered with the
+ old and new prefixes. That would have made DNS caching very
+ difficult. In addition, looking up the same input string as two
+ separate A-labels would have created some potential for confusion and
+ attacks, since the labels could map to different targets and then
+ resolve to different entries in the DNS.
+
+
+
+
+Klensin Informational [Page 32]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ Consequently, a prefix change should have been, and was, avoided if
+ at all possible, even if it meant accepting some IDNA2003 decisions
+ about character distinctions as irreversible and/or giving special
+ treatment to edge cases.
+
+7.5. Stringprep Changes and Compatibility
+
+ The Nameprep specification [RFC3491], a key part of IDNA2003, is a
+ profile of Stringprep [RFC3454]. While Nameprep is a Stringprep
+ profile specific to IDNA, Stringprep is used by a number of other
+ protocols. Were Stringprep to have been modified by IDNA2008, those
+ changes to improve the handling of IDNs could cause problems for
+ non-DNS uses, most notably if they affected identification and
+ authentication protocols. Several elements of IDNA2008 give
+ interpretations to strings prohibited under IDNA2003 or prohibit
+ strings that IDNA2003 permitted. Those elements include the new
+ inclusion information in the Tables document [RFC5892], the reduction
+ in the number of characters permitted as input for registration or
+ lookup (Section 3), and even the changes in handling of right-to-left
+ strings as described in the Bidi document [RFC5893]. IDNA2008 does
+ not use Nameprep or Stringprep at all, so there are no side-effect
+ changes to other protocols.
+
+ It is particularly important to keep IDNA processing separate from
+ processing for various security protocols because some of the
+ constraints that are necessary for smooth and comprehensible use of
+ IDNs may be unwanted or undesirable in other contexts. For example,
+ the criteria for good passwords or passphrases are very different
+ from those for desirable IDNs: passwords should be hard to guess,
+ while domain names should normally be easily memorable. Similarly,
+ internationalized Small Computer System Interface (SCSI) identifiers
+ and other protocol components are likely to have different
+ requirements than IDNs.
+
+7.6. The Symbol Question
+
+ One of the major differences between this specification and the
+ original version of IDNA is that IDNA2003 permitted non-letter
+ symbols of various sorts, including punctuation and line-drawing
+ symbols, in the protocol. They were always discouraged in practice.
+ In particular, both the "IESG Statement" about IDNA and all versions
+ of the ICANN Guidelines specify that only language characters be used
+ in labels. This specification disallows symbols entirely. There are
+ several reasons for this, which include:
+
+ 1. As discussed elsewhere, the original IDNA specification assumed
+ that as many Unicode characters as possible should be permitted,
+ directly or via mapping to other characters, in IDNs. This
+
+
+
+Klensin Informational [Page 33]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ specification operates on an inclusion model, extrapolating from
+ the original "hostname" rules (LDH, see the Definitions document
+ [RFC5890]) -- which have served the Internet very well -- to a
+ Unicode base rather than an ASCII base.
+
+ 2. Symbol names are more problematic than letters because there may
+ be no general agreement on whether a particular glyph matches a
+ symbol; there are no uniform conventions for naming; variations
+ such as outline, solid, and shaded forms may or may not exist;
+ and so on. As just one example, consider a "heart" symbol as it
+ might appear in a logo that might be read as "I love...". While
+ the user might read such a logo as "I love..." or "I heart...",
+ considerable knowledge of the coding distinctions made in Unicode
+ is needed to know that there is more than one "heart" character
+ (e.g., U+2665, U+2661, and U+2765) and how to describe it. These
+ issues are of particular importance if strings are expected to be
+ understood or transcribed by the listener after being read out
+ loud.
+
+ 3. Design of a screen reader used by blind Internet users who must
+ listen to renderings of IDN domain names and possibly reproduce
+ them on the keyboard becomes considerably more complicated when
+ the names of characters are not obvious and intuitive to anyone
+ familiar with the language in question.
+
+ 4. As a simplified example of this, assume one wanted to use a
+ "heart" or "star" symbol in a label. This is problematic because
+ those names are ambiguous in the Unicode system of naming (the
+ actual Unicode names require far more qualification). A user or
+ would-be registrant has no way to know -- absent careful study of
+ the code tables -- whether such a name is ambiguous (e.g., where
+ there are multiple "heart" characters) or not. Conversely, the
+ user seeing the hypothetical label doesn't know whether to read it
+ (or try to transmit it to a colleague by voice) as "heart", as
+ "love", as "black heart", or as any of the other examples below.
+
+ 5. The actual situation is even worse than this. There is no
+ possible way for a normal, casual, user to tell the difference
+ between the hearts of U+2665 and U+2765 and the stars of U+2606
+ and U+2729 without somehow knowing to look for a distinction. We
+ have a white heart (U+2661) and a few black hearts. Consequently,
+ describing a label as containing a heart is hopelessly ambiguous:
+ we can only know that it contains one of several characters that
+ look like hearts or have "heart" in their names. In cities where
+ "Square" is a popular part of a location name, one might well
+ want to use a square symbol in a label as well and there are far
+ more squares of various flavors in Unicode than there are hearts
+ or stars.
+
+
+
+Klensin Informational [Page 34]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ The consequence of these ambiguities is that symbols are a very poor
+ basis for reliable communication. Consistent with this conclusion,
+ the Unicode standard recommends that strings used in identifiers not
+ contain symbols or punctuation [Unicode-UAX31]. Of course, these
+ difficulties with symbols do not arise with actual pictographic
+ languages and scripts which would be treated like any other language
+ characters; the two should not be confused.
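+
+ The naming problem can be made concrete by listing the formal
+ Unicode names of the code points mentioned above. This sketch
+ relies on the Unicode tables bundled with Python's unicodedata
+ module; the list name lookalikes is illustrative only.
+
```python
import unicodedata

# The heart and star code points cited in the discussion above.
lookalikes = ["\u2665", "\u2661", "\u2765", "\u2606", "\u2729"]

for ch in lookalikes:
    # General_Category "So" (Symbol, other) is not a letter, mark, or digit.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```
+
+ None of the names printed is simply "heart" or "star", which is
+ the qualification problem described in item 4.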
+
+7.7. Migration between Unicode Versions: Unassigned Code Points
+
+ In IDNA2003, labels containing unassigned code points are looked up
+ on the assumption that, if they appear in labels and can be mapped
+ and then resolved, the relevant standards must have changed and the
+ registry has properly allocated only assigned values.
+
+ In the IDNA2008 protocol, strings containing unassigned code points
+ must not be either looked up or registered. In summary, the status
+ of an unassigned character with regard to the DISALLOWED,
+ PROTOCOL-VALID, and CONTEXTUAL RULE REQUIRED categories cannot be
+ evaluated until a character is actually assigned and known. There
+ are several reasons for this, with the most important ones being:
+
+ o Tests involving the context of characters (e.g., some characters
+ being permitted only adjacent to others of specific types) and
+ integrity tests on complete labels are needed. Unassigned code
+ points cannot be permitted because one cannot determine whether
+ particular code points will require contextual rules (and what
+ those rules should be) before characters are assigned to them and
+ the properties of those characters fully understood.
+
+ o It cannot be known in advance, and with sufficient reliability,
+ whether a newly assigned code point will be associated with a
+ character that would be disallowed by the rules in the Tables
+ document [RFC5892] (such as a compatibility character). In
+ IDNA2003, since there is no direct dependency on NFKC (many of the
+ entries in Stringprep's tables are based on NFKC, but IDNA2003
+ depends only on Stringprep), allocation of a compatibility
+ character might produce some odd situations, but it would not be a
+ problem. In IDNA2008, where compatibility characters are
+ DISALLOWED unless character-specific exceptions are made,
+ permitting strings containing unassigned characters to be looked
+ up would violate the principle that characters in DISALLOWED are
+ not looked up.
+
+ o The Unicode Standard specifies that an unassigned code point
+ normalizes (and, where relevant, case folds) to itself. If the
+ code point is later assigned to a character, and particularly if
+ the newly assigned code point has a combining class that
+
+
+
+Klensin Informational [Page 35]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ determines its placement relative to other combining characters,
+ it could normalize to some other code point or sequence.
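+
+ The behavior described in the last item can be checked against
+ whatever version of Unicode a given system supports. The sketch
+ below uses Python's unicodedata module and assumes that U+0378 is
+ unassigned in the bundled tables, which may cease to be true in a
+ future version of Unicode.
+
```python
import unicodedata

# Assumption: U+0378 has no assignment in this interpreter's tables.
candidate = "\u0378"

print(unicodedata.unidata_version)  # Unicode version of the local tables

is_unassigned = unicodedata.category(candidate) == "Cn"
# An unassigned code point normalizes and case folds to itself.
nfkc_unchanged = unicodedata.normalize("NFKC", candidate) == candidate
fold_unchanged = candidate.casefold() == candidate
print(is_unassigned, nfkc_unchanged, fold_unchanged)
```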
+
+ It is possible to argue that the issues above are not important and
+ that, as a consequence, it is better to retain the principle of
+ looking up labels even if they contain unassigned characters because
+ all of the important scripts and characters have been coded as of
+ Unicode 5.2 (or even earlier), and hence unassigned code points will
+ be assigned only to obscure characters or archaic scripts.
+ Unfortunately, that does not appear to be a safe assumption for at
+ least two reasons. First, much the same claim of completeness has
+ been made for earlier versions of Unicode. The reality is that a
+ script that is obscure to much of the world may still be very
+ important to those who use it. Cultural and linguistic preservation
+ principles make it inappropriate to declare the script of no
+ importance in IDNs. Second, we already have counterexamples, e.g.,
+ in the relationships associated with new Han characters being added
+ (whether in the BMP or in Unicode Plane 2).
+
+ Independent of the technical transition issues identified above, it
+ can be observed that any addition of characters to an existing script
+ to make it easier to use or to better accommodate particular
+ languages may lead to transition issues. Such additions may change
+ the preferred form for writing a particular string, changes that may
+ be reflected, e.g., in keyboard transition modules that would
+ necessarily be different from those for earlier versions of Unicode
+ where the newer characters may not exist. This creates an inherent
+ transition problem because attempts to access labels may use either
+ the old or the new conventions, requiring registry action whether or
+ not the older conventions were used in labels. The need to consider
+ transition mechanisms is inherent to evolution of Unicode to better
+ accommodate writing systems and is independent of how IDNs are
+ represented in the DNS or how transitions among versions of those
+ mechanisms occur. The requirement for transitions of this type is
+ illustrated by the addition of Malayalam Chillu in Unicode 5.1.0.
+
+7.8. Other Compatibility Issues
+
+ The 2003 IDNA model includes several odd artifacts of the context in
+ which it was developed. Many, if not all, of these are potential
+ avenues for exploits, especially if the registration process permits
+ "source" names (names that have not been processed through IDNA and
+ Nameprep) to be registered. As one example, since the character
+ Eszett, used in German, is mapped by IDNA2003 into the sequence "ss"
+ rather than being retained as itself or prohibited, a string
+ containing that character, but that is otherwise in ASCII, is not
+ really an IDN (in the U-label sense defined above). After Nameprep
+ maps out the Eszett, the result is an ASCII string and so it does not
+
+
+
+Klensin Informational [Page 36]
+
+RFC 5894 IDNA Rationale August 2010
+
+
+ get an xn-- prefix, but the string that can be displayed to a user
+ appears to be an IDN. IDNA2008 eliminates this artifact. A
+ character is either permitted as itself or it is prohibited; special
+ cases that make sense only in a particular linguistic or cultural
+ context can be dealt with as localization matters where appropriate.
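+
+ The IDNA2003 artifact can be reproduced with Python's built-in
+ "idna" codec, which implements the original IDNA (RFC 3490). The
+ variable names are illustrative only.
+
```python
name = "stra\u00dfe"  # "strasse" written with Eszett (U+00DF)

# Nameprep maps Eszett to "ss", so the result is plain ASCII.
ace_2003 = name.encode("idna")

print(ace_2003)                      # b'strasse'
print(ace_2003.startswith(b"xn--"))  # False
```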
+
+8. Name Server Considerations
+
+8.1. Processing Non-ASCII Strings
+
+ Existing DNS servers do not know the IDNA rules for handling
+ non-ASCII forms of IDNs, and therefore need to be shielded from them.
+ All existing channels through which names can enter a DNS server
+ database (for example, master files (as described in RFC 1034) and
+ DNS update messages [RFC2136]) could not be IDNA-aware because they
+ predate IDNA. Other sections of this document provide the needed
+ shielding by ensuring that internationalized domain names entering
+ DNS server databases through such channels have already been
+ converted to their equivalent ASCII A-label forms.
+
+ Because of the distinction made between the algorithms for
+ Registration and Lookup in Sections 4 and 5 (respectively) of the
+ Protocol document [RFC5891] (a domain name containing only ASCII code
+ points cannot be converted to an A-label), there cannot be more than
+ one A-label form for any given U-label.
+
+ As specified in clarifications to the DNS specification [RFC2181],
+ the DNS protocol explicitly allows domain labels to contain octets
+ beyond the ASCII range (0000..007F), and this document does not
+ change that. However, although the interpretation of octets
+ 0080..00FF is well-defined in the DNS, many application protocols
+ support only ASCII labels and there is no defined interpretation of
+ these non-ASCII octets as characters and, in particular, no
+ interpretation of case-independent matching for them (e.g., see the
+ clarification on DNS case insensitivity [RFC4343]). If labels
+ containing these octets are returned to applications, unpredictable
+ behavior could result. The A-label form, which cannot contain those
+ characters, is the only standard representation for internationalized
+ labels in the DNS protocol.
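+
+ The matching problem described above can be illustrated with a
+ minimal sketch in Python. It assumes that case-insensitive
+ comparison folds only the ASCII letters, in the manner clarified by
+ RFC 4343, and it uses UTF-8 purely as an example of how non-ASCII
+ octets might end up in a label; the DNS defines no such encoding.
+ The function names are hypothetical.
+

```python
def dns_fold(label: bytes) -> bytes:
    """Fold ASCII A-Z to a-z; octets 0x80..0xFF are left untouched."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in label)

def dns_labels_match(a: bytes, b: bytes) -> bool:
    """Compare two labels the way an ASCII-only case fold would."""
    return dns_fold(a) == dns_fold(b)

def has_non_ascii_octets(label: bytes) -> bool:
    """True if any octet lies beyond the ASCII range."""
    return any(b > 0x7F for b in label)

# Assumption for illustration: the labels arrive as UTF-8 octets.
upper = "\u00c9COLE".encode("utf-8")   # capital E with acute + "COLE"
lower = "\u00e9cole".encode("utf-8")   # small e with acute + "cole"

print(has_non_ascii_octets(upper))             # True
print(dns_labels_match(b"EXAMPLE", b"example"))  # True: ASCII letters fold
print(dns_labels_match(upper, lower))          # False: the accented letter does not
```
+
+ An A-label contains only ASCII octets, so the ASCII-only fold is
+ sufficient for it.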
+
+8.2. Root and Other DNS Server Considerations
+
+ IDNs in A-label form will generally be somewhat longer than current
+ domain names, so the bandwidth needed by the root servers is likely
+ to go up by a small amount. Also, queries and responses for IDNs
+ will probably be somewhat longer than typical queries historically,
+ so Extension Mechanisms for DNS (EDNS0) [RFC2671] support may be more
+ important (otherwise, queries and responses may be forced to go to
+ TCP instead of UDP).
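+
+ The growth in length can be estimated with a rough sketch in Python.
+ The helper below is hypothetical: it prepends the ACE prefix to the
+ raw Punycode form (RFC 3492) and skips all IDNA validation, so it
+ indicates size only. The sample labels are arbitrary.
+

```python
from typing import Tuple

MAX_LABEL_OCTETS = 63   # per-label limit from RFC 1035

def a_label(u_label: str) -> bytes:
    """Hypothetical helper: ACE prefix plus raw Punycode, no IDNA validation."""
    return b"xn--" + u_label.encode("punycode")

def growth(u_label: str) -> Tuple[int, int]:
    """Return (characters in the U-label, octets in the A-label)."""
    return len(u_label), len(a_label(u_label))

# Arbitrary sample labels in Latin, Han/Hiragana, and Arabic script.
for name in ("b\u00fccher", "\u4f8b\u3048", "\u0645\u062b\u0627\u0644"):
    chars, octets = growth(name)
    fits = octets <= MAX_LABEL_OCTETS
    print(f"{chars} characters -> {octets} octets (fits in a label: {fits})")
```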
+
+9. Internationalization Considerations
+
+ DNS labels and fully-qualified domain names provide mnemonics that
+ assist in identifying and referring to resources on the Internet.
+ IDNs expand the range of those mnemonics to include those based on
+ languages and character sets other than Western European and Roman-
+ derived ones. But domain "names" are not, in general, words in any
+ language. The recommendations of the IETF policy on character sets
+ and languages (BCP 18 [RFC2277]) are applicable to situations in
+ which language identification is used to provide language-specific
+ contexts. The DNS is, by contrast, global and international and
+ ultimately has nothing to do with languages. Adding languages (or
+ similar context) to IDNs generally, or to DNS matching in particular,
+ would imply context-dependent matching in DNS, which would be a very
+ significant change to the DNS protocol itself. It would also imply
+ that users would need to identify the language associated with a
+ particular label in order to look that label up. That knowledge is
+ generally not available because many labels are not words in any
+ language and some may be words in more than one.
+
+10. IANA Considerations
+
+ This section gives an overview of IANA registries required for IDNA.
+ The actual definitions of, and specifications for, the first two,
+ which have been newly created for IDNA2008, appear in the Tables
+ document [RFC5892]. This document describes the registries, but it
+ does not specify any IANA actions.
+
+10.1. IDNA Character Registry
+
+ The distinction among the major categories "UNASSIGNED",
+ "DISALLOWED", "PROTOCOL-VALID", and "CONTEXTUAL RULE REQUIRED" is
+ made by special categories and rules that are integral elements of
+ the Tables document. While not normative, an IANA registry of
+ characters and scripts and their categories, updated for each new
+ version of Unicode and the characters it contains, is convenient for
+ programming and validation purposes. The details of this registry
+ are specified in the Tables document.
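+
+ As an illustration of how such a registry might be consumed by a
+ program, the following Python sketch looks up code points in a
+ range table. The table is a tiny hand-picked excerpt chosen for
+ illustration, not the registry itself; the type and function names
+ are hypothetical. Anything outside the excerpt is reported as
+ unknown rather than guessed.
+

```python
from enum import Enum
from typing import List, Optional, Tuple

class Category(Enum):
    UNASSIGNED = "UNASSIGNED"
    DISALLOWED = "DISALLOWED"
    PROTOCOL_VALID = "PROTOCOL-VALID"
    CONTEXTUAL_RULE_REQUIRED = "CONTEXTUAL RULE REQUIRED"

# Illustrative excerpt only; a real program would load the full table
# published for the Unicode version it supports.
EXCERPT: List[Tuple[int, int, Category]] = [
    (0x002D, 0x002D, Category.PROTOCOL_VALID),            # HYPHEN-MINUS
    (0x0030, 0x0039, Category.PROTOCOL_VALID),            # ASCII digits
    (0x0041, 0x005A, Category.DISALLOWED),                # ASCII uppercase
    (0x0061, 0x007A, Category.PROTOCOL_VALID),            # ASCII lowercase
    (0x00B7, 0x00B7, Category.CONTEXTUAL_RULE_REQUIRED),  # MIDDLE DOT
    (0x00DF, 0x00DF, Category.PROTOCOL_VALID),            # Eszett
    (0x200C, 0x200D, Category.CONTEXTUAL_RULE_REQUIRED),  # ZWNJ, ZWJ
]

def lookup(code_point: int) -> Optional[Category]:
    """Return the category from the excerpt, or None if not covered."""
    for low, high, category in EXCERPT:
        if low <= code_point <= high:
            return category
    return None

def label_categories(label: str) -> List[Tuple[str, Optional[Category]]]:
    """Pair each code point of a label with its category from the excerpt."""
    return [(f"U+{ord(ch):04X}", lookup(ord(ch))) for ch in label]

for code_point, category in label_categories("Stra\u00dfe"):
    print(code_point, category.value if category else "not in excerpt")
```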
+
+10.2. IDNA Context Registry
+
+ IANA has created and now maintains a list of approved contextual
+ rules for characters that are defined in the IDNA Character Registry
+ list as requiring a Contextual Rule (i.e., the types of rules
+ described in Section 3.1.2). The details for those rules appear in
+ the Tables document.
+
+10.3. IANA Repository of IDN Practices of TLDs
+
+ This registry, historically described as the "IANA Language Character
+ Set Registry" or "IANA Script Registry" (both somewhat misleading
+ terms), is maintained by IANA at the request of ICANN. It is used to
+ provide a central documentation repository of the IDN policies used
+ by top level domain (TLD) registries who volunteer to contribute to
+ it and is used in conjunction with ICANN Guidelines for IDN use.
+
+ It is not an IETF-managed registry and, while the protocol changes
+ specified here may call for some revisions to the tables, IDNA2008
+ has no direct effect on that registry and no IANA action is required
+ as a result.
+
+11. Security Considerations
+
+11.1. General Security Issues with IDNA
+
+ This document is purely explanatory and informational and
+ consequently introduces no new security issues. It would, of course,
+ be a poor idea for someone to try to implement from it; such an
+ attempt would almost certainly lead to interoperability problems and
+ might lead to security ones. A discussion of security issues with
+ IDNA, including some relevant history, appears in the Definitions
+ document [RFC5890].
+
+12. Acknowledgments
+
+ The editor and contributors would like to express their thanks to
+ those who contributed significant early (pre-working group) review
+ comments, sometimes accompanied by text: Paul Hoffman, Simon
+ Josefsson, and Sam Weiler. In addition, some specific ideas were
+ incorporated from suggestions, text, or comments about sections that
+ were unclear supplied by Vint Cerf, Frank Ellerman, Michael Everson,
+ Asmus Freytag, Erik van der Poel, Michel Suignard, and Ken Whistler.
+ Thanks are also due to Vint Cerf, Lisa Dusseault, Debbie Garside, and
+ Jefsey Morfin for conversations that led to considerable improvements
+ in the content of this document and to several others, including Ben
+ Campbell, Martin Duerst, Subramanian Moonesamy, Peter Saint-Andre,
+ and Dan Winship, for catching specific errors and recommending
+ corrections.
+
+ A meeting was held on 30 January 2008 to attempt to reconcile
+ differences in perspective and terminology about this set of
+ specifications between the design team and members of the Unicode
+ Technical Consortium. The discussions at and subsequent to that
+ meeting were very helpful in focusing the issues and in refining the
+ specifications. The active participants at that meeting were (in
+ alphabetic order, as usual) Harald Alvestrand, Vint Cerf, Tina Dam,
+ Mark Davis, Lisa Dusseault, Patrik Faltstrom (by telephone), Cary
+ Karp, John Klensin, Warren Kumari, Lisa Moore, Erik van der Poel,
+ Michel Suignard, and Ken Whistler. We express our thanks to Google
+ for support of that meeting and to the participants for their
+ contributions.
+
+ Useful comments and text on the working group versions of the working
+ draft were received from many participants in the IETF "IDNABIS"
+ working group and a number of document changes resulted from mailing
+ list discussions within that group. Marcos Sanz provided specific
+ analysis and suggestions that were exceptionally helpful in refining
+ the text, as did Vint Cerf, Martin Duerst, Andrew Sullivan, and Ken
+ Whistler. Lisa Dusseault provided extensive editorial suggestions
+ during the spring of 2009, most of which were incorporated.
+
+13. Contributors
+
+ While the listed editor held the pen, the core of this document and
+ the initial working group version represent the joint work and
+ conclusions of an ad hoc design team consisting of the editor and, in
+ alphabetic order, Harald Alvestrand, Tina Dam, Patrik Faltstrom, and
+ Cary Karp. Considerable material describing mapping principles has
+ been incorporated from a draft of the Mapping document
+ [IDNA2008-Mapping] by Pete Resnick and Paul Hoffman. In addition,
+ there were many specific contributions and helpful comments from
+ those listed in the Acknowledgments section and others who have
+ contributed to the development and use of the IDNA protocols.
+
+14. References
+
+14.1. Normative References
+
+ [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
+ "Internationalizing Domain Names in Applications
+ (IDNA)", RFC 3490, March 2003.
+
+ [RFC3492] Costello, A., "Punycode: A Bootstring encoding of
+ Unicode for Internationalized Domain Names in
+ Applications (IDNA)", RFC 3492, March 2003.
+
+ [RFC5890] Klensin, J., "Internationalized Domain Names for
+ Applications (IDNA): Definitions and Document
+ Framework", RFC 5890, August 2010.
+
+ [RFC5891] Klensin, J., "Internationalized Domain Names in
+ Applications (IDNA): Protocol", RFC 5891, August 2010.
+
+ [RFC5892] Faltstrom, P., "The Unicode Code Points and
+ Internationalized Domain Names for Applications (IDNA)",
+ RFC 5892, August 2010.
+
+ [RFC5893] Alvestrand, H. and C. Karp, "Right-to-Left Scripts for
+ Internationalized Domain Names for Applications (IDNA)",
+ RFC 5893, August 2010.
+
+ [Unicode52] The Unicode Consortium. The Unicode Standard, Version
+ 5.2.0, defined by: "The Unicode Standard, Version
+ 5.2.0", (Mountain View, CA: The Unicode Consortium,
+ 2009. ISBN 978-1-936213-00-9).
+ <http://www.unicode.org/versions/Unicode5.2.0/>.
+
+14.2. Informative References
+
+ [IDNA2008-Mapping]
+ Resnick, P. and P. Hoffman, "Mapping Characters in
+ Internationalized Domain Names for Applications (IDNA)",
+ Work in Progress, April 2010.
+
+ [RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD
+ Internet host table specification", RFC 952,
+ October 1985.
+
+ [RFC1034] Mockapetris, P., "Domain names - concepts and
+ facilities", STD 13, RFC 1034, November 1987.
+
+ [RFC1035] Mockapetris, P., "Domain names - implementation and
+ specification", STD 13, RFC 1035, November 1987.
+
+ [RFC1123] Braden, R., "Requirements for Internet Hosts -
+ Application and Support", STD 3, RFC 1123, October 1989.
+
+ [RFC2136] Vixie, P., Thomson, S., Rekhter, Y., and J. Bound,
+ "Dynamic Updates in the Domain Name System (DNS
+ UPDATE)", RFC 2136, April 1997.
+
+ [RFC2181] Elz, R. and R. Bush, "Clarifications to the DNS
+ Specification", RFC 2181, July 1997.
+
+ [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
+ Languages", BCP 18, RFC 2277, January 1998.
+
+ [RFC2671] Vixie, P., "Extension Mechanisms for DNS (EDNS0)",
+ RFC 2671, August 1999.
+
+ [RFC2782] Gulbrandsen, A., Vixie, P., and L. Esibov, "A DNS RR for
+ specifying the location of services (DNS SRV)",
+ RFC 2782, February 2000.
+
+ [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
+ Internationalized Strings ("stringprep")", RFC 3454,
+ December 2002.
+
+ [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
+ Profile for Internationalized Domain Names (IDN)",
+ RFC 3491, March 2003.
+
+ [RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint
+ Engineering Team (JET) Guidelines for Internationalized
+ Domain Names (IDN) Registration and Administration for
+ Chinese, Japanese, and Korean", RFC 3743, April 2004.
+
+ [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource
+ Identifiers (IRIs)", RFC 3987, January 2005.
+
+ [RFC4290] Klensin, J., "Suggested Practices for Registration of
+ Internationalized Domain Names (IDN)", RFC 4290,
+ December 2005.
+
+ [RFC4343] Eastlake, D., "Domain Name System (DNS) Case
+ Insensitivity Clarification", RFC 4343, January 2006.
+
+ [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review
+ and Recommendations for Internationalized Domain Names
+ (IDNs)", RFC 4690, September 2006.
+
+ [RFC4713] Lee, X., Mao, W., Chen, E., Hsu, N., and J. Klensin,
+ "Registration and Administration Recommendations for
+ Chinese Domain Names", RFC 4713, October 2006.
+
+ [Unicode-UAX31]
+ The Unicode Consortium, "Unicode Standard Annex #31:
+ Unicode Identifier and Pattern Syntax, Revision 11",
+ September 2009,
+ <http://www.unicode.org/reports/tr31/tr31-11.html>.
+
+ [Unicode-UTS39]
+ The Unicode Consortium, "Unicode Technical Standard #39:
+ Unicode Security Mechanisms, Revision 2", August 2006,
+ <http://www.unicode.org/reports/tr39/tr39-2.html>.
+
+Author's Address
+
+ John C Klensin
+ 1770 Massachusetts Ave, Ste 322
+ Cambridge, MA 02140
+ USA
+
+ Phone: +1 617 245 1457
+ EMail: john+ietf@jck.com
+